Configure the Fabric

You have the switches in the rack. They're cabled to servers and to each other. They're booted to factory defaults. This page is what you do next — the config that turns a pile of switches into a lossless RoCE v2 fabric.

The four phases:

  1. Management & underlay: get the switches reachable and talking to each other (BGP).
  2. QoS classification: make sure RoCE v2 traffic lands in the right queue.
  3. PFC + ECN: turn on lossless behavior on the RoCE priority, with the PFC watchdog as a safety net.
  4. Buffer profile: split the shared buffer between lossless and lossy traffic, and tune it only when needed.

Vendor-specific syntax differs; the concepts are universal. This page shows four switch stacks side by side — Arista EOS, Cisco NX-OS (Nexus 9000), Juniper Junos (QFX/PTX), and NVIDIA Spectrum (Cumulus / NVUE). Pick your vendor in the tabs and it stays selected through every phase. The CLI changes; the lossless RoCE v2 recipe — classify on DSCP 26, PFC on priority 3, ECN before PFC — does not.

After this page, you'll be able to
  1. Build the BGP underlay — bring up eBGP from every leaf to all 4 spines, set maximum-paths 4 ecmp 4, and verify ECMP routes with show ip bgp summary / show ip route bgp.
  2. Classify RoCE v2 into the lossless class by mapping DSCP 26 → traffic-class 3 (the de-facto standard), then confirm traffic lands on TC3 on both EOS and NVUE (nv set qos roce).
  3. Configure PFC + ECN on TC3 — enable priority 3 no-drop, set ECN WRED thresholds (min 100 KB, max 1.5 MB, 10% probability), and arm the PFC watchdog with a 100 ms timeout against fabric-wide deadlock.
  4. Validate the fabric end-to-end — size the buffer profile (40–60% lossless pool), walk the build checklist, and capture an idle-and-loaded counter baseline before going live.

Phase 1: Management & BGP underlay

Get each switch a management IP and bring up eBGP between every leaf and every spine.

The lab recording shows the whole leaf-side BGP underlay coming up on the rockynet lab simulator: verify the uplinks are UP via LLDP, configure router bgp 65001 with both spine peers, activate the address family, then run show bgp summary (both sessions Estab) and show ip route bgp (rail prefixes learned via both spines, ECMP-ready). The lab uses two spines; the reference config below peers with four.

hostname leaf-01
interface Management1
   ip address 10.0.0.11/24
!
router bgp 65001
   router-id 10.0.0.11
   ! One peer group for all spines, with BFD for fast failure detection
   neighbor SPINES peer-group
   neighbor SPINES remote-as 65000
   neighbor SPINES bfd
   neighbor 10.10.1.1 peer-group SPINES
   neighbor 10.10.2.1 peer-group SPINES
   neighbor 10.10.3.1 peer-group SPINES
   neighbor 10.10.4.1 peer-group SPINES
   ! Install up to 4 equal-cost paths, one per spine
   maximum-paths 4 ecmp 4
   address-family ipv4
      neighbor SPINES activate
      redistribute connected

Verify:

show ip bgp summary
show ip route bgp

Look for all 4 spines as established neighbors, and ECMP routes for every leaf prefix. If ECMP isn't visible, the maximum-paths knob is wrong.
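The check above can be scripted once you have the session states and next hops in hand. The sketch below is only an illustration: the input dictionaries are a hypothetical format (how you collect them, by CLI scrape or an API, is up to you), and the spine addresses are the ones from the example config.

```python
# Spine peer addresses taken from the example leaf config above.
EXPECTED_SPINES = ["10.10.1.1", "10.10.2.1", "10.10.3.1", "10.10.4.1"]


def missing_spines(sessions, expected=EXPECTED_SPINES):
    """Return spine IPs whose BGP session is absent or not Established.

    `sessions` maps neighbor IP -> state string (hypothetical format).
    """
    return [
        ip for ip in expected
        if not sessions.get(ip, "").lower().startswith("estab")
    ]


def ecmp_ok(next_hops, expected=EXPECTED_SPINES):
    """True when a prefix is installed with every expected spine as a next hop."""
    return set(expected) <= set(next_hops)


# Made-up sample data: one session is still in Active state.
sessions = {
    "10.10.1.1": "Estab",
    "10.10.2.1": "Estab",
    "10.10.3.1": "Active",
    "10.10.4.1": "Estab",
}
print(missing_spines(sessions))            # ['10.10.3.1']
print(ecmp_ok(["10.10.1.1", "10.10.2.1"]))  # False
```

Running a check like this per leaf, and per learned prefix, turns "look for all 4 spines" into a pass/fail result you can keep with the build record.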


Phase 2: QoS classification

Classify RoCE v2 packets (UDP port 4791) into the lossless priority class. Convention: DSCP 26 → traffic class 3.

! Map DSCP 26 to traffic class 3, and TC3 to CoS 3
qos map dscp 26 to traffic-class 3
qos map traffic-class 3 to cos 3
!
class-map type qos match-any roce
   match dscp 26
!
policy-map type qos roce-classify
   class roce
      set traffic-class 3
!
interface Ethernet1-32
   service-policy type qos input roce-classify
   service-policy type qos output roce-classify

Verify:

show qos interface Ethernet1/1
show qos counters

You should see traffic landing on TC3 when a RoCE-aware sender hits the port.
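When you verify classification with a packet capture, it helps to know what DSCP 26 looks like on the wire. DSCP occupies the top six bits of the IPv4 ToS (IPv6 Traffic Class) byte and ECN the bottom two, so the arithmetic can be sketched as follows. The function names are illustrative, not part of any tool.

```python
ROCE_DSCP = 26        # the convention used on this page
ROCE_UDP_PORT = 4791  # RoCE v2 destination port


def tos_byte(dscp, ecn=0):
    """Build the ToS / Traffic Class byte: DSCP in bits 7..2, ECN in bits 1..0."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must fit in 6 bits")
    if not 0 <= ecn <= 3:
        raise ValueError("ECN must fit in 2 bits")
    return (dscp << 2) | ecn


def split_tos(tos):
    """Inverse of tos_byte: return (dscp, ecn)."""
    return tos >> 2, tos & 0b11


print(hex(tos_byte(ROCE_DSCP)))        # 0x68  DSCP 26, Not-ECT
print(hex(tos_byte(ROCE_DSCP, 0b10)))  # 0x6a  DSCP 26, ECT(0)
print(hex(tos_byte(ROCE_DSCP, 0b11)))  # 0x6b  DSCP 26, CE (congestion experienced)
print(split_tos(0x6B))                 # (26, 3)
```

In a capture, masking off the ECN bits and comparing against 0x68 (for example a filter along the lines of `udp dst port 4791 and (ip[1] & 0xfc) == 0x68`) should match RoCE v2 packets that carry the expected marking, whatever their ECN state.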


Phase 3: PFC + ECN on the RoCE priority

Now enable PFC pause and ECN WRED marking on TC3.

Every vendor expresses the same three things: PFC no-drop on priority 3, an ECN WRED band (mark CE between ~100 KB and ~1.5 MB of queue depth, up to 10% probability), and a PFC watchdog (~100 ms) that breaks a stuck pause before it deadlocks the fabric.

interface Ethernet1-32
   priority-flow-control on
   priority-flow-control priority 3 no-drop
   priority-flow-control mode auto
   ! Watchdog: act on a queue that stays paused for 100 ms
   priority-flow-control watchdog action drop timer 100
!
qos profile lossless
   queue 3
      ! ECN marking band in bytes: min ~100 KB, max ~1.5 MB, up to 10%
      ecn min 102400 max 1536000 probability 0.1
!
interface Ethernet1-32
   service-policy type qos output lossless
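To make the ECN band concrete, here is a minimal sketch of the textbook WRED ramp using the thresholds from the config above. It assumes a linear ramp between min and max and marking of every packet above max; real ASICs may average the queue depth or behave differently above the max threshold, so treat it as a mental model rather than a description of any vendor's implementation.

```python
def ecn_mark_probability(queue_bytes,
                         min_bytes=102_400,
                         max_bytes=1_536_000,
                         max_prob=0.1):
    """Probability that a packet is marked CE at a given queue depth.

    Below min: never mark. Between min and max: ramp linearly up to
    max_prob. At or above max: mark everything (classic RED behavior,
    assumed here).
    """
    if queue_bytes <= min_bytes:
        return 0.0
    if queue_bytes >= max_bytes:
        return 1.0
    return max_prob * (queue_bytes - min_bytes) / (max_bytes - min_bytes)


for depth in (50_000, 819_200, 1_400_000, 2_000_000):
    print(depth, round(ecn_mark_probability(depth), 4))
```

Halfway through the band (819,200 bytes) the marking probability is 5%, half of the configured 10%. Senders start slowing down well before the queue is deep enough to trigger PFC, which is the "ECN before PFC" ordering this page relies on.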

Verify PFC:

show priority-flow-control counters
show priority-flow-control interface Ethernet1/1

Expect:

  • PFC RX and TX counters near zero in steady state (ECN should mostly prevent PFC)
  • Spikes during induced congestion tests (good — PFC is working)

Verify ECN:

show qos queue counters
show interface Ethernet1/1 counters | inc ECN

You should see the ECN-marked counter rising under load.
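The PFC watchdog configured in this phase is easiest to reason about as a timer on continuous pause. The sketch below models that idea over polled samples; the polling interval, the sample format, and the function name are assumptions made for illustration, and actual switch implementations detect and recover in hardware-specific ways.

```python
def watchdog_trip_index(pause_samples, timeout_ms=100, interval_ms=10):
    """Return the index of the sample at which the watchdog would fire.

    `pause_samples` is a sequence of booleans, one per polling interval,
    True when the queue was paused for that whole interval. The timer
    resets whenever the queue is seen unpaused. Returns None if the
    queue never stays paused for `timeout_ms`.
    """
    paused_ms = 0
    for i, paused in enumerate(pause_samples):
        paused_ms = paused_ms + interval_ms if paused else 0
        if paused_ms >= timeout_ms:
            return i
    return None


# Pause is interrupted after 90 ms: the watchdog does not fire.
print(watchdog_trip_index([True] * 9 + [False] + [True] * 5))  # None
# 100 ms of uninterrupted pause: the watchdog fires on the 10th sample.
print(watchdog_trip_index([True] * 10))                        # 9
```

The point of the model is the reset: normal PFC produces short, interrupted pauses that never reach the timeout, while a stuck pause runs uninterrupted and trips the watchdog.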


Phase 4: Buffer profile

Decide how much of the switch's shared buffer goes to lossless vs lossy traffic.

Default profiles (vendor-supplied "AI" profiles) work for most clusters:

  • NVIDIA Spectrum-X: nv set qos buffer-profile ai-roce — enables the bundled profile
  • Arista 7060X: qos profile lossless — references the built-in lossless template
  • Cisco Nexus 9000: tune the no-drop pool under policy-map type network-qos (queue-limit / dynamic threshold) — there is no one-line "AI profile"
  • Juniper QFX/PTX: size the lossless buffer with shared-buffer + buffer-size under class-of-service, plus per-class buffer-size percent

For custom tuning at scale:

Setting | Typical | Effect
Lossless pool share | 40-60% of total buffer | More for RoCE; less for lossy
Headroom per port (per priority 3) | Auto-detected from cable length | Absorbs in-flight bytes after PAUSE
Min guarantee per port | ~50 KB | Prevents one port from starving others
Dynamic threshold (alpha) | 8 or 16 | How aggressively shared buffer reallocates

Don't tune these on day one — they're for production debugging when AllReduce time variance is high.
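For intuition only (not a tuning recipe), this is why headroom follows cable length: after a switch sends PAUSE, the sender keeps transmitting until the frame arrives, and everything already on the wire still has to land somewhere. A rough estimate, assuming signal propagation of about 2×10^8 m/s and allowing one maximum-size frame in each direction, can be sketched like this. It ignores PFC response time in the NIC and switch pipeline, so real vendor numbers will be larger.

```python
def pfc_headroom_bytes(cable_m, link_gbps, mtu=9216, prop_m_per_s=2.0e8):
    """Rough per-port, per-priority headroom estimate in bytes.

    round-trip time on the cable * link rate = bytes still in flight
    after PAUSE is sent, plus one MTU-sized frame per direction that
    may already be mid-transmission. Simplified model, see lead-in.
    """
    rtt_s = 2 * cable_m / prop_m_per_s
    in_flight = rtt_s * link_gbps * 1e9 / 8
    return round(in_flight + 2 * mtu)


for length_m in (3, 30, 100):
    print(length_m, "m @ 400G ->", pfc_headroom_bytes(length_m, 400), "bytes")
```

With these assumptions a 100 m link at 400 Gb/s needs roughly 68 KB per lossless priority, and most of that is cable round-trip time. Multiply by port count and the cost of long runs in the lossless pool becomes visible.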


The build checklist

After all four phases, verify the fabric end-to-end:

✓ All leaf-spine BGP sessions established
✓ ECMP routes visible (every leaf can reach every other via 4 spines)
✓ RoCE v2 traffic classifies to TC3 (verified with a generator)
✓ PFC priority 3 enabled on all ports
✓ ECN WRED on TC3 with sensible thresholds
✓ PFC watchdog enabled with 100 ms timeout
✓ Buffer profile loaded
✓ Counters baseline captured (so you know what "normal" looks like)

Skip the last step at your peril. Without a baseline, you'll spend hours debugging "is this counter normal?" the first time something goes wrong.
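A baseline is only useful if you can compare against it later. A minimal sketch of that comparison follows; the counter names and values are hypothetical, and how you collect them from the switch is left open.

```python
def counter_deltas(baseline, current):
    """Per-counter growth since the baseline (counters missing now count as 0)."""
    return {name: current.get(name, 0) - value
            for name, value in baseline.items()}


def grew_more_than(baseline, current, tolerance=0):
    """Counters that increased by more than `tolerance` since the baseline."""
    return {name: delta
            for name, delta in counter_deltas(baseline, current).items()
            if delta > tolerance}


# Hypothetical counter names and values for one port.
baseline = {"pfc_rx_tc3": 12, "pfc_tx_tc3": 3, "ecn_marked_tc3": 1500}
current = {"pfc_rx_tc3": 12, "pfc_tx_tc3": 9450, "ecn_marked_tc3": 1620}

print(grew_more_than(baseline, current, tolerance=1000))
# {'pfc_tx_tc3': 9447}
```

Taking one snapshot idle and one under load gives you two reference points, so a later reading can be judged against the right one.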


Vendor-specific gotchas

A short list of "things that bit me":

  • Arista MLAG conflicts with BGP unnumbered — use one or the other.
  • NVIDIA Spectrum-X has two QoS engines (Mellanox and standard). Use the Mellanox/roce one for RoCE — the other doesn't have PFC.
  • Tomahawk-based white-box (Edgecore, Celestica) buffer profiles are very different from default Arista — read the SAI / SONiC docs carefully.
  • Cisco Nexus PFC requires priority-flow-control mode on. The auto mode negotiates PFC over DCBX/LLDP and behaves inconsistently across NIC firmware and NX-OS releases. PFC + ECN also live in separate policy types (network-qos vs queuing); configuring one and forgetting the other is the #1 NX-OS RoCE bug.
  • Cisco queue-class name depends on the queuing model — modern 9300-FX/GX default to the 8-queue model, where the class is c-out-8q-q3 (the legacy 4-queue name c-out-q3 is rejected). qos-group N maps to queue N implicitly — there's no explicit map command. random-detect … ecn needs drop-probability/weight to be accepted, and a type network-qos policy is system-global only (system qos, never per-interface; max 2 no-drop classes). Jumbo mtu 9216 on the no-drop class must match end to end.
  • Juniper ECMP needs a forwarding-table load-balance per-packet export policy — BGP multipath alone selects paths but installs only one next hop. "Per-packet" is a legacy name; it actually hashes per-flow. If your spines use different AS numbers, you also need multipath multiple-as.
  • Juniper PFC binds to the 802.1p code-point, not the DSCP — make sure your classifier's forwarding-class and the congestion-notification-profile code-point both resolve to priority 3, or PFC pauses the wrong queue. (Newer QFX5220/5240 and PTX also support DSCP-based PFC for RoCEv2 — check your platform.)
  • Juniper CoS examples show one interface (et-0/0/0) — apply the classifier, PFC profile, and scheduler-map to every fabric-facing interface. Use an interfaces interface-range or apply-groups rather than the per-interface lines shown, so you don't drift between ports.
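The Juniper ECMP note above says that "per-packet" load balancing actually hashes per flow. The idea can be sketched in a few lines: hash the 5-tuple and use the result to pick one of the equal-cost next hops, so every packet of a flow takes the same path. CRC32 and the field layout here are illustrative choices, not the hash any particular ASIC uses.

```python
import zlib


def pick_next_hop(src_ip, dst_ip, src_port, dst_port, proto, next_hops):
    """Choose a next hop from the flow's 5-tuple (illustrative hash only)."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return next_hops[zlib.crc32(key) % len(next_hops)]


spines = ["10.10.1.1", "10.10.2.1", "10.10.3.1", "10.10.4.1"]

# Same flow, same answer every time: no reordering within a flow.
flow = ("10.1.0.5", "10.2.0.9", 49152, 4791, "udp")
assert pick_next_hop(*flow, spines) == pick_next_hop(*flow, spines)

# Different UDP source ports can land on different spines.
used = {pick_next_hop("10.1.0.5", "10.2.0.9", sport, 4791, "udp", spines)
        for sport in range(49152, 49152 + 64)}
print(sorted(used))
```

Because the RoCE v2 destination port is fixed at 4791, the variation between flows from the same pair of hosts comes mainly from the UDP source port, which is why a small number of large flows can still load spines unevenly.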

These are reference configs: version- and platform-sensitive

Queue-class names (c-out-8q-q3 vs c-out-q3), WRED/ECN keywords, and PFC syntax vary by Nexus 9000 model (9300-FX/GX vs 9500) and NX-OS release; Junos CoS for RoCEv2 differs across QFX5120 / 5220 / 5240 / PTX and not every platform supports DCQCN. Validate against your switch's current guide before applying:

  • Cisco: RoCE Storage Implementation over NX-OS (end-to-end lossless reference config) + the Nexus 9000 NX-OS QoS Configuration Guide (Priority Flow Control / Network QoS chapters).
  • Juniper: Class of Service User Guide → "CoS for RoCEv2" / "Configuring DCQCN" (QFX/PTX), on the Juniper TechLibrary.
  • Arista: the EOS RoCE / AI Center deployment guide.
  • NVIDIA: the Spectrum-X / NVUE RoCE reference.

💡 What you should remember

# | Concept | Why it matters
1 | 4️⃣ Four config phases: | underlay → classification → PFC+ECN → buffer profile.
2 | 🏷️ DSCP 26 → traffic-class 3 | is the de-facto standard for RoCE v2 (use it; don't invent your own).
3 | 🚫 PFC watchdog with 100 ms timeout | prevents fabric-wide deadlock from runaway PFC storms.
4 | 📦 ECN WRED thresholds depend on switch buffer size. | Reference profiles work; custom tuning is a production-debugging activity.
5 | 📊 Capture a baseline before going live. | Counters at steady state, no traffic, then again under load.

Next: Configure the Hosts + Kubernetes → — BIOS, kernel, drivers, Operators, Multus, and the pod spec template.