Configure the Fabric
You have the switches in the rack. They're cabled to servers and to each other. They're booted to factory defaults. This page is what you do next — the config that turns a pile of switches into a lossless RoCE v2 fabric.
The four phases:
- Management & underlay — get the switches reachable and talking to each other (BGP).
- QoS classification — make sure RoCE v2 traffic lands in the right queue.
- PFC + ECN — turn on lossless behavior on the RoCE priority.
- PFC watchdog + buffer profile — safety nets and tuning.
Vendor-specific syntax differs; the concepts are universal. This page shows four switch stacks side by side — Arista EOS, Cisco NX-OS (Nexus 9000), Juniper Junos (QFX/PTX), and NVIDIA Spectrum (Cumulus / NVUE). Pick your vendor in the tabs and it stays selected through every phase. The CLI changes; the lossless RoCE v2 recipe — classify on DSCP 26, PFC on priority 3, ECN before PFC — does not.
- Build the BGP underlay: bring up eBGP from every leaf to all 4 spines, set maximum-paths 4 ecmp 4, and verify ECMP routes with show ip bgp summary / show ip route bgp.
- Classify RoCE v2 into the lossless class: map DSCP 26 → traffic-class 3 (the de-facto standard) and confirm traffic lands on TC3 in both EOS and nv set qos roce.
- Configure PFC + ECN on TC3: enable priority 3 no-drop, set ECN WRED thresholds (min 100 KB, max 1.5 MB, 10% probability), and arm the PFC watchdog with a 100 ms timeout against fabric-wide deadlock.
- Validate the fabric end-to-end: size the buffer profile (40–60% lossless pool), walk the build checklist, and capture an idle-and-loaded counter baseline before going live.
Phase 1: Management & BGP underlay
Get each switch a management IP and bring up eBGP between every leaf and every spine.
The rockynet lab simulator walks through the whole leaf-side bring-up in order: verify the uplinks are UP via LLDP, configure router bgp 65001 with the spine peers, activate the address family, then check that show bgp summary shows every session Estab and that show ip route bgp shows the rail prefixes learned via every spine (ECMP-ready). The lab run shows two spine sessions; the configs below use four.
- 1. Arista EOS
- 2. Cisco NX-OS
- 3. Juniper Junos
- 4. NVIDIA Spectrum
hostname leaf-01
interface Management1
ip address 10.0.0.11/24
!
router bgp 65001
router-id 10.0.0.11
neighbor SPINES peer-group
neighbor SPINES remote-as 65000
neighbor SPINES bfd
neighbor 10.10.1.1 peer-group SPINES
neighbor 10.10.2.1 peer-group SPINES
neighbor 10.10.3.1 peer-group SPINES
neighbor 10.10.4.1 peer-group SPINES
maximum-paths 4 ecmp 4
address-family ipv4
neighbor SPINES activate
redistribute connected
hostname leaf-01
feature bgp
feature bfd
interface mgmt0
ip address 10.0.0.11/24
!
router bgp 65001
router-id 10.0.0.11
address-family ipv4 unicast
maximum-paths 4
template peer SPINES
remote-as 65000
bfd
address-family ipv4 unicast
neighbor 10.10.1.1
inherit peer SPINES
neighbor 10.10.2.1
inherit peer SPINES
neighbor 10.10.3.1
inherit peer SPINES
neighbor 10.10.4.1
inherit peer SPINES
On NX-OS, maximum-paths lives under the address-family, and the peer template is the equivalent of an EOS peer-group.
set system host-name leaf-01
set interfaces em0 unit 0 family inet address 10.0.0.11/24
set routing-options autonomous-system 65001
set routing-options router-id 10.0.0.11
set protocols bgp group SPINES type external
set protocols bgp group SPINES peer-as 65000
set protocols bgp group SPINES bfd-liveness-detection minimum-interval 300
set protocols bgp group SPINES multipath
set protocols bgp group SPINES neighbor 10.10.1.1
set protocols bgp group SPINES neighbor 10.10.2.1
set protocols bgp group SPINES neighbor 10.10.3.1
set protocols bgp group SPINES neighbor 10.10.4.1
# ECMP needs a forwarding-table load-balance policy, or only one next hop installs:
set policy-options policy-statement ECMP then load-balance per-packet
set routing-options forwarding-table export ECMP
The Junos gotcha that bites everyone: multipath lets BGP select multiple paths, but they only get installed in the forwarding table when you export a load-balance per-packet policy. Miss it and you get one path despite four sessions.
nv set router bgp autonomous-system 65001
nv set router bgp router-id 10.0.0.11
nv set vrf default router bgp neighbor 10.10.1.1 remote-as 65000
nv set vrf default router bgp neighbor 10.10.2.1 remote-as 65000
nv set vrf default router bgp neighbor 10.10.3.1 remote-as 65000
nv set vrf default router bgp neighbor 10.10.4.1 remote-as 65000
nv set vrf default router bgp address-family ipv4-unicast multipaths ibgp 4
nv set vrf default router bgp address-family ipv4-unicast multipaths ebgp 4
nv config apply
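Verify (EOS and NX-OS syntax shown; on Junos use show bgp summary and show route protocol bgp; on Cumulus the same show commands run inside vtysh):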
show ip bgp summary
show ip route bgp
Look for all 4 spines as established neighbors, and ECMP routes for every leaf prefix. If ECMP isn't visible, the maximum-paths knob is wrong.
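Keeping the neighbor list and the maximum-paths value in sync by hand gets error-prone as spines are added. A minimal Python sketch, assuming the EOS stanza shown earlier in this phase is the template you want; the function name render_eos_leaf_bgp, its parameters, and the hard-coded /24 management mask are illustrative assumptions, not part of any Arista tooling:

```python
def render_eos_leaf_bgp(hostname, mgmt_ip, local_as, spine_as, spine_ips):
    """Render the EOS leaf BGP stanza from this page for a given spine list."""
    paths = len(spine_ips)  # one ECMP path per spine
    lines = [
        f"hostname {hostname}",
        "interface Management1",
        f"   ip address {mgmt_ip}/24",  # assumption: /24 management subnet
        "!",
        f"router bgp {local_as}",
        f"   router-id {mgmt_ip}",
        "   neighbor SPINES peer-group",
        f"   neighbor SPINES remote-as {spine_as}",
        "   neighbor SPINES bfd",
    ]
    # One neighbor statement per spine, all inheriting the peer-group.
    lines += [f"   neighbor {ip} peer-group SPINES" for ip in spine_ips]
    lines += [
        f"   maximum-paths {paths} ecmp {paths}",
        "   address-family ipv4",
        "      neighbor SPINES activate",
        "      redistribute connected",
    ]
    return "\n".join(lines)


SPINE_IPS = ["10.10.1.1", "10.10.2.1", "10.10.3.1", "10.10.4.1"]
config = render_eos_leaf_bgp("leaf-01", "10.0.0.11", 65001, 65000, SPINE_IPS)
print(config)
```

Deriving maximum-paths from the length of the spine list means the two can never disagree, which is the failure mode the verification step above is looking for.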
Phase 2: QoS classification
Classify RoCE v2 packets (UDP port 4791) into the lossless priority class. Convention: DSCP 26 → traffic class 3.
- 1. Arista EOS
- 2. Cisco NX-OS
- 3. Juniper Junos
- 4. NVIDIA Spectrum
qos map dscp 26 to traffic-class 3
qos map traffic-class 3 to cos 3
!
class-map type qos match-any roce
match dscp 26
!
policy-map type qos roce-classify
class roce
set traffic-class 3
!
interface Ethernet1-32
service-policy type qos input roce-classify
service-policy type qos output roce-classify
class-map type qos match-any ROCE
match dscp 26
!
policy-map type qos ROCE-CLASSIFY
class ROCE
set qos-group 3
!
interface Ethernet1/1-32
service-policy type qos input ROCE-CLASSIFY
NX-OS classifies into an internal qos-group (not a traffic-class), and the no-drop / ECN behavior is attached to that group later in a type network-qos policy. The DSCP-to-group match is the same idea as EOS.
# DSCP 26 = code-point 011010
set class-of-service classifiers dscp ROCE forwarding-class no-loss loss-priority low code-points 011010
set class-of-service forwarding-classes class no-loss queue-num 3
set class-of-service interfaces et-0/0/0 unit 0 classifiers dscp ROCE
Junos maps the DSCP code-point to a forwarding-class bound to queue 3, then applies the classifier to every fabric-facing interface. The no-loss forwarding-class is what PFC and ECN hang off of in the next phase.
nv set qos roce mode lossless
nv set qos roce congestion-control ecn
nv set qos roce traffic-class 3
nv set qos roce dscp 26
nv config apply
Spectrum's nv set qos roce is a shortcut that configures the whole pipeline (DSCP → TC, PFC, ECN) with sensible defaults. The other three vendors give you finer-grained control but more knobs to remember.
Verify:
show qos interface Ethernet1/1
show qos counters
You should see traffic landing on TC3 when a RoCE-aware sender hits the port.
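The same DSCP value shows up in three forms across the vendor configs: decimal 26, the six-bit code-point 011010 used by Junos, and the priority 3 that PFC acts on. A small Python sketch of how they relate (the helper names are made up for illustration):

```python
ROCE_DSCP = 26  # the convention used on this page


def dscp_to_codepoint(dscp: int) -> str:
    """Six-bit binary form of a DSCP value, as Junos classifiers expect."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP is a 6-bit field (0-63)")
    return format(dscp, "06b")


def dscp_to_tos_byte(dscp: int) -> int:
    """DSCP occupies the upper six bits of the IP ToS / Traffic Class byte."""
    return dscp << 2


def dscp_class_selector(dscp: int) -> int:
    """Upper three DSCP bits; for DSCP 26 this is the familiar priority 3."""
    return dscp >> 3


print(dscp_to_codepoint(ROCE_DSCP))      # 011010
print(hex(dscp_to_tos_byte(ROCE_DSCP)))  # 0x68
print(dscp_class_selector(ROCE_DSCP))    # 3
```

The 0x68 form is handy when reading packet captures, where the whole ToS byte is usually displayed rather than the DSCP alone.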
Phase 3: PFC + ECN on the RoCE priority
Now enable PFC pause and ECN WRED marking on TC3.
Every vendor expresses the same three things: PFC no-drop on priority 3, an ECN WRED band (mark CE between ~100 KB and ~1.5 MB of queue depth, up to 10% probability), and a PFC watchdog (~100 ms) that breaks a stuck pause before it deadlocks the fabric.
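To make the ECN WRED band concrete, here is a rough Python model of the marking curve. It assumes textbook RED behaviour (no marking below the minimum, a linear ramp up to the configured probability, and every packet marked once the queue reaches the maximum) and that KB means 1024 bytes, which matches the 102400 / 1536000 byte values in the EOS and NVUE configs. Real ASICs may average queue depth or quantize thresholds differently.

```python
KIB = 1024
ECN_MIN_BYTES = 100 * KIB    # 102400
ECN_MAX_BYTES = 1500 * KIB   # 1536000
ECN_MAX_PROB = 0.10          # 10% at the top of the band


def ecn_mark_probability(queue_bytes,
                         min_bytes=ECN_MIN_BYTES,
                         max_bytes=ECN_MAX_BYTES,
                         max_prob=ECN_MAX_PROB):
    """Probability that a packet is CE-marked at a given queue depth."""
    if queue_bytes < min_bytes:
        return 0.0                      # queue is shallow: never mark
    if queue_bytes >= max_bytes:
        return 1.0                      # band exceeded: mark everything
    fraction = (queue_bytes - min_bytes) / (max_bytes - min_bytes)
    return max_prob * fraction          # linear ramp inside the band


for depth in (50 * KIB, 100 * KIB, 800 * KIB, 1500 * KIB):
    print(depth, ecn_mark_probability(depth))
```

The point of the band is ordering: senders see CE marks and slow down while the queue is still inside it, so PFC pause only fires if ECN fails to contain the burst.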
- 1. Arista EOS
- 2. Cisco NX-OS
- 3. Juniper Junos
- 4. NVIDIA Spectrum
interface Ethernet1-32
priority-flow-control on
priority-flow-control priority 3 no-drop
priority-flow-control mode auto
priority-flow-control watchdog action drop timer 100
!
qos profile lossless
queue 3
ecn min 102400 max 1536000 probability 0.1
!
interface Ethernet1-32
service-policy type qos output lossless
class-map type network-qos ROCE-NQ
match qos-group 3
!
policy-map type network-qos ROCE-NQ-POLICY
class type network-qos ROCE-NQ
pause pfc-cos 3
mtu 9216
!
policy-map type queuing ROCE-OUT
class type queuing c-out-8q-q3
random-detect minimum-threshold 100 kbytes maximum-threshold 1500 kbytes drop-probability 7 weight 0 ecn
!
system qos
service-policy type network-qos ROCE-NQ-POLICY
!
interface Ethernet1/1-32
priority-flow-control mode on
service-policy type queuing output ROCE-OUT
The NX-OS trap: use priority-flow-control mode on, not auto, because auto negotiates inconsistently across firmware. PFC no-drop is set in the network-qos policy (pause pfc-cos 3); ECN is the ecn keyword on the egress random-detect.
# PFC on the no-loss class (802.1p code-point 011 = priority 3)
set class-of-service congestion-notification-profile ROCE-PFC input ieee-802.1 code-point 011 pfc
set class-of-service interfaces et-0/0/0 congestion-notification-profile ROCE-PFC
# ECN WRED band on the no-loss scheduler
set class-of-service drop-profiles ROCE-ECN interpolate fill-level [ 20 80 ] drop-probability [ 0 10 ]
set class-of-service schedulers ROCE-SCHED drop-profile-map loss-priority low protocol any drop-profile ROCE-ECN
set class-of-service schedulers ROCE-SCHED explicit-congestion-notification
set class-of-service scheduler-maps ROCE-MAP forwarding-class no-loss scheduler ROCE-SCHED
set class-of-service interfaces et-0/0/0 scheduler-map ROCE-MAP
Junos turns ECN on with explicit-congestion-notification on the scheduler, and the fill-level/drop-probability interpolation is the WRED band. PFC is a congestion-notification-profile applied to the interface. The watchdog equivalent is the PFC hold-time / watchdog knob (release-dependent).
nv set qos roce pfc priority 3
nv set qos roce pfc mode lossless
nv set qos roce pfc watchdog timer 100
nv set qos roce ecn threshold-min 102400
nv set qos roce ecn threshold-max 1536000
nv set qos roce ecn probability 10
nv config apply
Same conceptual config, fewest commands. NVUE has the most AI-fabric defaults baked in.
Verify PFC:
show priority-flow-control counters
show priority-flow-control interface Ethernet1/1
Expect:
- PFC RX and TX counters near zero in steady state (ECN should mostly prevent PFC)
- Spikes during induced congestion tests (good: PFC is working)
Verify ECN:
show qos queue counters
show interface Ethernet1/1 counters | inc ECN
You should see the ECN-marked counter rising under load.
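The PFC watchdog configured in this phase can be pictured as a timer on each lossless queue. The Python sketch below is a simplified model, not any vendor's implementation: it assumes the watchdog trips once a queue has been continuously paused for the full timeout, and it takes hypothetical (timestamp_ms, paused) samples as input.

```python
def watchdog_trip_time(samples, timeout_ms=100):
    """Return the timestamp (ms) at which the watchdog would fire, or None.

    samples: iterable of (timestamp_ms, paused) tuples in time order.
    """
    paused_since = None
    for t_ms, paused in samples:
        if not paused:
            paused_since = None          # pause released: reset the timer
            continue
        if paused_since is None:
            paused_since = t_ms          # start of a new pause interval
        if t_ms - paused_since >= timeout_ms:
            return t_ms                  # stuck for the full timeout
    return None


# A short pause that clears, followed by a pause that never releases.
storm = [(0, True), (50, True), (90, False),
         (100, True), (150, True), (200, True)]
print(watchdog_trip_time(storm))
```

In this trace the first pause clears at 90 ms and resets the timer, so the watchdog only trips on the second pause, 100 ms after it began.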
Phase 4: Buffer profile
Decide how much of the switch's shared buffer goes to lossless vs lossy traffic.
Default profiles (vendor-supplied "AI" profiles) work for most clusters:
- NVIDIA Spectrum-X: nv set qos buffer-profile ai-roce (enables the bundled profile)
- Arista 7060X: qos profile lossless (references the built-in lossless template)
- Cisco Nexus 9000: tune the no-drop pool under policy-map type network-qos (queue-limit / dynamic threshold); there is no one-line "AI profile"
- Juniper QFX/PTX: size the lossless buffer with shared-buffer + buffer-size under class-of-service, plus per-class buffer-size percent
For custom tuning at scale:
| Setting | Typical | Effect |
|---|---|---|
| Lossless pool share | 40-60% of total buffer | More for RoCE; less for lossy |
| Headroom per port (per priority 3) | Auto-detected from cable length | Absorbs in-flight bytes after PAUSE |
| Min guarantee per port | ~50 KB | Prevents one port from starving others |
| Dynamic threshold (alpha) | 8 or 16 | How aggressively shared buffer reallocates |
Don't tune these on day one — they're for production debugging when AllReduce time variance is high.
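For a feel of the numbers in the table, here is a back-of-the-envelope Python sketch. It is a simplified model under stated assumptions, not a vendor formula: roughly 5 ns per metre of cable propagation, one maximum-size frame already in flight in each direction when PAUSE is sent, and an optional peer response time. The 400 Gb/s link, 100 m cable, and 64 MiB buffer are hypothetical example inputs.

```python
def pfc_headroom_bytes(rate_gbps, cable_m, mtu_bytes=9216,
                       ns_per_m=5, response_ns=0):
    """Approximate bytes that can still arrive after PAUSE is sent."""
    round_trip_ns = 2 * cable_m * ns_per_m + response_ns
    in_flight = rate_gbps * round_trip_ns // 8   # Gb/s x ns = bits
    return in_flight + 2 * mtu_bytes             # plus one frame each way


def lossless_pool_bytes(total_buffer_bytes, share_percent):
    """Split the shared buffer: share_percent goes to the lossless pool."""
    return total_buffer_bytes * share_percent // 100


print(pfc_headroom_bytes(rate_gbps=400, cable_m=100))
print(lossless_pool_bytes(64 * 1024 * 1024, 50))
```

Headroom scales with both link speed and cable length, which is why the table lists it as auto-detected per port rather than as a single fabric-wide value.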
The build checklist
After all four phases, verify the fabric end-to-end:
✓ All leaf-spine BGP sessions established
✓ ECMP routes visible (every leaf can reach every other via 4 spines)
✓ RoCE v2 traffic classifies to TC3 (verified with a generator)
✓ PFC priority 3 enabled on all ports
✓ ECN WRED on TC3 with sensible thresholds
✓ PFC watchdog enabled with 100 ms timeout
✓ Buffer profile loaded
✓ Counters baseline captured (so you know what "normal" looks like)
Skip the last step at your peril. Without a baseline, you'll spend hours debugging "is this counter normal?" the first time something goes wrong.
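One way to make the baseline usable later is to store counter rates rather than raw values, then compare new readings against them. A minimal Python sketch; the counter names (pfc_rx_tc3, ecn_marked_tc3) and the 2x threshold are hypothetical, and collecting the snapshots from the switch is left out:

```python
def counter_rates(before, after, seconds):
    """Per-second rate for each cumulative counter between two snapshots."""
    return {name: (after[name] - before[name]) / seconds for name in before}


def deviations(baseline_rates, current_rates, factor=2.0):
    """Counters whose current rate exceeds factor x the baseline rate.

    A counter that was flat in the baseline is flagged on any growth.
    """
    flagged = {}
    for name, base in baseline_rates.items():
        cur = current_rates.get(name, 0.0)
        if cur > base * factor:
            flagged[name] = (base, cur)
    return flagged


# Baseline: two snapshots taken 60 s apart under normal load.
snap_a = {"pfc_rx_tc3": 0, "ecn_marked_tc3": 100}
snap_b = {"pfc_rx_tc3": 0, "ecn_marked_tc3": 700}
baseline = counter_rates(snap_a, snap_b, seconds=60)

# Later reading: ECN is in the normal range, but PFC has started firing.
current = {"pfc_rx_tc3": 5.0, "ecn_marked_tc3": 12.0}
print(deviations(baseline, current))
```

Capturing one baseline at idle and another under load, as the checklist suggests, gives two reference points for the same comparison.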
Vendor-specific gotchas
A short list of "things that bit me":
- Arista MLAG conflicts with BGP unnumbered — use one or the other.
- NVIDIA Spectrum-X has two QoS engines (Mellanox and standard). Use the Mellanox/roce one for RoCE — the other doesn't have PFC.
- Tomahawk-based white-box (Edgecore, Celestica) buffer profiles are very different from default Arista — read the SAI / SONiC docs carefully.
- Cisco Nexus PFC requires priority-flow-control mode on: auto negotiates PFC over DCBX/LLDP and behaves inconsistently across NIC firmware and NX-OS releases. PFC + ECN also live in separate policy types (network-qos vs queuing); configuring one and forgetting the other is the #1 NX-OS RoCE bug.
- Cisco queue-class name depends on the queuing model: modern 9300-FX/GX default to the 8-queue model, where the class is c-out-8q-q3 (the legacy 4-queue name c-out-q3 is rejected). qos-group N maps to queue N implicitly; there's no explicit map command. random-detect … ecn needs drop-probability / weight to be accepted, and a type network-qos policy is system-global only (system qos, never per-interface; max 2 no-drop classes). Jumbo mtu 9216 on the no-drop class must match end to end.
- Juniper ECMP needs a forwarding-table load-balance per-packet export policy: BGP multipath alone selects paths but installs only one next hop. "Per-packet" is a legacy name; it actually hashes per-flow. If your spines use different AS numbers, you also need multipath multiple-as.
- Juniper PFC binds to the 802.1p code-point, not the DSCP: make sure your classifier's forwarding-class and the congestion-notification-profile code-point both resolve to priority 3, or PFC pauses the wrong queue. (Newer QFX5220/5240 and PTX also support DSCP-based PFC for RoCEv2; check your platform.)
- Juniper CoS examples show one interface (et-0/0/0): apply the classifier, PFC profile, and scheduler-map to every fabric-facing interface. Use an interfaces interface-range or apply-groups rather than the per-interface lines shown, so you don't drift between ports.
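The note that "per-packet" really hashes per-flow is easier to see in code. The Python sketch below is illustrative only: real switches use vendor-specific hardware hash functions, CRC32 is a stand-in, and the host addresses are hypothetical. What it shows is that every packet of one flow maps to the same spine, while flows that differ only in UDP source port can spread across spines.

```python
import zlib

SPINES = ["10.10.1.1", "10.10.2.1", "10.10.3.1", "10.10.4.1"]
ROCE_V2_UDP_PORT = 4791


def pick_next_hop(src_ip, dst_ip, src_port,
                  dst_port=ROCE_V2_UDP_PORT, proto=17, next_hops=SPINES):
    """Choose an ECMP next hop from a hash of the flow's 5-tuple."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    return next_hops[zlib.crc32(key) % len(next_hops)]


# Same flow, same answer every time: packets are not reordered across spines.
first = pick_next_hop("10.20.1.10", "10.20.2.10", 49152)
again = pick_next_hop("10.20.1.10", "10.20.2.10", 49152)
print(first, again)

# Varying only the UDP source port gives the hash something to spread on.
used = {pick_next_hop("10.20.1.10", "10.20.2.10", port)
        for port in range(49152, 49152 + 64)}
print(sorted(used))
```

Because the RoCE v2 destination port is fixed at 4791, the source port is the main field that lets flows between the same pair of hosts land on different spines.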
Queue-class names (c-out-8q-q3 vs c-out-q3), WRED/ECN keywords, and PFC syntax vary by Nexus 9000 model (9300-FX/GX vs 9500) and NX-OS release; Junos CoS for RoCEv2 differs across QFX5120 / 5220 / 5240 / PTX and not every platform supports DCQCN. Validate against your switch's current guide before applying:
- Cisco — RoCE Storage Implementation over NX-OS (end-to-end lossless reference config) + the Nexus 9000 NX-OS QoS Configuration Guide (Priority Flow Control / Network QoS chapters).
- Juniper — Class of Service User Guide → "CoS for RoCEv2" / "Configuring DCQCN" (QFX/PTX), on the Juniper TechLibrary.
- Arista: the EOS RoCE / AI Center deployment guide.
- NVIDIA: the Spectrum-X / NVUE RoCE reference.
💡 What you should remember
| # | | Concept | Why it matters |
|---|---|---|---|
| 1 | 4️⃣ | Four config phases: | underlay → classification → PFC+ECN → buffer profile. |
| 2 | 🏷️ | DSCP 26 → traffic-class 3 | is the de-facto standard for RoCE v2 (use it; don't invent your own). |
| 3 | 🚫 | PFC watchdog with 100 ms timeout | prevents fabric-wide deadlock from runaway PFC storms. |
| 4 | 📦 | ECN WRED thresholds depend on switch buffer size. | Reference profiles work; custom tuning is a production-debugging activity. |
| 5 | 📊 | Capture a baseline before going live. | Counters at steady state, no traffic, then again under load. |
Next: Configure the Hosts + Kubernetes (BIOS, kernel, drivers, Operators, Multus, and the pod spec template).