Configure the Fabric
You have the switches in the rack. They're cabled to servers and to each other. They're booted to factory defaults. This page is what you do next — the config that turns a pile of switches into a lossless RoCE v2 fabric.
The four phases:
- Management & underlay — get the switches reachable and talking to each other (BGP).
- QoS classification — make sure RoCE v2 traffic lands in the right queue.
- PFC + ECN — turn on lossless behavior on the RoCE priority.
- PFC watchdog + buffer profile — safety nets and tuning.
Vendor-specific syntax differs; the concepts are universal. This page shows Arista EOS and NVIDIA Spectrum (Cumulus / NVOS) side by side.
Phase 1: Management & BGP underlay
Get each switch a management IP and bring up eBGP between every leaf and every spine.
Arista EOS (per leaf):
hostname leaf-01
interface Management1
ip address 10.0.0.11/24
!
router bgp 65001
router-id 10.0.0.11
neighbor SPINES peer-group
neighbor SPINES remote-as 65000
neighbor SPINES bfd
neighbor 10.10.1.1 peer-group SPINES
neighbor 10.10.2.1 peer-group SPINES
neighbor 10.10.3.1 peer-group SPINES
neighbor 10.10.4.1 peer-group SPINES
maximum-paths 4 ecmp 4
address-family ipv4
neighbor SPINES activate
redistribute connected
NVIDIA Spectrum (Cumulus Linux 5.x — NVUE):
nv set router bgp autonomous-system 65001
nv set router bgp router-id 10.0.0.11
nv set router bgp neighbor 10.10.1.1 remote-as 65000
nv set router bgp neighbor 10.10.1.1 type external
(repeat the two neighbor lines for 10.10.2.1, 10.10.3.1, and 10.10.4.1)
nv set vrf default router bgp address-family ipv4-unicast multipath ebgp 4
nv config apply
Verify (both vendors):
show ip bgp summary
show ip route bgp
Look for all 4 spines as established neighbors, and ECMP routes for every leaf prefix. If ECMP isn't visible, the maximum-paths knob is wrong.
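If you script the check, Arista's eAPI (and most NOSes via gNMI or JSON CLI output) will hand you the BGP summary as structured data. A minimal sketch of the assertion logic; the JSON shape (vrfs → default → peers → peerState) mirrors typical EOS "show ip bgp summary | json" output, but treat the exact field names as an assumption to verify against your firmware:

```python
# Sketch: confirm every expected spine neighbor is Established.
# Field names (vrfs/default/peers/peerState) are assumed from typical
# Arista eAPI JSON -- check them against your EOS version.

EXPECTED_SPINES = {"10.10.1.1", "10.10.2.1", "10.10.3.1", "10.10.4.1"}

def missing_spines(bgp_summary: dict, expected=EXPECTED_SPINES) -> set:
    """Return the expected spines that are NOT in Established state."""
    peers = bgp_summary.get("vrfs", {}).get("default", {}).get("peers", {})
    up = {ip for ip, p in peers.items() if p.get("peerState") == "Established"}
    return expected - up

# Example: one spine (10.10.4.1) stuck in Active
sample = {"vrfs": {"default": {"peers": {
    "10.10.1.1": {"peerState": "Established"},
    "10.10.2.1": {"peerState": "Established"},
    "10.10.3.1": {"peerState": "Established"},
    "10.10.4.1": {"peerState": "Active"},
}}}}
print(missing_spines(sample))  # -> {'10.10.4.1'}
```

Run it from a cron job or CI check and alert on a non-empty set; a flapping spine session shows up here long before it shows up as tail latency.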
Phase 2: QoS classification
Classify RoCE v2 packets (UDP port 4791) into the lossless priority class. Convention: DSCP 26 → traffic class 3.
Arista EOS:
qos map dscp 26 to traffic-class 3
qos map traffic-class 3 to cos 3
!
class-map type qos match-any roce
match dscp 26
!
policy-map type qos roce-classify
class roce
set traffic-class 3
!
interface Ethernet1-32
service-policy type qos input roce-classify
service-policy type qos output roce-classify
NVIDIA Spectrum:
nv set qos roce mode lossless
nv set qos roce congestion-control ecn
nv set qos roce traffic-class 3
nv set qos roce dscp 26
nv config apply
Spectrum's nv set qos roce is a shortcut that configures the whole pipeline (DSCP → TC, PFC, ECN) with sensible defaults. EOS gives you finer-grained control but more knobs to remember.
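The bit arithmetic behind the DSCP 26 convention is worth keeping at your fingertips when reading packet captures: DSCP occupies the upper 6 bits of the IP TOS / Traffic Class byte, and the lower 2 bits are the ECN field. A quick sketch:

```python
# DSCP sits in the top 6 bits of the TOS byte; the bottom 2 bits are ECN.
# DSCP 26 therefore appears as TOS 0x68 on the wire, or 0x6A once the
# sender sets ECT(0) -- which it must, or ECN marking can't happen.

ROCE_V2_UDP_PORT = 4791   # IANA-assigned destination port for RoCE v2
DSCP_ROCE = 26
ECT0 = 0b10               # ECN-Capable Transport (0)

tos_plain = DSCP_ROCE << 2
tos_ecn = (DSCP_ROCE << 2) | ECT0

print(hex(tos_plain))  # -> 0x68
print(hex(tos_ecn))    # -> 0x6a
```

So in a capture, "TOS 0x6a, UDP dport 4791" is a correctly-tagged RoCE v2 packet; "TOS 0x00" on that port means a host is misconfigured and its traffic is landing in the lossy default class.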
Verify:
show qos interface Ethernet1/1
show qos counters
You should see traffic landing on TC3 when a RoCE-aware sender hits the port.
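If you don't have RDMA hardware handy for the test, you can still exercise the classifier with plain UDP: send packets to port 4791 with the TOS byte set and watch the TC3 counters move. A hedged sketch (this is not RoCE traffic and proves nothing about RDMA itself, only that DSCP-based classification works; a real tester like ib_send_bw is the proper follow-up):

```python
import socket

# Plain UDP probe toward the RoCE v2 port with DSCP 26 (+ ECT(0)) in the
# TOS byte. Only exercises DSCP classification on the switch -- lets you
# watch TC3 counters increment without any RDMA hardware in the loop.
def send_probe(dst_ip: str, count: int = 1000) -> int:
    tos = (26 << 2) | 0b10            # DSCP 26 + ECT(0) = 0x6A
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, tos)
    payload = b"\x00" * 1024
    sent = 0
    for _ in range(count):
        sent += s.sendto(payload, (dst_ip, 4791))
    s.close()
    return sent
```

Point it at any host behind the port under test, then check show qos counters on the ingress switch; the TC3 packet count should rise by roughly the probe count.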
Phase 3: PFC + ECN on the RoCE priority
Now enable PFC pause and ECN WRED marking on TC3.
Arista EOS (per interface):
interface Ethernet1-32
priority-flow-control on
priority-flow-control priority 3 no-drop
priority-flow-control mode auto
priority-flow-control watchdog action drop timer 100
!
qos profile lossless
queue 3
ecn min 102400 max 1536000 probability 0.1
!
interface Ethernet1-32
service-policy type qos output lossless
That enables:
- PFC pauses on priority 3 (no-drop)
- PFC watchdog with 100 ms timeout (drops paused traffic if PFC is stuck — prevents fabric-wide deadlock)
- ECN WRED on TC3: mark CE between 100 KB and 1.5 MB queue depth, with up to 10% probability
NVIDIA Spectrum:
nv set qos roce pfc priority 3
nv set qos roce pfc mode lossless
nv set qos roce pfc watchdog timer 100
nv set qos roce ecn threshold-min 102400
nv set qos roce ecn threshold-max 1536000
nv set qos roce ecn probability 10
nv config apply
Same conceptual config, fewer commands. NVOS has more AI-fabric defaults baked in.
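On both vendors the ECN behavior is a linear WRED ramp between the min and max thresholds. A sketch of the idealized model (real ASICs work on an averaged queue depth, and some implementations jump to marking every packet above the max threshold rather than capping at the configured probability):

```python
def ecn_mark_probability(queue_bytes: int,
                         kmin: int = 102_400,    # 100 KB: start marking
                         kmax: int = 1_536_000,  # 1.5 MB: max marking
                         pmax: float = 0.10) -> float:
    """Idealized WRED/ECN curve: 0 below kmin, linear ramp to pmax at kmax.
    Note: some ASICs mark with probability 1.0 above kmax instead."""
    if queue_bytes <= kmin:
        return 0.0
    if queue_bytes >= kmax:
        return pmax
    return pmax * (queue_bytes - kmin) / (kmax - kmin)

print(ecn_mark_probability(50_000))     # -> 0.0  (below kmin: no marking)
print(ecn_mark_probability(819_200))    # -> 0.05 (midpoint of the ramp)
print(ecn_mark_probability(2_000_000))  # -> 0.1  (at/above kmax)
```

The shape explains the tuning intuition: lowering kmin makes senders back off earlier (lower latency, lower throughput); raising pmax makes the back-off more aggressive per unit of queue depth.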
Verify PFC:
show priority-flow-control counters
show priority-flow-control interface Ethernet1/1
Expect:
- PFC RX and TX counters near zero in steady state (ECN should mostly prevent PFC)
- Spikes during induced congestion tests (good: PFC is working)
Verify ECN:
show qos queue counters
show interface Ethernet1/1 counters | inc ECN
You should see the ECN-marked counter rising under load.
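"Rising" is easier to judge as a rate than as a raw counter. A sketch that turns two counter samples into a marking fraction (the counter names here are hypothetical; map them to whatever your platform's show output exposes):

```python
def ecn_mark_fraction(before: dict, after: dict) -> float:
    """Fraction of TC3 packets ECN-marked between two counter samples.
    Counter key names are placeholders for your platform's counters."""
    marked = after["ecn_marked"] - before["ecn_marked"]
    total = after["tc3_tx_packets"] - before["tc3_tx_packets"]
    return marked / total if total else 0.0

t0 = {"ecn_marked": 1_000, "tc3_tx_packets": 2_000_000}
t1 = {"ecn_marked": 51_000, "tc3_tx_packets": 4_000_000}
print(ecn_mark_fraction(t0, t1))  # -> 0.025 (2.5% of packets marked)
```

A few percent under heavy load is the kind of number you want; near zero under congestion means ECN isn't marking (check thresholds), and tens of percent means the thresholds are too low for your buffer.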
Phase 4: Buffer profile
Decide how much of the switch's shared buffer goes to lossless vs lossy traffic.
Default profiles (vendor-supplied "AI" profiles) work for most clusters:
- NVIDIA Spectrum-X: nv set qos buffer-profile ai-roce enables the bundled profile
- Arista 7060X: qos profile lossless references the built-in lossless template
For custom tuning at scale:
| Setting | Typical | Effect |
|---|---|---|
| Lossless pool share | 40-60% of total buffer | More for RoCE; less for lossy |
| Headroom per port (per priority 3) | Auto-detected from cable length | Absorbs in-flight bytes after PAUSE |
| Min guarantee per port | ~50 KB | Prevents one port from starving others |
| Dynamic threshold (alpha) | 8 or 16 | How aggressively shared buffer reallocates |
Don't tune these on day one — they're for production debugging when AllReduce time variance is high.
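The "auto-detected from cable length" headroom row is just physics: after you send a PAUSE, bytes already on the wire keep arriving. A back-of-the-envelope sketch of the calculation (deliberately simplified; real ASICs add internal pipeline latency, the 802.1Qbb pause response time, and per-vendor fudge factors):

```python
def pfc_headroom_bytes(cable_m: float, speed_gbps: float = 400,
                       mtu: int = 9216) -> int:
    """Rough lower bound on bytes that still arrive after a PAUSE is sent.

    Simplified model: round-trip propagation delay on the cable converted
    to bytes at line rate, plus one max-size frame serializing on each
    side of the link. Real switches add ASIC latency on top of this.
    """
    prop_speed = 2.0e8                   # ~speed of light in fiber, m/s
    rtt_s = 2 * cable_m / prop_speed     # round-trip propagation delay
    bytes_per_s = speed_gbps * 1e9 / 8
    in_flight = rtt_s * bytes_per_s
    return int(in_flight + 2 * mtu)

print(pfc_headroom_bytes(3))     # short DAC inside a rack
print(pfc_headroom_bytes(100))   # 100 m fiber run at 400G
```

The scaling is the point: headroom grows linearly with both cable length and link speed, which is why long 400G runs eat noticeably more buffer than in-rack DACs, and why the vendors auto-detect it rather than asking you.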
The build checklist
After all four phases, verify the fabric end-to-end:
✓ All leaf-spine BGP sessions established
✓ ECMP routes visible (every leaf can reach every other via 4 spines)
✓ RoCE v2 traffic classifies to TC3 (verified with a generator)
✓ PFC priority 3 enabled on all ports
✓ ECN WRED on TC3 with sensible thresholds
✓ PFC watchdog enabled with 100 ms timeout
✓ Buffer profile loaded
✓ Counters baseline captured (so you know what "normal" looks like)
Skip the last step at your peril. Without a baseline, you'll spend hours debugging "is this counter normal?" the first time something goes wrong.
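A baseline doesn't need tooling; dumping counters to timestamped JSON and diffing later is enough. A sketch of the shape (how you actually collect the counters, via eAPI, gNMI, or SSH plus parsing, is up to you; the counter names here are illustrative):

```python
import json
import time

# Sketch: persist a counter snapshot, then diff a later sample against it.
# The value is having "normal" on disk before the first incident.

def snapshot(counters: dict, path: str) -> None:
    """Write counters plus a timestamp to a JSON file."""
    with open(path, "w") as f:
        json.dump({"ts": time.time(), "counters": counters}, f)

def diff(baseline: dict, current: dict) -> dict:
    """Per-counter delta vs. baseline; counters absent from the
    baseline count from zero."""
    return {k: v - baseline.get(k, 0) for k, v in current.items()}

base = {"pfc_tx_prio3": 0, "ecn_marked": 120}
now = {"pfc_tx_prio3": 4_213, "ecn_marked": 98_450}
print(diff(base, now))  # -> {'pfc_tx_prio3': 4213, 'ecn_marked': 98330}
```

Capture one snapshot at steady state with no traffic and one under a known-good load test; when something goes wrong later, the question "is this counter normal?" becomes a one-line diff.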
Vendor-specific gotchas
A short list of "things that bit me":
- Arista MLAG conflicts with BGP unnumbered — use one or the other.
- NVIDIA Spectrum-X has two QoS engines (Mellanox and standard). Use the Mellanox/roce one for RoCE — the other doesn't have PFC.
- Tomahawk-based white-box (Edgecore, Celestica) buffer profiles are very different from default Arista — read the SAI / SONiC docs carefully.
- Cisco Nexus PFC requires priority-flow-control mode on; auto doesn't work consistently across firmware versions.
What you should remember
- Four config phases: underlay → classification → PFC+ECN → buffer profile.
- DSCP 26 → traffic-class 3 is the de-facto standard for RoCE v2 (use it; don't invent your own).
- PFC watchdog with 100 ms timeout prevents fabric-wide deadlock from runaway PFC storms.
- ECN WRED thresholds depend on switch buffer size. Reference profiles work; custom tuning is a production-debugging activity.
- Capture a baseline before going live: counters at steady state with no traffic, then again under load.
Next: Configure the Hosts + Kubernetes → — BIOS, kernel, drivers, Operators, Multus, and the pod spec template.