Configure the Fabric

You have the switches in the rack. They're cabled to servers and to each other. They're booted to factory defaults. This page is what you do next — the config that turns a pile of switches into a lossless RoCE v2 fabric.

The four phases:

  1. Management & underlay — get the switches reachable and talking to each other (BGP).
  2. QoS classification — make sure RoCE v2 traffic lands in the right queue.
  3. PFC + ECN — turn on lossless behavior on the RoCE priority.
  4. PFC watchdog + buffer profile — safety nets and tuning.

Vendor-specific syntax differs; the concepts are universal. This page shows Arista EOS and NVIDIA Spectrum (Cumulus / NVOS) side by side.


Phase 1: Management & BGP underlay

Get each switch a management IP and bring up eBGP between every leaf and every spine.

Arista EOS (per leaf):

hostname leaf-01
interface Management1
ip address 10.0.0.11/24
!
router bgp 65001
router-id 10.0.0.11
neighbor SPINES peer-group
neighbor SPINES remote-as 65000
neighbor SPINES bfd
neighbor 10.10.1.1 peer-group SPINES
neighbor 10.10.2.1 peer-group SPINES
neighbor 10.10.3.1 peer-group SPINES
neighbor 10.10.4.1 peer-group SPINES
maximum-paths 4 ecmp 4
address-family ipv4
neighbor SPINES activate
redistribute connected

NVIDIA Spectrum (Cumulus Linux 5.x — NVUE):

nv set router bgp autonomous-system 65001
nv set router bgp router-id 10.0.0.11
nv set vrf default router bgp neighbor 10.10.1.1 remote-as 65000
nv set vrf default router bgp neighbor 10.10.1.1 type numbered
(repeat the two neighbor lines for 10.10.2.1, 10.10.3.1, and 10.10.4.1)
nv set vrf default router bgp address-family ipv4-unicast multipath ebgp 4
nv config apply

Verify (both vendors):

show ip bgp summary
show ip route bgp

Look for all 4 spines as established neighbors, and ECMP routes for every leaf prefix. If ECMP isn't visible, the maximum-paths knob is wrong.
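This check is easy to script against the CLI output. A minimal sketch, assuming an FRR/EOS-style summary table (the sample text below is illustrative, not real device output):

```python
import re

# Sample `show ip bgp summary` output. Format varies by vendor and
# version; this sketch assumes an FRR/EOS-style column layout.
SUMMARY = """\
Neighbor     V  AS    MsgRcvd MsgSent  Up/Down  State/PfxRcd
10.10.1.1    4  65000   1200    1198   01:02:03        24
10.10.2.1    4  65000   1190    1187   01:02:01        24
10.10.3.1    4  65000   1210    1205   01:01:58        24
10.10.4.1    4  65000   1188    1190   01:02:05        24
"""

def established_spines(summary: str) -> list[str]:
    """Return neighbors whose State/PfxRcd column is a number.

    FRR/EOS print a prefix count in that column only once the session
    reaches Established; otherwise it shows a state name like Active.
    """
    peers = []
    for line in summary.splitlines():
        m = re.match(r"(\d+\.\d+\.\d+\.\d+)\s.*\s(\S+)$", line)
        if m and m.group(2).isdigit():
            peers.append(m.group(1))
    return peers

print(established_spines(SUMMARY))  # expect all four spine IPs
```

Wire the same check into your provisioning pipeline so a half-up fabric fails loudly instead of silently losing ECMP width.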


Phase 2: QoS classification

Classify RoCE v2 packets (UDP port 4791) into the lossless priority class. Convention: DSCP 26 → traffic class 3.
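DSCP lives in the upper six bits of the IP TOS / traffic-class byte, with the low two bits used for ECN. A quick sanity check of what DSCP 26 looks like on the wire:

```python
DSCP_ROCE = 26        # convention from the text: DSCP 26 -> traffic class 3

# DSCP occupies bits 7..2 of the IPv4 TOS byte; bits 1..0 are the ECN field.
ECN_ECT0 = 0b10       # ECT(0): sender advertises ECN capability
tos_byte = (DSCP_ROCE << 2) | ECN_ECT0

print(hex(tos_byte))  # 0x6a is what a packet capture should show
```

Useful when you're staring at a pcap: a TOS byte of 0x68 or 0x6a both decode to DSCP 26; the difference is only the ECN bits.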

Arista EOS:

qos map dscp 26 to traffic-class 3
qos map traffic-class 3 to cos 3
!
class-map type qos match-any roce
match dscp 26
!
policy-map type qos roce-classify
class roce
set traffic-class 3
!
interface Ethernet1-32
service-policy type qos input roce-classify

NVIDIA Spectrum:

nv set qos roce mode lossless
nv set qos roce congestion-control ecn
nv set qos roce traffic-class 3
nv set qos roce dscp 26
nv config apply

Spectrum's nv set qos roce is a shortcut that configures the whole pipeline (DSCP → TC, PFC, ECN) with sensible defaults. EOS gives you finer-grained control but more knobs to remember.

Verify:

show qos interface Ethernet1
show qos counters

You should see traffic landing on TC3 when a RoCE-aware sender hits the port.
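If you don't have a RoCE NIC on hand, a plain UDP packet marked like RoCE v2 (destination port 4791, DSCP 26) is enough to exercise the classifier. A minimal Linux sketch; the destination address is a placeholder, so point it at a host behind the port under test:

```python
import socket

# Not real RoCE, just a UDP probe marked like RoCE v2 traffic
# (UDP dst 4791, DSCP 26), enough to light up the TC3 counters.
ROCE_V2_PORT = 4791
TOS = 26 << 2  # DSCP 26 in the upper 6 bits of the TOS byte

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, TOS)
# Placeholder destination: replace 127.0.0.1 with a host behind the switch port.
sock.sendto(b"x" * 1024, ("127.0.0.1", ROCE_V2_PORT))
sock.close()
```

Send a burst, then re-read the TC3 queue counters; they should move only on the ingress port carrying the probe.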


Phase 3: PFC + ECN on the RoCE priority

Now enable PFC pause and ECN WRED marking on TC3.

Arista EOS (per interface):

interface Ethernet1-32
priority-flow-control mode on
priority-flow-control priority 3 no-drop
priority-flow-control watchdog action drop timer 100
!
qos profile lossless
queue 3
ecn min 102400 max 1536000 probability 0.1
!
interface Ethernet1-32
service-policy type qos output lossless

That enables:

  • PFC pauses on priority 3 (no-drop)
  • PFC watchdog with 100 ms timeout (drops paused traffic if PFC is stuck — prevents fabric-wide deadlock)
  • ECN WRED on TC3: mark CE between 100 KB and 1.5 MB queue depth, with up to 10% probability
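The marking behavior those three numbers define is a linear ramp. A sketch (real switches apply this to an averaged queue depth; the instantaneous depth is used here for clarity):

```python
ECN_MIN = 102_400     # bytes: below this, never mark
ECN_MAX = 1_536_000   # bytes: at or above this, mark at full probability
P_MAX = 0.10          # configured 10% maximum marking probability

def mark_probability(queue_depth: int) -> float:
    """WRED-style linear ramp between the min and max thresholds."""
    if queue_depth <= ECN_MIN:
        return 0.0
    if queue_depth >= ECN_MAX:
        return P_MAX
    return P_MAX * (queue_depth - ECN_MIN) / (ECN_MAX - ECN_MIN)

print(mark_probability(50_000))     # below min: 0.0
print(mark_probability(819_200))    # mid-ramp: ~0.05
print(mark_probability(2_000_000))  # above max: 0.1
```

The takeaway: senders see a gentle, depth-proportional CE signal well before the queue is deep enough to trigger PFC.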

NVIDIA Spectrum:

nv set qos roce pfc priority 3
nv set qos roce pfc mode lossless
nv set qos roce pfc watchdog timer 100
nv set qos roce ecn threshold-min 102400
nv set qos roce ecn threshold-max 1536000
nv set qos roce ecn probability 10
nv config apply

Same conceptual config, fewer commands. NVOS has more AI-fabric defaults baked in.

Verify PFC:

show priority-flow-control counters
show priority-flow-control interface Ethernet1

Expect:

  • PFC RX and TX counters near zero in steady state (ECN should mostly prevent PFC)
  • Spikes during induced congestion tests (good — PFC is working)

Verify ECN:

show qos queue counters
show interface Ethernet1 counters | inc ECN

You should see the ECN-marked counter rising under load.


Phase 4: Buffer profile

Decide how much of the switch's shared buffer goes to lossless vs lossy traffic.

Default profiles (vendor-supplied "AI" profiles) work for most clusters:

  • NVIDIA Spectrum-X: nv set qos buffer-profile ai-roce — enables the bundled profile
  • Arista 7060X: qos profile lossless — references the built-in lossless template

For custom tuning at scale:

Setting                          Typical                           Effect
Lossless pool share              40-60% of total buffer            More for RoCE; less for lossy
Headroom per port (priority 3)   Auto-detected from cable length   Absorbs in-flight bytes after PAUSE
Min guarantee per port           ~50 KB                            Prevents one port from starving others
Dynamic threshold (alpha)        8 or 16                           How aggressively shared buffer reallocates

Don't tune these on day one — they're for production debugging when AllReduce time variance is high.
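For intuition on the alpha knob: shared-buffer ASICs commonly use the classic dynamic-threshold model, where a queue may grow only up to alpha times the currently free shared buffer. A sketch of that model (a simplification; real implementations also clamp against headroom and minimum guarantees):

```python
def dynamic_limit(alpha: float, free_shared_bytes: int) -> float:
    """Classic dynamic-threshold model: a queue's cap is
    alpha * (currently free shared buffer). As queues fill, free
    buffer shrinks, so the cap shrinks too, which is the
    self-balancing property that makes the scheme safe to share."""
    return alpha * free_shared_bytes

# With 32 MB of shared buffer free, alpha=8 vs alpha=16:
free = 32 * 1024 * 1024
print(dynamic_limit(8, free))
print(dynamic_limit(16, free))
```

Higher alpha lets one hot queue grab buffer aggressively (good for absorbing a single incast); lower alpha keeps the allocation fairer across ports.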


The build checklist

After all four phases, verify the fabric end-to-end:

✓ All leaf-spine BGP sessions established
✓ ECMP routes visible (every leaf can reach every other via 4 spines)
✓ RoCE v2 traffic classifies to TC3 (verified with a generator)
✓ PFC priority 3 enabled on all ports
✓ ECN WRED on TC3 with sensible thresholds
✓ PFC watchdog enabled with 100 ms timeout
✓ Buffer profile loaded
✓ Counters baseline captured (so you know what "normal" looks like)

Skip the last step at your peril. Without a baseline, you'll spend hours debugging "is this counter normal?" the first time something goes wrong.
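Baselining is mostly bookkeeping: snapshot the counters, store them, and later compare deltas rather than absolutes. A sketch with stubbed counter names (hypothetical; substitute whatever your platform's JSON-capable CLI actually returns, e.g. `| json` on EOS or `--output json` with NVUE):

```python
import json

def snapshot() -> dict:
    # Stub: replace with a parse of the real counter output.
    # These key names are illustrative, not a vendor schema.
    return {"pfc_rx_prio3": 0, "pfc_tx_prio3": 0, "ecn_marked_tc3": 0}

def delta(before: dict, after: dict) -> dict:
    """Counter movement between two snapshots; compare deltas, not absolutes."""
    return {k: after[k] - before[k] for k in before}

baseline = snapshot()
with open("baseline.json", "w") as f:  # keep this file with the build records
    json.dump(baseline, f)

# ...run load, then take a second snapshot and diff it...
print(delta(baseline, snapshot()))  # all zeros with this static stub
```

Capture one snapshot at steady state with no traffic and one under known-good load; those two files define "normal" for this fabric.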


Vendor-specific gotchas

A short list of "things that bit me":

  • Arista MLAG conflicts with BGP unnumbered — use one or the other.
  • NVIDIA Spectrum-X has two QoS engines (Mellanox and standard). Use the Mellanox/roce one for RoCE — the other doesn't have PFC.
  • Tomahawk-based white-box (Edgecore, Celestica) buffer profiles are very different from default Arista — read the SAI / SONiC docs carefully.
  • Cisco Nexus PFC requires priority-flow-control mode on; mode auto doesn't work consistently across firmware versions.

What you should remember

  • Four config phases: underlay → classification → PFC+ECN → buffer profile.
  • DSCP 26 → traffic-class 3 is the de facto standard for RoCE v2 (use it; don't invent your own).
  • PFC watchdog with 100 ms timeout prevents fabric-wide deadlock from runaway PFC storms.
  • ECN WRED thresholds depend on switch buffer size. Reference profiles work; custom tuning is a production-debugging activity.
  • Capture a baseline before going live. Counters at steady state, no traffic, then again under load.

Next: Configure the Hosts + Kubernetes → BIOS, kernel, drivers, Operators, Multus, and the pod spec template.