
Fabric Load Balancing & Link Failover

How an AI fabric balances flows across spines, and what happens — second by second — when a spine link drops. Seven states, every layer named.

The 8 load-balancing techniques

AI fabrics rarely use just one. ECMP is the default, but the other seven exist because ECMP alone is brittle under synchronised AllReduce traffic. Production designs combine ECMP + adaptive routing + IP FRR + BFD.

• ECMP (per-flow): 5-tuple hash. The baseline. ✓ RDMA safe
• WCMP (per-flow): weighted by capacity. Asymmetric-friendly. ✓ RDMA safe
• Adaptive (per-flowlet): switch picks the least-loaded path in real time. ✓ RDMA safe
• Packet spray (per-packet): best balance, but reorder risk for RDMA RC. ✗ reorder risk
• Flowlet (per-flowlet): gap-based re-hashing. Limits reorder. ✓ RDMA safe
• DLB (per-flow): dynamic load balancing (Broadcom). ✓ RDMA safe
• IP FRR (failover): pre-computed backup. Sub-50 ms switchover. ✓ RDMA safe
• Random (per-flow): no consistency. Don't use. ✗ reorder risk
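The two RDMA-safe granularities above can be sketched in a few lines. This is a toy model, not a switch ASIC: the hash function, the 500 µs flowlet idle gap, and the path count are illustrative choices, not vendor defaults.

```python
import hashlib

def ecmp_path(src_ip, dst_ip, sport, dport, proto, n_paths):
    """Per-flow ECMP: hash the 5-tuple so every packet of a flow takes
    the same path (no reordering, but elephant flows can collide)."""
    key = f"{src_ip}|{dst_ip}|{sport}|{dport}|{proto}".encode()
    return int.from_bytes(hashlib.md5(key).digest()[:4], "big") % n_paths

class FlowletSwitch:
    """Flowlet switching: if a flow pauses longer than `gap_s`, the next
    burst may be re-hashed onto a new path without reordering packets
    that are already in flight."""
    def __init__(self, n_paths, gap_s=0.0005):
        self.n_paths, self.gap_s = n_paths, gap_s
        self.state = {}  # flow key -> (last_seen, path)

    def path(self, flow_key, now, salt=0):
        last = self.state.get(flow_key)
        if last is None or now - last[0] > self.gap_s:
            # New flowlet: free to pick a fresh (e.g. least-loaded) path.
            h = hashlib.md5(f"{flow_key}|{salt}".encode()).digest()
            p = int.from_bytes(h[:4], "big") % self.n_paths
        else:
            p = last[1]  # mid-flowlet: stay put to avoid reordering
        self.state[flow_key] = (now, p)
        return p
```

Note the trade-off the table encodes: per-flow hashing is reorder-free but coarse; flowlets recover some balance by re-hashing only at natural gaps in the flow.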
① Link failure cascade

A leaf-spine fibre cut triggers a cascade across five layers. Each layer has its own reaction time — the slowest layer determines the black-hole window.

[Interactive animation: "Link failure cascade — what breaks, in order". A leaf-spine link between four spines and two leaves fails; the detection and recovery layers react in sequence: PHY (ASIC) ~5 ms, OS/NOS ~50 ms, BFD ~300 ms, BGP ~600 ms, ECMP ~800 ms.]
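The cascade reduces to one number: packets are black-holed until the last layer still needed for recovery has reacted. A minimal sketch, using the approximate per-layer timings from the diagram above:

```python
# Approximate reaction time of each layer (ms), per the cascade above.
layers = {
    "PHY (ASIC)":   5,    # light loss detected in hardware
    "OS / NOS":     50,   # NOS marks the interface down
    "BFD":          300,  # 3 x 100 ms hellos missed
    "BGP":          600,  # withdraw propagated to neighbours
    "ECMP rebuild": 800,  # FIB next-hop group rewritten
}

# The black-hole window is set by the slowest layer, not the fastest.
black_hole_ms = max(layers.values())
print(f"black-hole window ~= {black_hole_ms} ms")  # 800 ms
```

This is why tuning only the fast layers buys nothing: the ~800 ms ECMP rebuild dominates until the layers above it are also accelerated.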
② ECMP rehash on spine failure

When a spine fails, ECMP redistributes flows across the survivors. With plain hash % N, losing 1 of 4 spines reassigns ~75% of flows — not just the ones that were on the failed path.

[Interactive diagram: "ECMP rehash — click any spine to fail it". One leaf feeding four spines; while all four are alive, ECMP is balanced and each spine carries ~25% of flows. Failing a spine shows the redistribution.]
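The ~75% figure can be reproduced with a toy modulo hash. The flow IDs and the hash are illustrative; the point is that hash % N renumbers almost every flow when N changes, because the index into the spine list shifts for survivors too.

```python
import hashlib

def spine_for(flow_id, spines):
    """Plain ECMP: index into the live-spine list with hash % N."""
    h = int.from_bytes(hashlib.md5(str(flow_id).encode()).digest()[:4], "big")
    return spines[h % len(spines)]

flows = range(10_000)
before = {f: spine_for(f, [1, 2, 3, 4]) for f in flows}  # all healthy
after  = {f: spine_for(f, [1, 2, 4])    for f in flows}  # spine 3 dead
moved = sum(before[f] != after[f] for f in flows)
print(f"{moved / len(flows):.0%} of flows changed spine")  # ~75%
```

Consistent hashing or resilient-hashing ECMP (supported on most modern ASICs) limits the churn to roughly the 25% of flows that were actually on the dead spine.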
③ Detection speed race

Three detection methods racing on a log timescale. Plain BGP timers are ~10 000× slower than hardware link-state. BFD or link-state is the only choice for an AI fabric.

[Interactive animation: "Detection speed race — three methods, same failure", on a log scale from 1 ms to 100 s. Link-state (PHY): hardware FIB removes the dead port. BFD 3×10 ms: ASIC-assisted offload. BFD 3×100 ms: the recommended baseline. BGP only (default): 30–90 s hold-down — the job always crashes.]
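The race is just a hello interval multiplied by a miss multiplier. The sketch below restates the figures from the race; the link-state and BGP numbers are approximate, and the worst-case BGP hold-down is shown.

```python
def bfd_detect_ms(tx_interval_ms, multiplier=3):
    """BFD declares the session down after `multiplier` missed hellos."""
    return tx_interval_ms * multiplier

methods = {
    "link-state (PHY)":        1,                    # hardware, ~1 ms
    "BFD 3x10 ms (offload)":   bfd_detect_ms(10),    # 30 ms
    "BFD 3x100 ms (baseline)": bfd_detect_ms(100),   # 300 ms
    "BGP hold-down only":      90_000,               # 30-90 s; worst case
}
for name, ms in sorted(methods.items(), key=lambda kv: kv[1]):
    print(f"{name:26s} {ms:>7,} ms")
```

The spread is five orders of magnitude, which is why the choice of detector, not the routing protocol, sets the floor on convergence.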
④ Convergence race vs RDMA retry budget

Network convergence must finish before RDMA exhausts its NACK retry budget (~350 ms after 7 retransmits). Compare the tuned config (BFD active) vs the default (no BFD) — same failure, opposite outcomes.

[Interactive animation: "Convergence race — network repair vs RDMA NACK budget", tuned config (BFD 3×100 ms), timescale 0 → 800 ms. Network convergence (BFD + BGP withdraw + ECMP rebuild) finishes at ~280 ms; the RDMA NACK retry budget (7 retransmits, exponential backoff) expires at ~350 ms, when NCCL gives up.]
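The outcome of the race is a single comparison: does convergence finish before the NACK budget expires? A sketch using the numbers above; the 30 s default-case figure assumes the best-case end of the 30–90 s BGP hold-down range.

```python
NACK_BUDGET_MS = 350  # ~7 retransmits, exponential backoff; then NCCL gives up

def job_survives(convergence_ms):
    """The training job survives iff the fabric repairs itself
    before RDMA exhausts its retransmit budget."""
    return convergence_ms < NACK_BUDGET_MS

tuned_ms   = 280     # BFD 3x100 ms + BGP withdraw + ECMP rebuild
default_ms = 30_000  # no BFD: BGP hold-down alone, best case 30 s

print("tuned:  ", "survives" if job_survives(tuned_ms) else "job crashes")
print("default:", "survives" if job_survives(default_ms) else "job crashes")
```

The margin in the tuned case is only ~70 ms, which is why the mitigation stack below layers several mechanisms rather than relying on BFD alone.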
The mitigation stack — deploy these together

No single feature is enough. Production AI fabrics layer five mechanisms. The first three are non-negotiable.

1
BFD — sub-second detection
Non-negotiable. Sub-second detection beats BGP defaults by 100×. Configure 3×100 ms on every leaf-spine adjacency.
2
Adaptive routing
Where the ASIC supports it (Spectrum-3+, Tomahawk-4+). Solves the steady-state hash polarisation problem.
3
IP FRR — pre-computed backup
Sub-50 ms failover without waiting for BGP convergence. Especially useful when the dead link's ECMP rebuild is slow.
4
WCMP — capacity-weighted
When asymmetric capacity exists (e.g. partial failure), weights make surviving links carry proportionally more.
5
NCCL multiple QPs per peer
NCCL_IB_QPS_PER_CONNECTION=4 spreads each peer pair across 4 different UDP source ports — gives ECMP more flow diversity to hash.
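Item 5 is the only one configured on the hosts rather than the switches. NCCL reads the variable from the environment before the first collective, so it must be set in the launcher; a minimal sketch:

```python
import os

# Must be set before NCCL initialises (i.e. before the first collective).
# 4 queue pairs per peer -> 4 distinct RoCEv2 UDP source ports, so ECMP
# sees 4 independently hashable flows per peer pair instead of 1.
os.environ["NCCL_IB_QPS_PER_CONNECTION"] = "4"
```

In practice this goes in the job launch script (e.g. alongside the torchrun or mpirun invocation) so every rank inherits it.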

The two questions this page answers

  1. How is traffic distributed across the fabric? ECMP is the default — but synchronised AllReduce traffic + elephant flows make plain hashing brittle. State 1 surveys the eight alternatives, and State 6 covers adaptive routing in depth.
  2. What happens when a link fails? A leaf-spine drop triggers a sequence of events. The black-hole window — between T = 0 and convergence — is when packets are silently dropped. State 2 walks the timeline, State 3 shows the modulo-hash disruption, State 4 compares detection mechanisms, State 5 shows the race against RDMA retransmit, and State 7 lists the mitigation stack you should deploy together.

Three practical defaults

  • BFD with 3 × 100 ms is the minimum on every leaf-spine adjacency. Plain BGP timers (30–90 s) always crash the job.
  • NCCL_IB_QPS_PER_CONNECTION=4 spreads each peer pair across 4 different UDP source ports — buys hash diversity even before the switch helps.
  • Adaptive routing where the ASIC supports it. Spectrum-3+, Tomahawk-4+, Trident-5. The price tag pays for itself the first time a flow polarises.