Fabric Load Balancing & Link Failover
How an AI fabric balances flows across spines, and what happens — second by second — when a spine link drops. Seven states, every layer named.
AI fabrics rarely use just one load-balancing scheme. ECMP is the default, but seven alternatives exist because ECMP alone is brittle under synchronised AllReduce traffic. Production designs combine ECMP + adaptive routing + IP FRR + BFD.
A leaf-spine fibre cut triggers a cascade across five layers. Each layer has its own reaction time — the slowest layer determines the black-hole window. Press play to walk through.
Click any spine to fail it. Watch how ECMP redistributes flows across the survivors. With plain hash % N, losing 1 of 4 spines reassigns 75% of flows — not just the ones that were on the failed path.
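The 75% figure falls out of modulo arithmetic: a flow keeps its path only when `hash % 4 == hash % 3`, which holds for just 3 residues out of every 12. A quick simulation (a sketch of the principle, not any particular switch's hash function):

```python
import random

def spine_for(flow_hash: int, n_spines: int) -> int:
    # Plain modulo hashing: no resilience to ECMP group size changes.
    return flow_hash % n_spines

random.seed(0)
flows = [random.getrandbits(32) for _ in range(100_000)]

before = [spine_for(h, 4) for h in flows]   # 4 healthy spines
after  = [spine_for(h, 3) for h in flows]   # one spine lost, table shrinks to 3

moved = sum(b != a for b, a in zip(before, after))
print(f"{moved / len(flows):.1%} of flows changed path")   # ~75%
```

Resilient (consistent) hashing for ECMP groups, which many datacentre ASICs support, remaps only the roughly 25% of flows that actually transited the failed spine.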
Three detection methods racing on a log timescale. Plain BGP timers are ~10 000× slower than hardware link-state. BFD or link-state is the only choice for an AI fabric.
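The back-of-envelope numbers behind that ratio (the values below are illustrative assumptions, not measurements):

```python
# Worst-case failure detection time for each method (illustrative values).
detect_ms = {
    "hardware link-state (loss of light)": 9,        # assumed: ~10 ms
    "BFD, 3 x 100 ms":                     3 * 100,  # detect-multiplier x interval
    "BGP hold timer (default 90 s)":       90_000,
}

fastest = min(detect_ms.values())
for method, ms in detect_ms.items():
    print(f"{method:38s} {ms:>7} ms  ({ms // fastest:>6,}x the fastest)")
```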
Network convergence must finish before RDMA exhausts its NACK retry budget (~350 ms after 7 retransmits). Compare the tuned config (BFD active) vs the default (no BFD) — same failure, opposite outcomes.
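The race reduces to simple arithmetic. Assuming reroute after detection takes about 10 ms (an ECMP next-hop group fixup; an assumed figure, not a measured one):

```python
RETRY_BUDGET_MS = 350   # RDMA NACK retry budget from the text (~7 retransmits)
REROUTE_MS = 10         # assumed: time to repoint traffic once the failure is detected

def black_hole_ms(detect_ms: int) -> int:
    # Black-hole window = time to notice the dead link + time to repoint traffic.
    return detect_ms + REROUTE_MS

tuned   = black_hole_ms(3 * 100)   # BFD 3 x 100 ms detection
default = black_hole_ms(90_000)    # BGP hold timer, no BFD

for name, window in [("tuned (BFD)", tuned), ("default (no BFD)", default)]:
    verdict = "converges in time" if window < RETRY_BUDGET_MS else "QPs error out, job crashes"
    print(f"{name:17s} black-hole window {window:>6} ms -> {verdict}")
```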
No single feature is enough. Production AI fabrics layer five mechanisms. The first three are non-negotiable.
The two questions this page answers
- How is traffic distributed across the fabric? ECMP is the default — but synchronised AllReduce traffic + elephant flows make plain hashing brittle. State 1 surveys all eight schemes, and State 6 covers adaptive routing in depth.
- What happens when a link fails? A leaf-spine drop triggers a sequence of events. The black-hole window — between T = 0 and convergence — is when packets are silently dropped. State 2 walks the timeline, State 3 shows the modulo-hash disruption, State 4 compares detection mechanisms, State 5 shows the race against RDMA retransmit, and State 7 lists the mitigation stack you should deploy together.
Three practical defaults
- BFD with 3 × 100 ms is the minimum on every leaf-spine adjacency. Plain BGP timers (30–90 s) always crash the job.
- `NCCL_IB_QPS_PER_CONNECTION=4` spreads each peer pair across 4 different UDP source ports — buys hash diversity even before the switch helps.
- Adaptive routing where the ASIC supports it. Spectrum-3+, Tomahawk-4+, Trident-5. The price tag pays for itself the first time a flow polarises.
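As a concrete starting point, the BFD default above might look like this in FRR-style syntax (addresses, interface names, and AS numbers are placeholders):

```
bfd
 peer 10.0.1.1 interface swp1
  detect-multiplier 3
  receive-interval 100
  transmit-interval 100
 !
!
router bgp 65101
 neighbor 10.0.1.1 remote-as 65201
 neighbor 10.0.1.1 bfd
```

With `neighbor ... bfd`, a BFD session-down event tears the BGP adjacency immediately rather than waiting out the hold timer.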