Understanding AI Fabric Architecture
A traditional DC fabric and an AI fabric look identical from across the room — spine-leaf, BGP underlay, ECMP. Up close, four design rules flip. Each animation below shows one of them.
What stays the same, what changes
| Element | Traditional DC | AI Fabric |
|---|---|---|
| Topology | Spine-leaf CLOS | Spine-leaf CLOS ✓ same shape |
| Underlay | eBGP | eBGP ✓ same |
| Routing | ECMP (5-tuple hash) | ECMP ✓ same — but the failure modes change |
| Oversubscription | 4:1 typical | 1:1 non-blocking |
| NICs per server | 1–2 bonded | 8 — one per GPU rail |
| Traffic shape | Many short flows, statistically independent | Few huge elephant flows, synchronized |
| Tolerance for loss | TCP recovers | RDMA NACK = idle GPUs |
1. Oversubscription
Traditional DC fabrics oversubscribe: a leaf with 32 server-facing ports and only 8 spine uplinks runs at 4:1, and that's fine because most servers are mostly idle. AI workloads are the opposite: every NIC pushes line rate, simultaneously. That's why AI fabrics buy enough spine ports for 1:1, and the spend is justified because tail latency = idle GPUs = wasted money.
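A back-of-the-envelope sketch of that math (not any vendor's sizing tool), assuming all downlink and uplink ports run at the same speed:

```python
# Rough sketch: leaf oversubscription ratio, assuming equal port speeds.
def oversubscription(server_ports: int, uplink_ports: int) -> float:
    """Downlink bandwidth divided by uplink bandwidth."""
    return server_ports / uplink_ports

# Traditional leaf: 32 server-facing ports, 8 spine uplinks -> 4:1.
print(oversubscription(32, 8))   # 4.0

# AI leaf: every GPU NIC can push line rate at the same instant, so uplink
# capacity must match downlink capacity -> 1:1 (32 server ports, 32 spine ports).
print(oversubscription(32, 32))  # 1.0
```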
2. Rail-optimized topology
A traditional server has one or two NICs, both on the same ToR. An AI server has 8 NICs, each going to its own dedicated leaf — one per GPU. Each "rail" is an independent fabric end to end. A link failure on rail 4 only impacts GPU 4; the other 7 GPUs keep training. That's the blast-radius story.
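A minimal sketch of the rail idea, using hypothetical names (rail-leaf-N, one NIC per GPU): GPU i on every server wires to the same rail leaf, so a failure on one rail touches exactly one GPU per server.

```python
# Sketch of rail-optimized wiring and its blast radius (hypothetical names).
NUM_RAILS = 8  # one rail per GPU / NIC

def rail_leaf_for(gpu_index: int) -> str:
    """GPU i on every server plugs into the same dedicated leaf, 'rail-leaf-i'."""
    return f"rail-leaf-{gpu_index}"

def gpus_affected_by_rail_failure(failed_rail: int, num_servers: int) -> list[tuple[int, int]]:
    """Return (server, gpu) pairs that lose fabric connectivity when one rail fails."""
    return [(server, failed_rail) for server in range(num_servers)]

# Rail 4 fails across a 16-server pod: only GPU 4 on each server is affected;
# GPUs 0-3 and 5-7 keep training over their own, independent rails.
print(gpus_affected_by_rail_failure(4, 16))
```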
3. Hash polarization
ECMP hashes the 5-tuple to spread flows evenly. With diverse traffic (web, API, DB), the law of large numbers makes this work great. With synchronized AllReduce — every GPU sending identically shaped flows at the same instant — many flows hash to the same path. One spine link saturates while three sit idle. This is the failure mode that destroys training throughput.
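A toy simulation of that effect, using a CRC32 stand-in for the switch ASIC's 5-tuple hash and made-up addresses. The point is only that a thousand diverse flows spread out by the law of large numbers, while a handful of identical, synchronized flows land lumpily on four uplinks:

```python
# Toy ECMP polarization demo (CRC32 stands in for the ASIC's deterministic hash).
import random
import zlib
from collections import Counter

NUM_UPLINKS = 4

def ecmp_pick(five_tuple: tuple) -> int:
    """Deterministic per-flow hash onto one of the spine uplinks."""
    return zlib.crc32(repr(five_tuple).encode()) % NUM_UPLINKS

# Diverse traffic: 1000 flows with varied source ports -> roughly even spread.
diverse = [("10.0.0.1", f"10.0.1.{i % 200}", random.randint(1024, 65535), 443, "tcp")
           for i in range(1000)]
print("diverse:", Counter(ecmp_pick(f) for f in diverse))

# Synchronized AllReduce: 8 GPUs, one flow each, same port pair -> only 8 distinct
# 5-tuples, so the spread is lumpy by chance rather than evened out statistically.
allreduce = [(f"10.0.0.{gpu}", f"10.0.1.{gpu}", 4791, 4791, "udp") for gpu in range(8)]
print("allreduce:", Counter(ecmp_pick(f) for f in allreduce))
```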
4. ECMP and link failures
When a spine link goes down, ECMP rehashes the affected flows onto the survivors. This is fine for TCP: the kernel retransmits. For RDMA RC, packets already in flight on the dead path are lost, and the flow's later packets, rehashed onto a surviving path, arrive with a gap in the packet sequence numbers; the receiver's RNIC sees the PSN gap and sends a NACK, and the sender replays everything from the missing PSN onward (go-back-N). Multiply by every flow on the dead link and you get a short NACK storm.
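A rough, assumed-numbers sketch of what one PSN gap costs a single RC flow under go-back-N: the 400 Gb/s rate, 10 µs of unacknowledged data, and 4 KB MTU below are illustrative, not measured.

```python
# Back-of-the-envelope cost of one go-back-N retransmission (assumed numbers).
def goback_n_retransmit_bytes(in_flight_packets: int, mtu_bytes: int) -> int:
    """RC semantics: a NACK replays everything from the missing PSN onward,
    so in the worst case the whole in-flight window is resent."""
    return in_flight_packets * mtu_bytes

# Example: a 400 Gb/s NIC with ~10 us of unacknowledged data and a 4 KB MTU
# has roughly 500 KB in flight; a single PSN gap can force all of it to be replayed.
bdp_bytes = int(400e9 / 8 * 10e-6)        # bandwidth-delay product ~= 500 KB
packets_in_flight = bdp_bytes // 4096
print(goback_n_retransmit_bytes(packets_in_flight, 4096))  # ~500 KB per flow
```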
What hyperscalers do about it: rail isolation limits blast radius; adaptive routing and packet spraying (BlueField, Spectrum-X, Tomahawk-5) sidestep hash polarization; faster link-down detection (sub-millisecond LFI / BFD) and pre-computed FRR paths shrink the NACK window. Net result: a single spine link failure becomes a brief dip, not an outage.
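To make the spraying point concrete, here is a toy comparison (not any vendor's implementation) of per-flow hashing versus per-packet spraying for the same eight synchronized flows; spraying only works because the receiving side, e.g. a DPU, tolerates or restores packet order.

```python
# Toy comparison: per-flow ECMP hashing vs. per-packet spraying over 4 uplinks.
import zlib
from collections import Counter

UPLINKS = 4
PACKETS_PER_FLOW = 1000
flows = [(f"10.0.0.{g}", f"10.0.1.{g}", 4791) for g in range(8)]

# Per-flow hashing: every packet of a flow sticks to the uplink its 5-tuple hashed to.
per_flow = Counter()
for f in flows:
    per_flow[zlib.crc32(repr(f).encode()) % UPLINKS] += PACKETS_PER_FLOW

# Per-packet spraying: each packet round-robins across uplinks, so no flow pins a path.
sprayed = Counter()
for f in flows:
    for seq in range(PACKETS_PER_FLOW):
        sprayed[seq % UPLINKS] += 1

print("per-flow hash:", per_flow)   # typically lumpy: some uplinks overloaded
print("packet spray:", sprayed)     # even: 2000 packets per uplink
```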
The deeper details — pod sizing, super-spine, oversub math, mitigations for hash polarization — live in the three pages after this one.