
Understanding AI Fabric Architecture

A traditional DC fabric and an AI fabric look identical from across the room — spine-leaf, BGP underlay, ECMP. Up close, four design rules flip. Each animation below shows one of them.

What stays the same, what changes

| Element | Traditional DC | AI Fabric |
| --- | --- | --- |
| Topology | Spine-leaf Clos | Spine-leaf Clos ✓ same shape |
| Underlay | eBGP | eBGP ✓ same |
| Routing | ECMP (5-tuple hash) | ECMP ✓ same — but the failure modes change |
| Oversubscription | 4:1 typical | 1:1 non-blocking |
| NICs per server | 1–2, bonded | 8 — one per GPU rail |
| Traffic shape | Many short flows, statistically independent | Few huge elephant flows, synchronized |
| Tolerance for loss | TCP recovers | RDMA NACK = idle GPUs |

1. Oversubscription

Traditional DC fabrics oversubscribe — a leaf with 32 server-facing ports might carry only 8 spine uplinks (4:1), because most servers are mostly idle. AI workloads are the opposite: every NIC pushes line rate, simultaneously. That's why AI fabrics buy enough spine ports for 1:1 — and the spend is justified, because tail latency = idle GPUs = wasted money.
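The ratio itself is just server-facing capacity over spine-facing capacity per leaf. A quick sketch, where the port counts and the 400 Gbps port speed are illustrative assumptions, not any vendor's spec sheet:

```python
# Oversubscription ratio for one leaf: downlink capacity / uplink capacity.
# Port counts and 400 Gbps per port are illustrative assumptions.
def oversub_ratio(server_ports: int, uplink_ports: int, port_gbps: int = 400) -> float:
    downlink = server_ports * port_gbps   # worst-case offered load from servers
    uplink = uplink_ports * port_gbps     # capacity toward the spines
    return downlink / uplink

print(oversub_ratio(32, 8))   # traditional leaf: 4.0, i.e. 4:1
print(oversub_ratio(32, 32))  # AI leaf: 1.0, i.e. 1:1 non-blocking
```

At 4:1, if every server transmits at once, three quarters of the offered load has nowhere to go but the queues.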

[Animation: Oversubscription — same load, different fates. Traditional DC at 4:1 oversub: 32 server links (12.8 Tbps down) feeding 8 spine links (3.2 Tbps up). AI fabric at 1:1 non-blocking: 32 server links and 32 spine links (12.8 Tbps each way). Same offered load, different oversub ratio.]

2. Rail-optimized topology

A traditional server has one or two NICs, both on the same ToR. An AI server has 8 NICs, each going to its own dedicated leaf — one per GPU. Each "rail" is an independent fabric end to end. A link failure on rail 4 only impacts GPU 4; the other 7 GPUs keep training. That's the blast-radius story.
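The containment argument can be written down as a wiring table: NIC r on every server plugs into leaf r, so losing a leaf (or a rail link) strands exactly one GPU per server. Server and rail counts below are illustrative:

```python
# Rail-optimized wiring plan: (server, rail) -> leaf. NIC r always lands on leaf r.
# SERVERS and RAILS are illustrative assumptions.
SERVERS, RAILS = 4, 8
links = {(s, r): f"leaf{r}" for s in range(SERVERS) for r in range(RAILS)}

def impacted_gpus(failed_leaf: str):
    """GPUs whose only fabric path transits the failed leaf."""
    return sorted((s, r) for (s, r), leaf in links.items() if leaf == failed_leaf)

# Losing leaf 4 strands rail 4 on every server: 1/8 of the GPUs, no more.
print(impacted_gpus("leaf4"))  # [(0, 4), (1, 4), (2, 4), (3, 4)]
```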

[Animation: Rail-optimized topology — one GPU per rail. An AI server with 8× GPU and 8× RNIC, each NIC wired to its own leaf (Leaf 1–8). Each NIC has its own dedicated leaf — failure stays contained.]

3. Hash polarization

ECMP hashes the 5-tuple to spread flows evenly. With diverse traffic (web, API, DB), the law of large numbers makes this work great. With synchronized AllReduce — every GPU sending identically shaped flows at the same instant — many flows hash to the same path. One spine link saturates while three sit idle. This is the failure mode that destroys training throughput.
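You can reproduce the effect with any deterministic hash. In the toy model below, CRC32 stands in for the switch ASIC's ECMP hash; the diverse traffic gets random ephemeral ports, while the collective's eight flows all share the same port pair (4791, the RoCEv2 UDP port) — addresses and flow counts are made up for illustration:

```python
import random
import zlib

SPINES = 4

def ecmp_path(five_tuple) -> int:
    # CRC32 over the 5-tuple stands in for the switch ASIC's ECMP hash.
    return zlib.crc32(repr(five_tuple).encode()) % SPINES

def spread(flows):
    counts = [0] * SPINES
    for f in flows:
        counts[ecmp_path(f)] += 1
    return counts

random.seed(0)
# Diverse DC traffic: 10k flows with random ephemeral source ports.
diverse = [("10.0.0.1", "10.0.1.1", random.randint(1024, 65535), 443, "tcp")
           for _ in range(10_000)]
# Synchronized AllReduce: only 8 flows, all sharing the RoCEv2 port.
collective = [(f"10.0.0.{g}", f"10.0.1.{g}", 4791, 4791, "udp") for g in range(8)]

print(spread(diverse))     # close to 2500 per spine: law of large numbers
print(spread(collective))  # 8 flows on 4 paths: typically lopsided
```

With thousands of flows the bins even out; with eight, whichever spine draws two or three elephant flows saturates while another sits idle.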

[Animation: ECMP under synchronized collectives — hash polarization. Two leaves, four spines starting at 25% load each, handling two very different traffic shapes. ECMP doesn't fail mathematically — it fails when flows are synchronized.]

4. ECMP and link failures

When a spine link goes down, ECMP rehashes the affected flows onto the survivors. This is fine for TCP — the kernel retransmits. For RDMA RC, in-flight packets on the now-dead path arrive out of order on a surviving path; the receiver's RNIC sees a PSN gap and sends a NACK, and the sender goes back and retransmits the whole window (go-back-N). Multiply by every flow that was on the dead link and you get a short NACK storm.
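The size of that storm is roughly the bandwidth-delay product times the number of flows that were on the dead link. A back-of-envelope sketch, where every number is an illustrative assumption rather than a measurement:

```python
# Bytes replayed when RDMA RC's go-back-N fires. Numbers are illustrative.
def inflight_bytes(link_gbps: float, rtt_us: float) -> int:
    """Bandwidth-delay product: the window a sender must replay after a NACK."""
    return int(link_gbps * 1e9 / 8 * rtt_us * 1e-6)

window = inflight_bytes(400, 10)   # 400G link, 10 us RTT: 500 kB per flow
storm = window * 16                # assume 16 flows rehashed off the dead link
print(window, storm)               # 500000 8000000, i.e. ~8 MB replayed at once
```

That replay burst lands on links that just absorbed the rehashed flows, which is why the dip is sharp even when it is brief.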

[Animation: ECMP under link failure — flows rehash, RDMA may NACK. Two leaves, four spines at 25% load each; a spine link drops mid-training. Why AI fabrics minimise single-link failures with rail isolation + fast hardware retry.]

What hyperscalers do about it: rail isolation limits blast radius; adaptive routing and packet spraying (BlueField, Spectrum-X, Tomahawk-5) sidestep hash polarization; faster link-down detection (sub-millisecond LFI / BFD) and pre-computed FRR paths shrink the NACK window. Net result: a single spine link failure becomes a brief dip, not an outage.
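Packet spraying is the easiest of those mitigations to sketch: instead of pinning a whole flow to the spine its hash picks, successive packets round-robin across every spine, and the receiving NIC absorbs the reordering. The toy comparison below is not any vendor's algorithm, just the idea:

```python
import zlib

SPINES = 4

def flow_hash_paths(flow, n_packets: int):
    """Classic ECMP: every packet of a flow pins to the one spine its hash picks."""
    path = zlib.crc32(repr(flow).encode()) % SPINES
    return [path] * n_packets

def sprayed_paths(n_packets: int):
    """Per-packet spraying: packet i goes to spine i mod SPINES."""
    return [i % SPINES for i in range(n_packets)]

# One elephant flow, 1000 packets:
flow = ("10.0.0.1", "10.0.1.1", 4791, 4791, "udp")
print(len(set(flow_hash_paths(flow, 1000))))  # 1: the elephant rides one link
print(len(set(sprayed_paths(1000))))          # 4: load spread across all spines
```

The trade-off is out-of-order delivery, which is exactly what the DPU/switch hardware named above exists to hide from the RNIC.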

The deeper details — pod sizing, super-spine, oversub math, mitigations for hash polarization — live in the three pages after this one.