Understanding AI Fabric Architecture

A traditional DC fabric and an AI fabric look identical from across the room — spine-leaf, BGP underlay, ECMP. Up close, four design rules flip. Each animation below shows one of them.

What stays the same, what changes

Element	Traditional DC	AI Fabric
Topology	Spine-leaf Clos	Spine-leaf Clos ✓ same shape
Underlay	eBGP	eBGP ✓ same
Routing	ECMP (5-tuple hash)	ECMP ✓ same — but the failure modes change
Oversubscription	4:1 typical	1:1 non-blocking
NICs per server	1–2 bonded	8 — one per GPU rail
Traffic shape	Many short flows, statistically independent	Few huge elephant flows, synchronized
Tolerance for loss	TCP recovers	RDMA NACK = idle GPUs

1. Oversubscription

Traditional DC fabrics oversubscribe: a leaf with 32 server-facing ports has only 8 spine uplinks of the same speed (4:1), because most servers are idle most of the time. AI workloads are the opposite: every NIC pushes line rate simultaneously. That's why AI fabrics buy enough spine ports for 1:1, and the spend is justified because congestion means tail latency, tail latency means idle GPUs, and idle GPUs are wasted money.

Oversubscription — same load, different fates

Same offered load, different oversub ratio
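The arithmetic behind the ratio can be sketched in a few lines. This is a minimal model, not a sizing tool: it assumes every port runs at the same speed, all traffic leaves the leaf, and the uplinks are shared evenly. The function names and the 400 Gbps port speed are illustrative assumptions.

```python
def oversubscription_ratio(downlinks, uplinks):
    """Server-facing capacity divided by spine-facing capacity (equal port speeds)."""
    return downlinks / uplinks


def per_port_throughput(downlinks, uplinks, port_gbps, offered_fraction):
    """Gbps each server port actually gets through the leaf uplinks.

    Assumes every port offers the same fraction of line rate and the
    uplink capacity is shared evenly between them.
    """
    offered = downlinks * port_gbps * offered_fraction
    capacity = uplinks * port_gbps
    return min(offered, capacity) / downlinks


PORT_GBPS = 400  # hypothetical port speed

# Traditional 4:1 leaf, servers mostly idle (20% of line rate): nobody notices.
print(per_port_throughput(32, 8, PORT_GBPS, 0.2))   # offered 80 Gbps, gets 80 Gbps
# Same leaf, every NIC at line rate: each port is throttled to a quarter.
print(per_port_throughput(32, 8, PORT_GBPS, 1.0))   # offered 400 Gbps, gets 100 Gbps
# 1:1 leaf, every NIC at line rate: full throughput.
print(per_port_throughput(32, 32, PORT_GBPS, 1.0))  # offered 400 Gbps, gets 400 Gbps
```

The same 4:1 leaf is invisible at 20% load and a 4x slowdown at 100% load, which is the load pattern AI training produces.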

2. Rail-optimized topology

A traditional server has one or two NICs, both on the same ToR. An AI server has 8 NICs, one per GPU, each going to its own dedicated leaf. Each "rail" is an independent fabric end to end. A link failure on rail 4 directly impacts only GPU 4's network path; the other 7 GPUs keep their full network bandwidth. That's the blast-radius story.

Rail-optimized topology — one GPU per rail

Each NIC has its own dedicated leaf — failure stays contained
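The wiring rule and its blast radius can be written down as a small mapping. This is a sketch under assumptions: rails are numbered from 0, the leaf names (`leaf-rail4`, `tor-0`) are made up for illustration, and the "traditional" case puts all of a server's NICs on one ToR.

```python
def rail_optimized_wiring(num_servers, gpus_per_server=8):
    """Map (server, gpu) -> leaf name, with GPU n on every server using rail n."""
    return {
        (server, gpu): f"leaf-rail{gpu}"
        for server in range(num_servers)
        for gpu in range(gpus_per_server)
    }


def single_tor_wiring(num_servers, gpus_per_server=8):
    """Map (server, gpu) -> leaf name, with every NIC in the rack on one ToR."""
    return {
        (server, gpu): "tor-0"
        for server in range(num_servers)
        for gpu in range(gpus_per_server)
    }


def gpus_cut_off(wiring, failed_leaf):
    """GPUs whose only network path went through the failed leaf."""
    return sorted(endpoint for endpoint, leaf in wiring.items() if leaf == failed_leaf)


rails = rail_optimized_wiring(num_servers=4)
tor = single_tor_wiring(num_servers=4)

print(gpus_cut_off(rails, "leaf-rail4"))  # only GPU 4 on each server
print(len(gpus_cut_off(tor, "tor-0")))    # every GPU in the rack
```

With 4 servers, losing one rail leaf cuts off 4 of 32 GPUs (one per server), while losing the shared ToR cuts off all 32.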

3. Hash polarization

ECMP hashes the 5-tuple to pin each flow to one path, and relies on flow diversity to spread load evenly. With diverse traffic (web, API, DB), the law of large numbers makes this work well. With synchronized AllReduce, every GPU sends identically shaped flows at the same instant: there are only a few flows, each one huge, so a handful of hash collisions is enough to put several of them on the same path. One spine link saturates while three sit idle. This is the failure mode that destroys training throughput.

ECMP under synchronized collectives — hash polarization

ECMP doesn't fail mathematically — it fails when flows are synchronized
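A small simulation makes the law-of-large-numbers point concrete. It is a sketch under assumptions: SHA-256 stands in for the switch's hardware hash, all flows are the same size, addresses and source ports are random, and only the number of flows differs between the two runs. It does not model timing, only how few flows there are to spread.

```python
import hashlib
import random

ROCE_V2_UDP_PORT = 4791  # RoCEv2 uses this fixed UDP destination port


def ecmp_path(five_tuple, num_paths):
    """Pick a path by hashing the 5-tuple (SHA-256 stands in for the ASIC hash)."""
    digest = hashlib.sha256(repr(five_tuple).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_paths


def imbalance(num_flows, num_paths, rng):
    """Busiest link's load divided by the fair share, for equal-sized flows."""
    load = [0] * num_paths
    for _ in range(num_flows):
        flow = (
            f"10.0.{rng.randrange(256)}.{rng.randrange(1, 255)}",  # src IP
            f"10.1.{rng.randrange(256)}.{rng.randrange(1, 255)}",  # dst IP
            rng.randrange(49152, 65536),                           # src port
            ROCE_V2_UDP_PORT,                                      # dst port
            "udp",                                                 # protocol
        )
        load[ecmp_path(flow, num_paths)] += 1
    return max(load) / (num_flows / num_paths)


def mean_imbalance(num_flows, num_paths=4, trials=50, seed=0):
    """Average the imbalance over several independent sets of flows."""
    rng = random.Random(seed)
    return sum(imbalance(num_flows, num_paths, rng) for _ in range(trials)) / trials


many_small = mean_imbalance(num_flows=2000)  # diverse traffic
few_large = mean_imbalance(num_flows=8)      # one elephant flow per GPU
print(f"2000 flows: busiest link carries {many_small:.2f}x its fair share")
print(f"   8 flows: busiest link carries {few_large:.2f}x its fair share")
```

The hash is equally "fair" in both runs. What changes is that 8 elephant flows over 4 links rarely land 2-2-2-2, so the busiest link ends up well above its fair share.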

4. ECMP and link failures

When a spine link goes down, ECMP rehashes the affected flows onto the survivors. This is fine for TCP: the kernel retransmits. For RDMA RC (Reliable Connection), packets in flight on the now-dead path are lost, and rehashed packets can arrive out of order on a surviving path; the receiver's RNIC sees a PSN (packet sequence number) gap and sends a NACK. With go-back-N recovery, the sender then retransmits everything from the missing PSN onward, not just the lost packets. Multiply by every flow on the dead link and you get a short NACK storm.

ECMP under link failure — flows rehash, RDMA may NACK

Why AI fabrics minimise the impact of single-link failures with rail isolation + fast hardware retry
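The PSN-gap behaviour can be sketched as a toy receiver and sender. This is a simplified go-back-N model, assumed here for illustration: the function names are made up, PSNs start at 0, and real RNICs add timers, ACK coalescing, and in some cases selective retransmission.

```python
def receiver(arrivals, expected_psn=0):
    """Accept only in-order PSNs and report the first gap as a NAK."""
    delivered, nak_psn = [], None
    for psn in arrivals:
        if psn == expected_psn:
            delivered.append(psn)
            expected_psn += 1
        elif psn > expected_psn and nak_psn is None:
            nak_psn = expected_psn  # first missing PSN triggers the NAK
        # any other out-of-order packet is dropped
    return delivered, nak_psn


def go_back_n_resend(sent, nak_psn):
    """Sender side: resend everything from the NAKed PSN onward."""
    if nak_psn is None:
        return []
    return [psn for psn in sent if psn >= nak_psn]


sent = list(range(8))            # PSNs 0..7 in flight
lost_on_dead_link = {3, 4}       # dropped when the spine link failed
arrivals = [psn for psn in sent if psn not in lost_on_dead_link]

delivered, nak = receiver(arrivals)
resend = go_back_n_resend(sent, nak)

print(delivered)  # [0, 1, 2]
print(nak)        # 3
print(resend)     # [3, 4, 5, 6, 7]
```

Two packets were lost, but five are resent, because PSNs 5 to 7 arrived after the gap and were discarded. That amplification, repeated across every flow on the failed link, is the NACK storm.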

What hyperscalers do about it: rail isolation limits blast radius; adaptive routing and packet spraying (BlueField, Spectrum-X, Tomahawk-5) sidestep hash polarization; faster link-down detection (sub-millisecond LFI / BFD) and pre-computed FRR paths shrink the NACK window. Net result: a single spine link failure becomes a brief dip, not an outage.
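To see why packet spraying sidesteps the problem, compare per-flow hashing with per-packet distribution for the same elephant flows. This is a sketch under assumptions: spraying is modeled as plain round-robin, SHA-256 stands in for the hardware hash, and the flow names are invented. Real adaptive routing picks links using congestion signals and relies on the NIC to put packets back in order.

```python
import hashlib


def per_flow_load(flows, num_links):
    """Per-flow ECMP: every packet of a flow follows the link its ID hashes to."""
    load = [0] * num_links
    for flow_id, packets in flows.items():
        digest = hashlib.sha256(flow_id.encode()).digest()
        load[digest[0] % num_links] += packets
    return load


def sprayed_load(flows, num_links):
    """Per-packet spraying, modeled as round-robin across all links."""
    load = [0] * num_links
    next_link = 0
    for packets in flows.values():
        for _ in range(packets):
            load[next_link] += 1
            next_link = (next_link + 1) % num_links
    return load


# Eight equal elephant flows, 1000 packets each, over 4 spine links.
flows = {f"gpu{i}-allreduce": 1000 for i in range(8)}

print("per-flow ECMP:", per_flow_load(flows, 4))
print("sprayed:      ", sprayed_load(flows, 4))
```

Spraying always gives each link exactly 2000 packets here, because load is balanced per packet rather than per flow. The cost is out-of-order arrival, which is why it needs NIC support rather than being a switch-only change.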

The deeper details — pod sizing, super-spine, oversub math, mitigations for hash polarization — live in the three pages after this one.