AI Fabric vs Traditional CLOS
You've designed leaf-spine networks for years. Web tier, database tier, multi-tenant DC — the spine-leaf CLOS is your friend. ECMP balances flows statistically. 4:1 oversubscription is normal. A dropped packet here or there is recoverable.
An AI fabric still uses spine-leaf. The topology shape is recognizable. But the design rules underneath flip in five important ways. Here they are.
What stays the same
The good news: most of what you know carries over.
- Spine-leaf CLOS. Two-tier (leaf + spine) or three-tier (leaf + spine + super-spine). Every leaf connects to every spine. Familiar.
- BGP underlays. Most AI fabrics run eBGP between leaves and spines (Meta, Microsoft, Oracle all do this). Same tooling, same conventions you've used.
- ECMP routing. Equal-cost multi-path is still how packets fan out across spines. Same `hash(5-tuple)` behavior.
- Fault domains. A pod is a pod. A rack is a rack. Lose a spine and you lose `1/N` of the inter-pod bandwidth.
- Underlay vs overlay. VXLAN over BGP-EVPN is still possible (less common for training fabrics, more common for management).
If you can troubleshoot a traditional CLOS, you can troubleshoot most of an AI fabric.
What changes
Five differences, in rough order of how often they'll bite you.
1. Oversubscription: 1:1, not 4:1
Traditional DC fabrics are typically 4:1 oversubscribed at the leaf (uplink capacity = 1/4 of downlink capacity). Reason: most servers idle most of the time; the rare burst doesn't fill the uplinks.
AI fabrics are 1:1 — non-oversubscribed. If a leaf has 32 × 400G downlinks to servers, it has 32 × 400G uplinks to spines. Every server can send at full line rate, simultaneously, into the fabric.
Why: during AllReduce, every GPU sends at full rate at the same time. There is no quiet time. Oversubscription means tail latency, and tail latency means idle GPUs.
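The arithmetic is worth seeing once. A minimal sketch, using the 32 × 400G leaf from above (the helper name is mine, and the model assumes perfectly fair sharing of the uplinks, ignoring hash imbalance):

```python
def per_server_gbps(servers_per_leaf, downlink_gbps, oversub_ratio):
    """Fair-share bandwidth each server gets when every server on the leaf
    transmits at line rate into the fabric (illustrative model; assumes
    perfect fair sharing and ignores hashing imbalance)."""
    downlink_total = servers_per_leaf * downlink_gbps
    uplink_total = downlink_total / oversub_ratio
    return min(downlink_gbps, uplink_total / servers_per_leaf)

# The 32 x 400G leaf above: traditional 4:1 vs AI-fabric 1:1.
print(per_server_gbps(32, 400, oversub_ratio=4))  # 100.0 -> each sender gets 1/4 rate
print(per_server_gbps(32, 400, oversub_ratio=1))  # 400.0 -> full line rate
```

At 4:1, a fully loaded leaf gives every GPU a quarter of its NIC speed during the exact phase (AllReduce) when it needs all of it.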
2. Zero tolerance for tail latency
In a traditional DC, p99 latency matters a little. In an AI fabric, p99.99 latency matters a lot. One slow link stalls thousands of GPUs.
You design for the worst link, not the average. This changes:
- Buffer profiles. Tuned to absorb collective bursts without dropping.
- PFC headroom. Sized so PFC fires before a drop, not after.
- Telemetry. Per-flow microsecond-resolution latency, not minute averages.
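A back-of-the-envelope model shows why a rare event becomes a chronic one at scale (the link count and slow-link probability below are illustrative, and links are assumed independent):

```python
def p_collective_delayed(p_slow_link, links_on_path):
    """Probability a synchronous collective hits at least one slow link,
    assuming its critical path crosses `links_on_path` independent links,
    each slow with probability `p_slow_link` (illustrative model)."""
    return 1 - (1 - p_slow_link) ** links_on_path

# A 1-in-10,000 slow-link event (roughly the p99.99 tail) barely registers
# on a single link -- but a job whose collectives span thousands of links
# hits it on a large fraction of iterations:
print(round(p_collective_delayed(1e-4, 1), 6))     # 0.0001
print(round(p_collective_delayed(1e-4, 4096), 3))  # 0.336
```

This is why the design target is the worst link: at fabric scale, "one in ten thousand" happens every few iterations, and every GPU in the collective waits for it.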
3. RDMA-aware QoS
Your CLOS has QoS. It probably has CoS / DSCP classes for management, storage, control plane. An AI fabric adds two more layers:
- PFC (Priority Flow Control) — pauses sender per priority class, configured on the RoCE v2 traffic class. Required for lossless.
- ECN (Explicit Congestion Notification) marking at egress — turned up so DCQCN on the NIC can dial back rate proactively.
Both of these were optional curiosities in a traditional DC. They're mandatory on an AI fabric. Section 9 covers the configuration in detail.
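For intuition on how ECN marking and the NIC interact, here is a toy model of a DCQCN-style reaction point. This is a sketch only, not the DCQCN algorithm from the spec: the class name, the EWMA gain, and the recovery step are illustrative.

```python
class ToyDcqcnSender:
    """Toy model of a DCQCN-style reaction point (illustrative constants,
    not the full algorithm): ECN-marked rounds cut the send rate
    multiplicatively, clean rounds recover additively."""

    def __init__(self, line_rate_gbps=400.0):
        self.line_rate = line_rate_gbps
        self.rate = line_rate_gbps
        self.alpha = 1.0   # congestion estimate in [0, 1]
        self.g = 1 / 16    # EWMA gain for alpha (illustrative)

    def on_round(self, ecn_marked):
        # Update the congestion estimate, then react to it.
        self.alpha = (1 - self.g) * self.alpha + (self.g if ecn_marked else 0.0)
        if ecn_marked:
            self.rate *= 1 - self.alpha / 2                   # multiplicative decrease
        else:
            self.rate = min(self.line_rate, self.rate + 5.0)  # additive recovery
        return self.rate

sender = ToyDcqcnSender()
marked = sender.on_round(ecn_marked=True)      # 400 -> 200: sharp cut on a mark
recovered = sender.on_round(ecn_marked=False)  # 200 -> 205: slow recovery
print(marked, recovered)
```

The asymmetry is the point: ECN marks slow senders down before queues fill, so PFC stays the backstop rather than the primary congestion control.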
4. Rail-optimized topology
This is the structural twist that makes AI fabrics look weird at first sight. Instead of one ToR aggregating all 8 NICs from a server, each NIC goes to its own dedicated leaf ("rail leaf").
An 8-GPU server has 8 NICs, each going to a different rail leaf — and the spines are arranged so that GPU-0 on every server connects to "rail 0," GPU-1 on every server connects to "rail 1," and so on.
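The wiring rule above can be stated in a few lines. A sketch with hypothetical names (`rail_wiring`, `rail-leaf-N`):

```python
def rail_wiring(num_servers, nics_per_server=8):
    """Rail-optimized cabling rule (hypothetical helper): NIC i on every
    server cables to the same dedicated leaf, "rail i", instead of all
    NICs from one server landing on a single ToR."""
    return {(server, nic): f"rail-leaf-{nic}"
            for server in range(num_servers)
            for nic in range(nics_per_server)}

wiring = rail_wiring(num_servers=4)
# Every server's GPU-0 NIC shares one leaf; one server's 8 NICs fan out
# across all 8 rails.
print({wiring[(s, 0)] for s in range(4)})       # {'rail-leaf-0'}
print(len({wiring[(0, n)] for n in range(8)}))  # 8
```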
Next page covers this in detail.
5. Elephant flows break ECMP
ECMP works because flows are statistically independent. AI training flows are the opposite — synchronized, identical in shape, persistent for seconds. The 5-tuple hash collapses into hot spots: six of eight elephant flows can land on the same egress port while other uplinks sit idle.
Page 7.3 covers this and what hyperscalers do about it (packet spraying, adaptive routing — recap from Section 1).
What you already know vs what's new
| You know | What's new |
|---|---|
| 4:1 oversubscription is normal | 1:1 — no oversub anywhere in the AI fabric |
| ECMP balances flows statistically | Synchronized collectives break that — hash polarization on elephants |
| QoS for storage / management traffic | PFC + ECN are mandatory for RoCE v2 |
| Average latency matters | Tail latency is the only latency that matters |
| One ToR per rack, aggregating all server NICs | Rail-optimized: each NIC has its own dedicated leaf |
Everything from the left column still works as the starting point. The right column is what this section teaches.
What you should remember
- Spine-leaf is still spine-leaf. Most of your CLOS instincts apply.
- 1:1 oversubscription, not 4:1. AI fabrics can't tolerate the tail-latency penalty of oversub.
- PFC + ECN are mandatory. They're not optional features here.
- Rail-optimized topology is the structural difference — each GPU's NIC goes to its own dedicated leaf.
- Synchronized collectives break ECMP. Hash polarization on elephant flows is the chronic issue.
Next: Rail-Optimized Topology → — what "rails" mean, why each GPU gets its own, and how this changes blast radius.