AI Fabric vs Traditional CLOS

You've designed leaf-spine networks for years. Web tier, database tier, multi-tenant DC — the spine-leaf CLOS is your friend. ECMP balances flows statistically. 4:1 oversubscription is normal. A dropped packet here or there is recoverable.

An AI fabric still uses spine-leaf. The topology shape is recognizable. But the design rules underneath flip in five important ways. Here they are.


What stays the same

The good news: most of what you know carries over.

  • Spine-leaf CLOS. Two-tier (leaf + spine) or three-tier (leaf + spine + super-spine). Every leaf connects to every spine. Familiar.
  • BGP underlays. Most AI fabrics run eBGP between leaves and spines (Meta, Microsoft, Oracle all do this). Same tooling, same conventions you've used.
  • ECMP routing. Equal-cost multi-path is still how packets fan out across spines. Same hash(5-tuple) behavior.
  • Fault domains. A pod is a pod. A rack is a rack. Lose a spine, you lose 1/N of the inter-pod bandwidth.
  • Underlay vs overlay. VXLAN over BGP-EVPN is still possible (less common for training fabrics, more common for management).

If you can troubleshoot a traditional CLOS, you can troubleshoot most of an AI fabric.


What changes

Five differences, in rough order of how often they'll bite you.

1. Oversubscription: 1:1, not 4:1

Traditional DC fabrics are typically 4:1 oversubscribed at the leaf (uplink capacity = 1/4 of downlink capacity). Reason: most servers idle most of the time; the rare burst doesn't fill the uplinks.

AI fabrics are 1:1 — non-oversubscribed. If a leaf has 32 × 400G downlinks to servers, it has 32 × 400G uplinks to spines. Every server can send at full line rate, simultaneously, into the fabric.

Why: during AllReduce, every GPU sends at full rate at the same time. There is no quiet time. Oversubscription means tail latency, and tail latency means idle GPUs.
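To make the arithmetic concrete, here is a minimal Python sketch that turns port counts into an oversubscription ratio. The AI-leaf numbers are the 32 × 400G example above; the traditional-leaf numbers are hypothetical, chosen only to show a 4:1 ratio.

```python
def oversub_ratio(down_ports: int, down_gbps: float,
                  up_ports: int, up_gbps: float) -> float:
    """Server-facing capacity divided by fabric-facing capacity.

    1.0 means non-oversubscribed; 4.0 means 4:1.
    """
    return (down_ports * down_gbps) / (up_ports * up_gbps)

print(oversub_ratio(32, 400, 8, 400))   # hypothetical 4:1 traditional leaf -> 4.0
print(oversub_ratio(32, 400, 32, 400))  # the AI leaf from the text         -> 1.0
```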

2. Zero tolerance for tail latency

In a traditional DC, p99 latency matters a little. In an AI fabric, p99.99 latency matters a lot. One slow link stalls thousands of GPUs.

You design for the worst link, not the average. This changes:

  • Buffer profiles. Tuned to absorb collective bursts without dropping.
  • PFC headroom. Sized so PFC fires before a drop, not after.
  • Telemetry. Per-flow microsecond-resolution latency, not minute averages.
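The percentile shift falls out of simple probability: a synchronized step is only as fast as its slowest link, so if each of N links independently lands in its slow tail with probability p, the step is slow with probability 1 - (1 - p)^N. The values of p and N below are illustrative, not measured data.

```python
# Probability that at least one link in a synchronized collective is slow.
# Illustrative numbers only: per-link tail probabilities vs. collective sizes.
def step_slow_prob(per_link_tail: float, links: int) -> float:
    return 1 - (1 - per_link_tail) ** links

for links in (8, 512, 8192):
    p99   = step_slow_prob(0.01,   links)   # each link slow 1% of the time
    p9999 = step_slow_prob(0.0001, links)   # each link slow 0.01% of the time
    print(f"{links:>5} links: p99-per-link -> {p99:.2%} slow steps, "
          f"p99.99-per-link -> {p9999:.2%}")
```

At 8,192 links, a 1% per-link tail makes essentially every step slow; even a 0.01% per-link tail stalls over half the steps. That is why the fabric is engineered to the worst link.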

3. RDMA-aware QoS

Your CLOS has QoS. It probably has CoS / DSCP classes for management, storage, control plane. An AI fabric adds two more layers:

  • PFC (Priority Flow Control) — pauses the sender per priority class, configured on the RoCE v2 traffic class. Required for lossless operation.
  • ECN (Explicit Congestion Notification) — marked at egress, tuned so DCQCN on the NIC dials back its rate before queues overflow.

Both of these were optional curiosities in a traditional DC. They're mandatory on an AI fabric. Section 9 covers the configuration in detail.
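One piece you can already reason about is PFC headroom sizing: after the switch emits a PAUSE, data already in flight keeps arriving, so headroom must cover roughly one round trip of bytes plus a frame on each end. The sketch below uses assumed values (cable length, MTU, line rate); it is back-of-the-envelope physics, not vendor guidance.

```python
# Back-of-the-envelope PFC headroom: bytes that can still arrive after the
# switch sends a PAUSE frame. Assumed values, not a vendor formula.
LINE_RATE_GBPS = 400
CABLE_M        = 100     # assumed leaf-to-server cable length
NS_PER_M       = 5       # ~5 ns/m propagation in fiber
MTU_BYTES      = 4096    # assumed RoCE MTU

def pfc_headroom_bytes(rate_gbps=LINE_RATE_GBPS, cable_m=CABLE_M, mtu=MTU_BYTES):
    bytes_per_ns = rate_gbps / 8        # 400 Gb/s = 400 bits/ns -> 50 bytes/ns
    rtt_ns = 2 * cable_m * NS_PER_M     # PAUSE travels out while data flows in
    in_flight = rtt_ns * bytes_per_ns
    # Plus: the sender may finish a frame already on the wire, and the PAUSE
    # itself can queue behind one full-size frame.
    return in_flight + 2 * mtu

print(f"{pfc_headroom_bytes():,.0f} bytes of headroom per lossless queue")
```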

4. Rail-optimized topology

This is the structural twist that makes AI fabrics look weird at first sight. Instead of one ToR aggregating all 8 NICs from a server, each NIC goes to its own dedicated leaf ("rail leaf").

An 8-GPU server has 8 NICs, each going to a different rail leaf — and the rail leaves are arranged so that GPU-0 on every server connects to "rail 0," GPU-1 on every server connects to "rail 1," and so on.
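A few lines of Python capture the wiring rule (server and leaf names here are invented): NIC i on every server terminates on rail leaf i, so each rail sees exactly one NIC per server.

```python
# Wiring rule for a rail-optimized pod: NIC i of every server -> rail leaf i.
# Server and leaf names are invented for illustration.
SERVERS = [f"gpu-server-{s:02d}" for s in range(4)]
RAILS = 8  # 8 GPUs per server, one NIC each -> 8 rail leaves

cabling = {
    (server, f"nic{rail}"): f"rail-leaf-{rail}"
    for server in SERVERS
    for rail in range(RAILS)
}

# Every server's NIC 3 terminates on the same leaf:
print({port: leaf for port, leaf in cabling.items() if port[1] == "nic3"})
```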

Next page covers this in detail.

5. Elephant flows break ECMP

ECMP works because flows are statistically independent. AI training flows are the opposite — synchronized, identical in shape, persistent for seconds. The 5-tuple hash collapses them onto hot spots: 6 of 8 elephant flows land on the same egress port while other uplinks sit unused.
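You can reproduce the polarization with a toy simulation: hash a handful of made-up RoCE 5-tuples onto 8 uplinks and count the pile-up. The flows and the hash are illustrative; any deterministic per-flow hash behaves the same way when there are only 8 flows to spread.

```python
import hashlib
from collections import Counter

# Toy ECMP: hash each flow's 5-tuple onto one of 8 uplinks. With only 8
# synchronized elephant flows, collisions are the norm, not the exception.
flows = [  # (src_ip, dst_ip, proto, src_port, dst_port) -- made-up RoCE v2 flows
    (f"10.0.0.{i}", f"10.0.1.{i}", "udp", 49152 + i, 4791) for i in range(8)
]

def ecmp_port(five_tuple, n_ports=8):
    digest = hashlib.md5(str(five_tuple).encode()).digest()
    return digest[0] % n_ports

load = Counter(ecmp_port(f) for f in flows)
print(load)  # lumpy: some ports carry several flows, others carry none
```

With 8 flows and 8 ports, the chance every flow lands on its own port is 8!/8^8, well under 1%; some uplink almost always carries two or three elephants while others idle.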

Page 7.3 covers this and what hyperscalers do about it (packet spraying, adaptive routing — recap from Section 1).


What you already know vs what's new

| You know | What's new |
| --- | --- |
| 4:1 oversubscription is normal | 1:1 — no oversub anywhere in the AI fabric |
| ECMP balances flows statistically | Synchronized collectives break that — hash polarization on elephants |
| QoS for storage / management traffic | PFC + ECN are mandatory for RoCE v2 |
| Average latency matters | Tail latency is the only latency that matters |
| One ToR per rack, aggregating all server NICs | Rail-optimized: each NIC has its own dedicated leaf |

Everything from the left column still works as the starting point. The right column is what this section teaches.


What you should remember

  • Spine-leaf is still spine-leaf. Most of your CLOS instincts apply.
  • 1:1 oversubscription, not 4:1. AI fabrics can't tolerate the tail-latency penalty of oversub.
  • PFC + ECN are mandatory. They're not optional features here.
  • Rail-optimized topology is the structural difference — each GPU's NIC goes to its own dedicated leaf.
  • Synchronized collectives break ECMP. Hash polarization on elephant flows is the chronic issue.

Next: Rail-Optimized Topology → what "rails" mean, why each GPU gets its own, and how this changes blast radius.