Understanding AI Fabric Architecture

An AI training fabric is the network that moves AllReduce traffic between thousands of GPUs. At first glance it looks like a spine-leaf Clos, and structurally it is one, but the way the components are wired, the bandwidth per host, and the operational rules underneath are different from any data-center fabric you've designed before. This page is the conceptual map of one.

At a glance

Figure: What is an AI fabric (comprehensive infographic).

Left two-thirds: a pod of an AI training cluster showing 4 GPU servers, each with 8 H100 GPUs and 8 ConnectX-7 NICs, and an NVLink mesh inside each server at 900 GB/s (stays inside the box). Above the servers, 8 rail leaves arranged in a row, each carrying the same-index NICs from every server (NIC 0 of all servers to Rail Leaf 0, and so on: the rail-optimized topology). Above the rail leaves, 4 spine switches connected in a non-blocking 1:1 Clos pattern, with a faded super-spine tier labeled "to other pods." Annotations call out 400 Gbps ConnectX-7 RoCE v2 NICs, 32×400G rail-leaf ports with ECMP plus adaptive routing, lossless PFC plus ECN with sub-microsecond tail latency, and 3.2 Tbps host fan-out (8 NICs at 400G each).

Right third: four color-coded fabric domain cards.

  • Back-end/Compute Fabric (cyan): carries AllReduce gradients on RoCE v2, 8 NICs × 400 Gbps = 3.2 Tbps per host, dedicated spine-leaf.
  • Front-end/Management Fabric (indigo): Kubernetes/Slurm orchestration, monitoring, 2× dual-port 10/25 Gbps Ethernet, TCP/IP on standard DC switches.
  • Storage Fabric (amber): dataset reads and checkpoint writes, 100/200 Gbps NICs, separate VRF or VLAN, never on the RoCE fabric.
  • Out-of-Band/OOB (grey): BMC/IPMI/Redfish, 1 GbE RJ45 always-on, the lights-out lifeline.

Bottom band: "What makes an AI fabric different from a traditional DC fabric."

  • "4:1 oversubscription is normal" becomes "1:1 non-blocking at every layer."
  • "p99 latency is the SLO" becomes "p99.99 tail latency dominates."
  • "One ToR aggregates all NICs" becomes "rail-optimized, with each NIC on its own leaf."
  • "A dropped packet retries" becomes "zero loss via PFC + ECN."

The rest of this page is a guided tour of that diagram. Skim the image first; the prose then walks each piece.


1. The components — a guided tour

Read the diagram bottom to top.

GPU servers (bottom row). Each box is a single AI training host: typically 8 H100 GPUs on an HGX baseboard plus 8 ConnectX-7 NICs, one NIC per GPU. Inside the chassis, the 8 GPUs talk to each other over an NVLink mesh at 900 GB/s per GPU through 4 NVSwitch chips. That traffic stays inside the box; it never reaches the fabric you're about to design.

Rail leaves (middle row). This is the structural break from a traditional DC. Instead of one ToR aggregating all 8 NICs on a server, each NIC lands on its own dedicated leaf, a "rail leaf." NIC 0 from every server connects to Rail Leaf 0; NIC 1 from every server connects to Rail Leaf 1; and so on. 8 NICs per server means 8 separate rail fabrics, each operationally independent. Same-index GPUs within a pod are exactly one switch hop apart by design.
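The rail rule above can be sketched in a few lines. This is a minimal illustration, not a cabling tool: the function names `rail_wiring` and `same_leaf` are hypothetical, servers and NICs are identified by plain integer indices, and the 4-server pod mirrors the diagram rather than a real deployment.

```python
from collections import defaultdict

NICS_PER_SERVER = 8  # one NIC per GPU, as described in the text


def rail_wiring(num_servers: int, nics_per_server: int = NICS_PER_SERVER):
    """Map each rail leaf index to the (server, nic) ports cabled to it."""
    leaves = defaultdict(list)
    for server in range(num_servers):
        for nic in range(nics_per_server):
            # Rail rule: NIC k of every server lands on rail leaf k.
            leaves[nic].append((server, nic))
    return dict(leaves)


def same_leaf(a, b) -> bool:
    """True if two (server, nic) endpoints share a rail leaf (one switch hop)."""
    return a[1] == b[1]


wiring = rail_wiring(num_servers=4)
print(len(wiring))   # number of rail leaves
print(wiring[0])     # every server's NIC 0 sits on rail leaf 0
print(same_leaf((0, 3), (2, 3)), same_leaf((0, 3), (2, 4)))
```

Same-index endpoints on different servers share a leaf; endpoints with different NIC indices do not, so their traffic has to cross the spine tier.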

Spine (top row). Above the rail leaves sits a non-blocking Clos. Every rail leaf connects to every spine at 1:1 (no oversubscription) end to end, so there is no designed-in bottleneck anywhere on the back-end path. Once a cluster outgrows a single pod (roughly 1K GPUs), a super-spine tier stitches multiple pods together.

The whole stack has one job: move AllReduce traffic between same-index GPUs with sub-microsecond tail latency and zero packet loss.


2. The four fabrics in one cluster

The diagram's right column makes this explicit: one AI training cluster has four physically separate networks, not one. They share the same datacenter floor but never share switches.

Fabric | Carries | Speed | Switches
Back-end / Compute (cyan) | AllReduce, gradients · RoCE v2 · the path this whole curriculum is about | 400 Gbps per NIC · 3.2 Tbps host fan-out | Dedicated spine-leaf, rail-optimized
Front-end / Management (indigo) | Kubernetes/Slurm, monitoring scrape, log shipping · TCP/IP | 10–25 Gbps, 2× dual-port per host | Standard DC switches
Storage (amber) | Dataset reads, checkpoint writes | 100/200 Gbps | Separate VRF / VLAN, never on the RoCE fabric
Out-of-Band (OOB) (grey) | BMC, IPMI, Redfish · lights-out management | 1 GbE always-on | Separate small management switches

The rule: back-end and front-end never share a switch. The reason is microseconds — a storage burst or a monitoring scrape mixed onto the RoCE fabric eats headroom the AllReduce was counting on, and tail latency blows up. Vendors enforce this with separate ToRs.


3. What makes it different from a traditional DC fabric

The bottom band of the diagram is the executive summary. Same Clos shape, four rules flipped:

  • 4:1 oversubscription is normal in the DC you know. 1:1 non-blocking at every layer is the AI fabric rule.
  • p99 latency is the DC SLO. p99.99 tail latency is what dominates an AI fabric — one slow link stalls thousands of GPUs.
  • One ToR aggregates all server NICs in a traditional design. Rail-optimized spreads each NIC to its own leaf.
  • A dropped packet retries on a TCP fabric. Zero loss is mandatory on an AI fabric — RDMA has no graceful retransmit.
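A short calculation shows why the tail percentile matters more as the cluster grows. The sketch below assumes every flow in a collective step is slow independently with the same probability, which is a simplification (the flows in a real AllReduce are correlated), and the function name `p_step_hits_tail` is made up for this example.

```python
def p_step_hits_tail(num_flows: int, percentile: float) -> float:
    """Probability that at least one of num_flows independent flows lands
    beyond the given latency percentile during one collective step."""
    p_slow = 1.0 - percentile / 100.0
    return 1.0 - (1.0 - p_slow) ** num_flows


for pct in (99.0, 99.99):
    for n in (8, 1024, 8192):
        print(f"p{pct}: {n:>5} flows -> {p_step_hits_tail(n, pct):.3f}")
```

Because a synchronous step finishes only when its slowest flow finishes, a latency that a single flow sees 1% of the time is close to a certainty at a thousand flows. Under this model, even the p99.99 latency is hit in a large share of steps at 8K flows.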

The rest of this chapter unpacks why each rule flipped and how you build for it.


4. The same view, interactive

If you want to flip between "traditional DC" and "AI fabric" rule by rule, the embedded component below does exactly that: same Clos shape, four rules animated.

A traditional DC fabric and an AI fabric look identical from across the room — spine-leaf, BGP underlay, ECMP. Up close, four design rules flip. Each animation below shows one of them.

What stays the same, what changes

Element | Traditional DC | AI Fabric
Topology | Spine-leaf Clos | Spine-leaf Clos ✓ same shape
Underlay | eBGP | eBGP ✓ same
Routing | ECMP (5-tuple hash) | ECMP ✓ same, but the failure modes change
Oversubscription | 4:1 typical | 1:1 non-blocking
NICs per server | 1–2 bonded | 8, one per GPU rail
Traffic shape | Many short flows, statistically independent | Few huge elephant flows, synchronized
Tolerance for loss | TCP recovers | RDMA NACK = idle GPUs

1. Oversubscription

Traditional DC fabrics oversubscribe: a leaf with 32 server-facing ports might have only 8 spine uplinks (4:1), because most servers are idle most of the time. AI workloads are the opposite: every NIC pushes line rate, simultaneously. That is why AI fabrics buy enough spine ports for 1:1. The spend is justified because tail latency means idle GPUs, and idle GPUs are wasted money.

[Interactive figure: "Oversubscription: same load, different fates." A traditional DC leaf at 4:1 oversubscription (8 spine links, 3.2 Tbps up; 32 server links, 12.8 Tbps down) next to an AI fabric leaf at 1:1 non-blocking (32 spine links, 12.8 Tbps up; 32 server links, 12.8 Tbps down), each with a queue-depth gauge while the offered load ramps. Caption: same offered load, different oversub ratio.]
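The ratios in this section are simple port arithmetic. The sketch below reproduces them, assuming every port runs at 400 Gbps as in the diagram; the helper names `leaf_oversubscription` and `host_fanout_tbps` are illustrative only.

```python
def leaf_oversubscription(server_ports: int, spine_ports: int, gbps: int = 400):
    """Return (down_tbps, up_tbps, ratio) for a single leaf switch."""
    down_gbps = server_ports * gbps
    up_gbps = spine_ports * gbps
    return down_gbps / 1000, up_gbps / 1000, down_gbps / up_gbps


def host_fanout_tbps(nics: int = 8, gbps: int = 400) -> float:
    """Aggregate back-end bandwidth of one host in Tbps."""
    return nics * gbps / 1000


traditional = leaf_oversubscription(server_ports=32, spine_ports=8)
ai_fabric = leaf_oversubscription(server_ports=32, spine_ports=32)

print("traditional DC leaf:", traditional)  # down Tbps, up Tbps, ratio
print("AI fabric leaf:     ", ai_fabric)
print("host fan-out (Tbps):", host_fanout_tbps())
```

A ratio above 1.0 means the leaf can accept more traffic from servers than it can forward to the spines, which is only safe when the servers rarely transmit at the same time.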

2. Rail-optimized topology

A traditional server has one or two NICs, both on the same ToR. An AI server has 8 NICs, each going to its own dedicated leaf, one per GPU. Each "rail" has its own leaf and its own cabling. A link failure on rail 4 directly affects only GPU 4's network path; the other 7 rails keep carrying traffic. That's the blast-radius story.

[Interactive figure: "Rail-optimized topology: one GPU per rail." An AI server with 8 GPUs and 8 RNICs, GPU 1 through GPU 8, each cabled to its own leaf, Leaf 1 through Leaf 8, forming 8 independent rails. Caption: each NIC has its own dedicated leaf, so failure stays contained.]

3. Hash polarization

ECMP hashes the 5-tuple to spread flows evenly. With diverse traffic (web, API, DB), the law of large numbers makes this work great. With synchronized AllReduce — every GPU sending identically shaped flows at the same instant — many flows hash to the same path. One spine link saturates while three sit idle. This is the failure mode that destroys training throughput.

[Interactive figure: "ECMP under synchronized collectives: hash polarization." Leaf A and Leaf B connected through Spine 1 to Spine 4, each starting at 25% load, animated under two very different traffic shapes. Caption: ECMP doesn't fail mathematically; it fails when flows are synchronized.]
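A small simulation makes the contrast concrete. This is a sketch under stated assumptions: SHA-256 stands in for the switch ASIC's hash (real hardware uses its own, typically CRC-style, functions), every flow is assumed to carry equal load, the addresses and source ports are invented, and only the RoCE v2 destination UDP port 4791 is a real constant.

```python
import hashlib
from collections import Counter


def ecmp_path(five_tuple, num_paths: int) -> int:
    """Pick an uplink by hashing the 5-tuple (stand-in for an ASIC hash)."""
    digest = hashlib.sha256(repr(five_tuple).encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_paths


def spread(flows, num_paths: int = 4):
    """Count flows per uplink; each flow is assumed to carry equal load."""
    counts = Counter(ecmp_path(f, num_paths) for f in flows)
    return [counts.get(p, 0) for p in range(num_paths)]


def imbalance(loads) -> float:
    """Busiest uplink relative to a perfectly even split (1.0 is ideal)."""
    return max(loads) / (sum(loads) / len(loads))


# Many small, diverse flows: 10,000 distinct source ports (protocol 17 = UDP).
diverse = [("10.0.0.1", "10.0.1.1", 17, 1024 + i, 4791) for i in range(10_000)]
# A handful of elephant flows, as in a synchronized collective.
elephants = [(f"10.0.0.{i}", f"10.0.1.{i}", 17, 49152, 4791) for i in range(8)]

for name, flows in (("diverse", diverse), ("elephants", elephants)):
    loads = spread(flows)
    print(name, loads, round(imbalance(loads), 2))
```

With thousands of flows the per-spine counts land close to an even split. With eight flows over four spines, an even split is just one of many possible outcomes, and each collision puts two or more line-rate flows on one link.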

4. ECMP and link failures

When a spine link goes down, ECMP rehashes the affected flows onto the survivors. This is fine for TCP, because the kernel retransmits. For RDMA RC, packets in flight on the now-dead path are lost while later packets of the same flow arrive over a surviving path; the receiver's RNIC sees a PSN gap and sends a NACK. With go-back-N recovery, the sender then retransmits the whole outstanding window from the missing PSN onward. Multiply by every flow on the dead link and you get a short NACK storm.

[Interactive figure: "ECMP under link failure: flows rehash, RDMA may NACK." Leaf A and Leaf B connected through Spine 1 to Spine 4, each starting at 25% load, animated as a spine link drops mid-training. Caption: why AI fabrics minimise single-link failures with rail isolation + fast hardware retry.]
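The PSN-gap behavior can be sketched as a toy receiver. It assumes classic go-back-N recovery, where the receiver accepts only the next expected PSN and discards everything after a gap until the sender resends from the missing packet. Real RNICs differ in detail and some support selective retransmission, so treat the `receive` function and the arrival sequence as illustrative.

```python
def receive(psn_stream, first_psn: int = 0):
    """Simulate an in-order receiver.

    Returns (delivered, nacks, discarded): PSNs accepted in order, the PSNs
    requested via NACK, and the count of packets dropped after a gap.
    """
    expected = first_psn
    delivered, nacks = [], []
    discarded = 0
    nack_outstanding = False
    for psn in psn_stream:
        if psn == expected:
            delivered.append(psn)
            expected += 1
            nack_outstanding = False
        else:
            discarded += 1  # out-of-order or duplicate packet is not accepted
            if psn > expected and not nack_outstanding:
                nacks.append(expected)  # ask the sender to resume from here
                nack_outstanding = True
    return delivered, nacks, discarded


# PSNs 3 and 4 were in flight on the failed link and never arrive.
# PSNs 5..7 arrive over a surviving path, then the sender goes back to 3.
arrivals = [0, 1, 2, 5, 6, 7, 3, 4, 5, 6, 7]
delivered, nacks, discarded = receive(arrivals)
print(delivered, nacks, discarded)
```

Two lost packets cost five retransmissions here, because the three packets that arrived safely after the gap were thrown away. That amplification, repeated across every flow on the failed link, is the NACK storm.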

What hyperscalers do about it: rail isolation limits blast radius; adaptive routing and packet spraying (BlueField, Spectrum-X, Tomahawk-5) sidestep hash polarization; faster link-down detection (sub-millisecond LFI / BFD) and pre-computed FRR paths shrink the NACK window. Net result: a single spine link failure becomes a brief dip, not an outage.

The deeper details — pod sizing, super-spine, oversub math, mitigations for hash polarization — live in the three pages after this one.


5. Where this chapter goes

The image and this page are the conceptual map. The remaining pages each take one piece of it deep:

  • Design Options — the four design patterns (ROD, RUD, Scheduled, Multi-Planar) and how to pick.
  • Rail-Optimized Design — deep dive on the default pattern. Pod sizing, blast radius, NCCL ring isolation.
  • Cluster Sizing & Cabling — reference designs at 256 → 100K GPUs, switch radix, optics, day-1 install reality.

The next chapter — 04. Load Balancing in AI Fabrics — is the dedicated deep-dive on how the back-end fabric actually moves bytes once this design is in place. ECMP, hash polarization, DLB, GLB, TELB, and a live simulator.


💡 What you should remember

🏗️ An AI fabric IS spine-leaf Clos
The topology shape is familiar. Leaves, spines, super-spines, eBGP underlay: all there.

🛤️ Rail-optimized is the structural break
Each GPU's NIC lands on its own dedicated leaf. 8 NICs per host = 8 separate rail fabrics.

🧵 Four fabrics, one cluster
Back-end (RoCE) · Front-end (mgmt) · Storage · OOB. Never share switches.

🚦 Back-end carries AllReduce; nothing else
3.2 Tbps host fan-out, 1:1 non-blocking, lossless. The other three fabrics exist so the back-end stays clean.

⏱️ p99.99 tail latency is the only latency
One slow link stalls thousands of GPUs. Design for the worst link, not the average.
