AI Fabric — Master Reference
The full reference for the AI fabric — every layer, every protocol, every failure mode. Sixteen interactive sections covering Clos and fat-tree fundamentals, ECMP and its seven failure modes, lossless Ethernet mechanics (PFC + DCQCN), switch ASIC and buffer architecture, and the design patterns hyperscalers use in production.
Topics covered in the guide
- Why AI needs a different fabric — what changes vs a traditional DC
- Clos network — the foundation
- Fat-tree — scaling the Clos
- ECMP — Equal Cost Multi-Path
- ECMP with RoCEv2 — why UDP source port is critical
- ECMP issues — 7 failure modes (polarisation, elephant flows, microbursts, …)
- ECMP solutions — adaptive routing, packet spraying, multiple QPs, and more
- Lossless fabric — why and how
- PFC — Priority Flow Control deep dive
- PFC issues — when lossless goes wrong (deadlock, pause storms)
- DCQCN — congestion control for RDMA
- QoS — traffic classification and queuing
- Switch ASIC + buffer architecture — how buffers really work
- AI fabric design patterns — what hyperscalers actually deploy
- Fabric monitoring + telemetry — what to watch
- Master reference — cheatsheet of values and defaults
How this fits the curriculum
This is the deep-reference companion to the lighter, animation-first Understanding AI Fabric page. If that page is the 5-minute visual overview, this is the 45-minute reference you come back to.
Where to go from here
- Fabric Load Balancing & Link Failover — the failover companion guide
- RoCE v2 — wire-level packet anatomy + PFC/ECN/DCQCN deep dive
- Switch QoS — actual switch configs for lossless RoCE