AI Fabric — Master Reference

The full reference for the AI fabric — every layer, every protocol, every failure mode. Sixteen interactive sections covering Clos and fat-tree fundamentals, ECMP and its seven failure modes, lossless Ethernet mechanics (PFC + DCQCN), switch ASIC and buffer architecture, and the design patterns hyperscalers use in production.

Topics covered in the guide

  1. Why AI needs a different fabric — what changes vs a traditional DC
  2. Clos network — the foundation
  3. Fat-tree — scaling the Clos
  4. ECMP — Equal Cost Multi-Path
  5. ECMP with RoCEv2 — why UDP source port is critical
  6. ECMP issues — 7 failure modes (polarisation, elephant flows, microbursts, …)
  7. ECMP solutions — adaptive routing, packet spraying, multiple QPs, and more
  8. Lossless fabric — why and how
  9. PFC — Priority Flow Control deep dive
  10. PFC issues — when lossless goes wrong (deadlock, pause storms)
  11. DCQCN — congestion control for RDMA
  12. QoS — traffic classification and queuing
  13. Switch ASIC + buffer architecture — how buffers really work
  14. AI fabric design patterns — what hyperscalers actually deploy
  15. Fabric monitoring + telemetry — what to watch
  16. Master reference — cheatsheet of values and defaults

How this fits the curriculum

This is the deep-reference companion to the lighter, animation-first Understanding AI Fabric page. If that one is the 5-minute visual quick-pick, this is the 45-minute reference you come back to.

Where to go from here