AI Fabric Architecture | Lossless Network

→3.1 Understanding AI Fabric Architecture

What an AI training fabric IS — the components, the four-fabric model, and what makes it different from the traditional DC network you already know.

→3.2 Design Options

The four fabric design patterns — ROD, RUD, Scheduled, Multi-Planar — at a glance. A decision tree, a comparison matrix, and the multi-tenancy question.

→3.3 Rail-Optimized Design (ROD)

What "rails" mean in an AI fabric, why each GPU gets its own dedicated leaf, how this changes blast radius, and pod sizing.

→3.4 Switches for AI

The dominant switch vendors in AI fabrics and the network OS each runs — NVIDIA Spectrum-X, Arista, Cisco, Juniper/HPE, and white-box SONiC. Who builds the box, and what AI changes about a switch.

→3.5 Switch Silicon

The ASICs inside AI switches: Broadcom Tomahawk vs Jericho, NVIDIA Spectrum-4, Cisco Silicon One, Marvell Teralynx. The shallow-buffer vs deep-buffer divide that defines an AI fabric.

→3.6 NICs & DPUs

The RDMA NICs and DPUs at the host edge of an AI fabric: NVIDIA ConnectX/BlueField, Broadcom Thor, Intel E810, AWS EFA, AMD Pollara. What a NIC does in an AI fabric, how a NIC differs from a DPU, and what the Ultra Ethernet Consortium (UEC) brings next.

→3.7 Cluster Sizing & Cabling

Reference fabric designs from 1,024 to 100K GPUs. Switch radix math, transceiver and cable choices (OSFP, AOC, DAC), and a cable-labelling scheme that survives day-1 install.

→3.8 Master Reference

The whole chapter in one scroll. An interactive deep-dive covering Clos, fat-tree, ECMP, ECMP-with-RoCEv2, the seven ECMP failure modes, lossless mechanisms, and AI fabric design patterns — animated.