The curriculum
AI networking, distilled — for network engineers crossing from traditional DC fabrics into AI training and inference networks. Same fabric vocabulary, very different design rules.
New here? Start with Transport & Congestion Control — the design-space map for the two layers that make an AI fabric an AI fabric. Read it before going deep so you know where everything else sits.
Browse all sections
Phase 1 — The machine
- AI Training Basics — what AI training actually does, the collectives that drive it, parallelism strategies, and the MFU diagnosis ladder.
- GPU & Server Hardware — GPU vs CPU, inside a GPU server (NVLink, NVSwitch, RDMA NICs, PCIe), the dominant vendors.
Phase 2 — The fabric
- AI Fabric Architecture — the shape of the network around AI clusters. Spine-leaf with AI twists, rail-optimized topology, design pattern catalog, cluster sizing.
- Load Balancing in AI Fabrics — why ECMP fails on AI training traffic and what to do about it. The four LB tiers (SLB, DLB, GLB, TELB) plus a live simulator.
Phase 3 — Making it lossless
- Transport & Congestion Control — the design-space map. Why AI fabrics are different from TCP networks.
- Switch QoS — configure the switch for lossless RoCE v2. PFC, ECN, DCQCN, buffer profiles.
Phase 4 — What rides on the wire
- HPC Networking — animated map of how RDMA, InfiniBand, and RoCE v2 fit together. Watch this first.
- RDMA — a technique, not a protocol. Kernel bypass, verbs, queue pairs, memory regions.
- InfiniBand — the native RDMA fabric. Credit-based flow control, Subnet Manager. For reference and comparison.
- RoCE v2 — InfiniBand transport over standard UDP/IP/Ethernet. The fabric this curriculum picks.
- Communication Libraries — NCCL, RCCL, oneCCL. The libraries every training framework calls.
Phase 5 — Host & orchestration
- Host Networking — the host side. SR-IOV, Multus, nvidia-peermem, NCCL config, GPU + Network Operator stack.
- Linux for Network Engineers — the Linux side of an AI cluster, from a network engineer who knows IOS/Junos.
- Kubernetes for Network Engineers — k8s from the perspective of someone who's never touched it.
Phase 6 — Build & operate
- HPC Cluster Designs — the 15-layer cluster stack, plus five concrete provisioning designs (K8s + SR-IOV, bare metal + Slurm, K8s + physical NIC, bare metal + MPI, hybrid).
- Building a Training Cluster — how to actually deploy. Bare metal, VM, container, Kubernetes, cloud.
- Inference Networking — inference is a different network problem. Latency-critical, KV-cache movement, RAG, MCP.
- Production Operations — what to monitor, common failure modes, 3 AM playbooks.
- Cluster Build Guide — the practical step-by-step. BoM, fabric/host/k8s config, validation.
- Life of an AI Job in Fabric — capstone. End-to-end walkthrough of a training job: submit → schedule → NCCL bootstrap → forward → AllReduce → checkpoint. Every concept from prior chapters in motion.
How this is built
- Vendor-neutral. NVIDIA, Broadcom, Cisco, Arista, Juniper — evaluated on technical merit.
- Source material. Public RFCs, IEEE standards, vendor docs, academic papers, home-lab and production experience.
- Free. Apache 2.0 / CC BY 4.0 — share, remix, build on it.
- Personal views. Not affiliated with any employer.
Modules drop one at a time, polished, in the blog. That is also where field notes, RFC walk-throughs, and incident write-ups live between releases.