The curriculum

AI networking, distilled — for network engineers crossing from traditional DC fabrics into AI training and inference networks. Same fabric vocabulary, very different design rules.

New here? Start with Transport & Congestion Control — the design-space map for the two layers that make an AI fabric an AI fabric. Read it before going deep so you know where everything else sits.

Browse all sections

Phase 1 — The machine

  • AI Training Basics — what AI training actually does, the collectives that drive it, parallelism strategies, and the MFU (Model FLOPs Utilization) diagnosis ladder.
  • GPU & Server Hardware — GPU vs CPU, inside a GPU server (NVLink, NVSwitch, RDMA NICs, PCIe), the dominant vendors.

Phase 2 — The fabric

  • AI Fabric Architecture — the shape of the network around AI clusters. Spine-leaf with AI twists, rail-optimized topology, design pattern catalog, cluster sizing.
  • Load Balancing in AI Fabrics — why ECMP fails on AI training traffic and what to do about it. The four LB tiers (SLB, DLB, GLB, TELB) plus a live simulator.

Phase 3 — Making it lossless

Phase 4 — What rides on the wire

  • HPC Networking — animated map of how RDMA, InfiniBand, and RoCEv2 fit together. Watch this first.
  • RDMA — a technique, not a protocol. Kernel bypass, verbs, queue pairs, memory regions.
  • InfiniBand — the native RDMA fabric. Credit-based flow control, Subnet Manager. For reference and comparison.
  • RoCE v2 — InfiniBand transport over standard UDP/IP/Ethernet. The fabric this curriculum picks.
  • Communication Libraries — NCCL, RCCL, oneCCL. The libraries every training framework calls.

Phase 5 — Host & orchestration

Phase 6 — Build & operate

  • HPC Cluster Designs — the 15-layer cluster stack, plus five concrete provisioning designs (K8s + SR-IOV, bare metal + Slurm, K8s + physical NIC, bare metal + MPI, hybrid).
  • Building a Training Cluster — how to actually deploy. Bare metal, VM, container, Kubernetes, cloud.
  • Inference Networking — inference is a different network problem. Latency-critical, KV-cache movement, RAG, MCP.
  • Production Operations — what to monitor, common failure modes, 3 AM playbooks.
  • Cluster Build Guide — the practical step-by-step. BoM, fabric/host/k8s config, validation.
  • Life of an AI Job in Fabric — capstone. End-to-end walkthrough of a training job: submit → schedule → NCCL bootstrap → forward → AllReduce → checkpoint. Every concept from prior chapters in motion.

How this is built

  • Vendor-neutral. NVIDIA, Broadcom, Cisco, Arista, Juniper — evaluated on technical merit.
  • Source material. Public RFCs, IEEE standards, vendor docs, academic papers, home-lab and production experience.
  • Free. Apache 2.0 / CC BY 4.0 — share, remix, build on it.
  • Personal views. Not affiliated with any employer.

Modules are released one at a time, each polished before it ships, and announced on the blog. The blog is also where field notes, RFC walk-throughs, and incident write-ups live between releases.