Sound familiar?
After Lossless Network.

How it works.
5 steps. Theory → RCA.- 01
Read the theory. In your language.
Every concept bridged to networking you already know. No ML jargon without a translation.
ExamplePFC = backpressure on your switch ports. SR-IOV = VRF for NICs. Multus = secondary network attachments. NCCL ring AllReduce = a token bus, but for gradients. - 02
See the full picture.
The complete AI fabric, walked end to end. Annotated diagrams, every hop labeled.
ExampleA single AllReduce flow: GPU₀ → NIC queue pair → rail-leaf → spine → rail-leaf → NIC → GPU₁ on the next host. Every queue, every counter, every place a packet can stall. - 03
Watch the commands. See real output.
Recorded lab walk-throughs. Every command typed, every counter read, every output explained.
Examplerdma link add rxe0 type rxe netdev eth0 brings SoftRoCE up on Linux. Then ib_write_bw saturates 100G across two namespaces. Pause, rewind, copy the command into your own host when you have the gear. - 04
Watch it break.
Real failure modes, recorded and dissected. The wave, the counters, the timing.
ExampleA PFC storm starts on one GPU node. Pause frames flood every uplink in two seconds. Training jobs across the fabric stall. Watch it propagate, then watch the diagnosis. - 05
Read the RCA.
Incident write-ups. What broke, how it was caught, what fix shipped, what stayed broken.
ExampleWhy a 256-GPU job ran at 60% throughput for 11 hours — one ECN profile off by 3 KB on a single leaf. The full diagnosis trail, the patch, the re-deploy.
- 01Phase 01
The machine
What AI training does and what it runs on.
AI Training Basics — collectives, parallelism, MFU · GPU & Server Hardware — NVLink, NVSwitch, PCIe, NIC placement
- 02★Phase 02Course spine
The fabric
The shape of the network around AI clusters. Course spine.
AI Fabric Architecture — Clos, rail-optimised, hash polarisation · Life of an AI Job in Fabric — submit → AllReduce → checkpoint
- 03Phase 03
What rides on the wire
The protocols that move bytes between GPUs.
HPC Networking — animated map of the three · RDMA — kernel bypass, verbs, QPs, MRs · InfiniBand — the native RDMA fabric · RoCE v2 — BTH, RETH, ICRC, ECMP via UDP src port
- 04Phase 04
Making it lossless
The lossless trick — and the configs that actually work.
Transport & Congestion Control — design-space map, escalation ladder · Switch QoS — PFC, ECN, DCQCN, buffer profiles
- 05Phase 05
Host & orchestration
Plug the fabric into the GPU host.
Host Networking — SR-IOV, Multus, nvidia-peermem, NCCL · Linux for Network Engineers · Kubernetes for Network Engineers
- 06Phase 06
Build & operate
Build it, run it, fix it at 3 AM.
HPC Cluster Designs — the 15-layer stack + 5 provisioning patterns · Building a Training Cluster · Inference Networking · Production Operations · Cluster Build Guide — BoM, RCAs, runbooks
AI networking is being built right now.
Build the foundation while the field is still forming.