Lossless Network

Sound familiar?

Your company is building GPU clusters. You're not in the room.

ML teams talk about NCCL, RDMA, PFC storms. You nod and Google later.

AI networking roles are opening everywhere. You don't qualify — yet.

After Lossless Network.

[Figure: From Network Engineer to AI Network Engineer. The left side shows a traditional network engineer surrounded by BGP, OSPF, leaf-spine, copper and fiber cables, switches and routers. The right side shows the same engineer surrounded by GPU servers, NVLink, ConnectX-8 NICs, SR-IOV virtual functions, Multus, Kubernetes pods, an NCCL AllReduce ring, a lossless RoCEv2 fabric, Prometheus, Grafana, and the Volcano scheduler.]

From left to right: from "what's that?" to "I built that."

"What's RDMA?" → Walk verbs · queue pairs · memory regions through NCCL

"Why is training slow?" → Diagnose PFC storms, ECN misconfig, NCCL timeouts

"I don't know Kubernetes" → Deploy Multus pods with RDMA interfaces

"Can't contribute to AI fabric design" → Design rail-optimized topology for 1000+ GPUs

Life of a Training Job.

WATCH FIRST · 10 SCENES

From sbatch on a laptop to a packet landing in a remote GPU's HBM, end to end. Every layer — SLURM · NCCL · libibverbs · NIC hardware · RoCEv2 headers · ToR · spine · ECN · PFC · RKEY · DMA — animated.

Python → RDMA · Ring AllReduce · RoCEv2 packet build · ECN · PFC · QP setup · Lost packet recovery

▶ Start the journey · ~4 min · interactive

How it works.

5 steps. Theory → RCA.

01
Read the theory. In your language.
Every concept bridged to networking you already know. No ML jargon without a translation.
Example: PFC = backpressure on your switch ports. SR-IOV = VRF for NICs. Multus = secondary network attachments. NCCL ring AllReduce = a token bus, but for gradients.
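To make the "token bus, but for gradients" analogy concrete, here is a minimal pure-Python sketch of a ring AllReduce (sum). It is an illustration of the general algorithm only, not NCCL's implementation: the function name `ring_allreduce` and the in-memory "send" are assumptions made for the example, and real traffic would move over RDMA between GPUs.

```python
def ring_allreduce(vectors):
    """Simulate a ring AllReduce (sum) over one equal-length vector per rank.

    Illustrative only: ranks are list indices and a "send" is a list copy.
    """
    n = len(vectors)
    size = len(vectors[0])
    assert size % n == 0, "sketch assumes the vector splits evenly into n chunks"
    chunk = size // n
    data = [list(v) for v in vectors]  # copy so the caller's input is untouched

    def span(c):
        # Index range covered by chunk c.
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1: reduce-scatter. In step s, rank r passes chunk (r - s) mod n
    # to its right-hand neighbour, which adds it to its own copy.
    for s in range(n - 1):
        sends = [(r, (r - s) % n) for r in range(n)]
        payloads = [[data[r][i] for i in span(c)] for r, c in sends]
        for (r, c), payload in zip(sends, payloads):
            dst = (r + 1) % n
            for i, v in zip(span(c), payload):
                data[dst][i] += v

    # Rank r now holds the fully reduced chunk (r + 1) mod n.
    # Phase 2: all-gather. The finished chunks circulate around the ring
    # and overwrite the stale copies.
    for s in range(n - 1):
        sends = [(r, (r + 1 - s) % n) for r in range(n)]
        payloads = [[data[r][i] for i in span(c)] for r, c in sends]
        for (r, c), payload in zip(sends, payloads):
            dst = (r + 1) % n
            for i, v in zip(span(c), payload):
                data[dst][i] = v

    return data
```

Each rank only ever talks to its right-hand neighbour, and every link carries one chunk per step, which is why a single slow or paused link stalls the whole ring.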
02
See the full picture.
The complete AI fabric, walked end to end. Annotated diagrams, every hop labeled.
Example: A single AllReduce flow: GPU₀ → NIC queue pair → rail-leaf → spine → rail-leaf → NIC → GPU₁ on the next host. Every queue, every counter, every place a packet can stall.
03
Watch the commands. See real output.
Recorded lab walk-throughs. Every command typed, every counter read, every output explained.
Example: rdma link add rxe0 type rxe netdev eth0 brings SoftRoCE up on Linux. Then ib_write_bw saturates 100G across two namespaces. Pause, rewind, copy the command into your own host when you have the gear.
04
Watch it break.
Real failure modes, recorded and dissected. The wave, the counters, the timing.
Example: A PFC storm starts on one GPU node. Pause frames flood every uplink in two seconds. Training jobs across the fabric stall. Watch it propagate, then watch the diagnosis.
05
Read the RCA.
Incident write-ups. What broke, how it was caught, what fix shipped, what stayed broken.
Example: Why a 256-GPU job ran at 60% throughput for 11 hours: one ECN profile off by 3 KB on a single leaf. The full diagnosis trail, the patch, the re-deploy.
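For readers who have not met an ECN profile yet, the sketch below shows the usual WRED-style marking curve that such a profile describes: no marking below a minimum queue depth, a linear ramp up to a maximum probability, and marking of every packet above the maximum threshold. The function name and all threshold values are hypothetical, chosen only for illustration. They are not the values from the incident above, and exact behaviour at the thresholds varies by switch vendor.

```python
def ecn_mark_probability(queue_bytes, kmin, kmax, pmax):
    """Probability that a packet is ECN-marked at a given queue depth.

    Assumed WRED-style profile: 0 at or below kmin, linear ramp to pmax
    just under kmax, every packet marked at or above kmax.
    """
    if queue_bytes <= kmin:
        return 0.0
    if queue_bytes >= kmax:
        return 1.0
    return pmax * (queue_bytes - kmin) / (kmax - kmin)


# Hypothetical profiles: the "drifted" leaf starts marking 3 KB later.
KB = 1024
intended = dict(kmin=150 * KB, kmax=1500 * KB, pmax=0.1)
drifted = dict(kmin=153 * KB, kmax=1500 * KB, pmax=0.1)

for depth_kb in (151, 152, 153, 160):
    depth = depth_kb * KB
    print(
        depth_kb,
        ecn_mark_probability(depth, **intended),
        ecn_mark_probability(depth, **drifted),
    )
```

The point of the comparison is that one leaf with a shifted threshold signals congestion later than its peers, so senders behind it back off on a different schedule from the rest of the fabric.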

The path.

5 phases. AI → Fabric → Host → Ops → Capstone.

Phase 01
The machine
What AI training does and what it runs on.
AI Training Basics — collectives, parallelism, MFU · GPU & Server Hardware — NVLink, NVSwitch, PCIe, NIC placement
Phase 02 ★ Course spine
The fabric
The shape of the network around AI clusters. Course spine.
AI Fabric Architecture — Clos, rail-optimised, hash polarisation · Life of an AI Job in Fabric — submit → AllReduce → checkpoint
Phase 03
What rides on the wire
The protocols that move bytes between GPUs.
HPC Networking — animated map of the three · RDMA — kernel bypass, verbs, QPs, MRs · InfiniBand — the native RDMA fabric · RoCE v2 — BTH, RETH, ICRC, ECMP via UDP src port
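"ECMP via UDP src port" can be sketched in a few lines. RoCEv2 carries RDMA traffic in UDP with destination port 4791, so the source port is the one field a sender can vary to spread flows across equal-cost uplinks. The hash below (CRC32 over the 5-tuple) is a stand-in chosen for illustration, since real switch hash functions are vendor-specific, and the function name and addresses are hypothetical.

```python
import socket
import struct
import zlib

ROCEV2_UDP_DST_PORT = 4791  # UDP destination port used by RoCEv2
IPPROTO_UDP = 17


def ecmp_uplink(src_ip, dst_ip, src_port, n_uplinks,
                dst_port=ROCEV2_UDP_DST_PORT):
    """Pick an uplink index by hashing the flow's 5-tuple.

    CRC32 is an assumption standing in for the switch's real hash.
    """
    key = (
        socket.inet_aton(src_ip)
        + socket.inet_aton(dst_ip)
        + struct.pack("!BHH", IPPROTO_UDP, src_port, dst_port)
    )
    return zlib.crc32(key) % n_uplinks


# Same two hosts, same destination port: only the UDP source port differs.
for src_port in (49152, 49153, 49154, 49155):
    print(src_port, ecmp_uplink("10.0.0.1", "10.0.1.1", src_port, n_uplinks=4))
```

Because every other field is identical between two GPUs' NICs, flows whose source ports hash to the same uplink will share it. That is the collision behaviour the fabric modules come back to under hash polarisation.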
Phase 04
Making it lossless
The lossless trick — and the configs that actually work.
Transport & Congestion Control — design-space map, escalation ladder · Switch QoS — PFC, ECN, DCQCN, buffer profiles
Phase 05
Host & orchestration
Plug the fabric into the GPU host.
Host Networking — SR-IOV, Multus, nvidia-peermem, NCCL · Linux for Network Engineers · Kubernetes for Network Engineers
Phase 06
Build & operate
Build it, run it, fix it at 3 AM.
HPC Cluster Designs — the 15-layer stack + 5 provisioning patterns · Building a Training Cluster · Inference Networking · Production Operations · Cluster Build Guide — BoM, RCAs, runbooks
