LIFE OF A TRAINING JOB: AI Cluster Networking
⚡ The Problem — Why Normal Networking Fails
Why TCP/IP is too slow for GPU training
The Old Way — TCP/IP Path
GPU HBM Memory
↓ copy to host RAM
CPU / Host RAM
↓ kernel + TCP stack
OS Kernel / Socket Buffer
↓ NIC sends
Network
↓ reverse path
Remote GPU HBM
4× memory copies   CPU always busy   50–100μs latency
VS
The RDMA Way — Zero CPU
GPU HBM Memory
↓ DMA via PCIe (NIC reads directly)
NIC Hardware
↓ RoCEv2 packets
Network (switches)
↓ DMA into remote GPU memory
Remote GPU HBM
CPU 😴 sleeping
0 CPU copies   CPU stays idle   2–5μs latency
Latency Comparison
TCP/IP
~80μs
RDMA
~3μs
At 10,000 GPUs
Every wasted microsecond × millions of ops = hours of idle GPU time per day

TCP → GPUs wait on network
RDMA → GPUs always computing
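A back-of-the-envelope sketch of the idle-time claim. The two latencies come from the comparison above; the operation count per day is a hypothetical figure chosen only for illustration, and the model assumes each operation's latency is fully exposed (no overlap with compute).

```python
# Rough estimate of per-GPU idle time caused by network latency.
# ASSUMPTION: ops_per_day is a made-up illustrative number, and every
# operation's latency is fully exposed (nothing overlaps with compute).

TCP_LATENCY_US = 80   # from the latency comparison above
RDMA_LATENCY_US = 3   # from the latency comparison above


def idle_hours_per_day(latency_us: float, ops_per_day: int) -> float:
    """Hours per day that one GPU spends waiting on the network."""
    idle_seconds = latency_us * 1e-6 * ops_per_day
    return idle_seconds / 3600


if __name__ == "__main__":
    ops = 100_000_000  # hypothetical: 100 million small network ops per day
    print(f"TCP : {idle_hours_per_day(TCP_LATENCY_US, ops):.2f} h idle per GPU")
    print(f"RDMA: {idle_hours_per_day(RDMA_LATENCY_US, ops):.2f} h idle per GPU")
```

Under these assumptions the gap is roughly two hours versus a few minutes per GPU per day; multiply by the GPU count to get cluster-wide idle time.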
🚀 Step 1 — Job Submission (SLURM)
sbatch → ranks assigned → 32 processes launch
# User types on their laptop
$
What SLURM does
1
Allocate nodes
Finds 4 idle servers with 8 GPUs each
2
Assign ranks
Each GPU gets a unique global ID (0–31)
3
Launch processes
32 copies of train.py start simultaneously
4
Set env vars
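MASTER_ADDR, WORLD_SIZE, RANK set for every process (typically by the launch script or torchrun, derived from SLURM's own variables such as SLURM_PROCID and SLURM_NTASKS)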
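A minimal sketch of how a process might derive and read its identity. The node-major numbering (node01 holds ranks 0 to 7, node02 holds 8 to 15, and so on) is an assumption for illustration; the helper names `global_rank` and `read_rank_env` are hypothetical.

```python
import os

GPUS_PER_NODE = 8  # from the example: 4 nodes x 8 GPUs


def global_rank(node_index: int, local_rank: int,
                gpus_per_node: int = GPUS_PER_NODE) -> int:
    """Global rank under an assumed node-major layout."""
    return node_index * gpus_per_node + local_rank


def read_rank_env(env=os.environ) -> dict:
    """Read the variables the launcher is expected to have set."""
    return {
        "rank": int(env["RANK"]),
        "world_size": int(env["WORLD_SIZE"]),
        "master_addr": env["MASTER_ADDR"],
    }


if __name__ == "__main__":
    # GPU 3 on the second node (index 1) is rank 11 in this layout.
    print(global_rank(node_index=1, local_rank=3))
```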
32 GPU Ranks — 4 nodes × 8 GPUs
node01
node02
node03
node04
Rendezvous — processes find each other
Master
node01:29500
TCP rendezvous
✓ After rendezvous and QP exchange, TCP carries no training data. RDMA takes over.
🔗 RDMA — How Memory Moves Without the CPU
InfiniBand standardized RDMA → RoCEv2 brought it to Ethernet
InfiniBand Origins (1999)
IB is a complete separate network — its own cables, switches, NICs.

Key invention: Queue Pairs (QP)

Send Queue: app drops jobs here
Recv Queue: app posts buffers where incoming messages land
Completion Queue: NIC signals done

No CPU involved after setup.
RoCEv2 = IB Transport on Ethernet
Take IB's reliability engine →
wrap it in UDP/IP/Ethernet

Now RDMA is routable across
spine-leaf switches. No special
IB switches required.
Queue Pair — The Core of RDMA
NIC (Node A): Send Queue (SQ) → RoCEv2 → NIC (Node B): Recv Queue (RQ) → DMA → GPU HBM
NIC (Node B): sends ACK back → NIC (Node A): Completion Queue → App polls CQ for "done"
IB Reliability Engine (in NIC silicon, not software)
1
PSN Assignment
Each packet gets a sequence number. Packet 1=PSN 5000, packet 2=PSN 5001...
2
Receiver Tracks PSN
Expects 5000, gets 5000 ✓, expects 5001...
3
GAP Detected
Gets 5002 but expected 5001 → sends NAK(5001)
4
Sender Retransmits
Go-Back-N from PSN 5001 onward
5
ACK = Cumulative
"All received up to PSN 11143" — one ACK covers thousands of packets
All PSN/ACK/NAK/retransmit logic lives in NIC hardware. Zero CPU. Zero kernel.
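The five steps above can be sketched as a toy simulation. This is a simplified software model, not how NIC silicon is built: it assumes the NAK reaches the sender instantly, models a lost final packet with a sender timeout, and ignores 24-bit PSN wraparound. The function name `go_back_n_transfer` is hypothetical.

```python
def go_back_n_transfer(first_psn: int, count: int, drop_once: set) -> list:
    """Simulate one sender/receiver pair and return the event log.

    drop_once: PSNs the network loses the first time they are sent.
    """
    dropped = set(drop_once)        # PSNs still scheduled to be lost
    last_psn = first_psn + count - 1
    expected = first_psn            # receiver: next PSN it will accept
    next_to_send = first_psn        # sender: next PSN to put on the wire
    log = []

    while expected <= last_psn:
        if next_to_send > last_psn:
            # Everything was sent but the receiver is still missing data.
            log.append(f"TIMEOUT -> resend from {expected}")
            next_to_send = expected
            continue
        psn = next_to_send
        next_to_send += 1
        if psn in dropped:
            dropped.discard(psn)    # lost in the network, only once
            continue
        if psn == expected:
            log.append(f"recv {psn} ok")
            expected += 1
        else:
            # Gap: receiver NAKs the missing PSN, sender goes back to it.
            log.append(f"recv {psn} out of order -> NAK({expected})")
            next_to_send = expected

    log.append(f"ACK({last_psn}) cumulative")
    return log


if __name__ == "__main__":
    for event in go_back_n_transfer(5000, 4, drop_once={5001}):
        print(event)
```

Note that PSN 5002 is delivered twice in this run: once out of order (discarded) and once after the retransmit. That wasted bandwidth is the cost of Go-Back-N.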
📦 RoCEv2 Packet — Layer by Layer
IB transport wrapped inside UDP/IP/Ethernet
Packet Stack (built bottom-up)
Ethernet src/dst MAC — L2 framing
IP Header src=10.0.0.1 dst=10.0.0.2 ECN capable
UDP :4791 dst port 4791 = always RoCEv2
BTH OpCode | QPN=17 | PSN=5000
RETH VA=0x7f800000 RKEY=0x1234 Len=24MB
Payload 4096 bytes of gradient data
ICRC Invariant CRC — integrity check
What Each Header Means
ETHERNET FRAME
Standard L2 framing. Source and destination MAC are rewritten at every routed (L3) hop; a pure L2 switch forwards the frame unchanged. Invisible to RDMA logic.
IP HEADER
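Makes the packet routable: it can cross subnets and travel through spine switches. ECMP typically hashes on the 5-tuple (src/dst IP, protocol, src/dst UDP port); RoCEv2 varies the UDP source port per flow so that flows spread across links. ECN bits live here.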
UDP — Port 4791
Just a thin wrapper. Port 4791 is IANA-reserved for RoCEv2. No TCP reliability here — that's in the BTH.
BTH — Base Transport Header (12 bytes)
OpCode: RDMA_WRITE_FIRST / MIDDLE / LAST / ONLY / ACK / NAK
QPN: Which Queue Pair this belongs to
PSN: Packet Sequence Number — the reliability engine
RETH — RDMA Extended Transport Header
Virtual Address: Exactly where in remote GPU HBM to write
RKEY: Capability token — remote NIC validates this before DMA
Length: Total transfer size (24MB in our example)
PAYLOAD + ICRC
Up to 4096 bytes of actual gradient data. ICRC is a CRC over the invariant fields — catches corruption that switches may introduce.
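To make the BTH and RETH layouts concrete, here is a small packing sketch using the example values from the packet stack. The bit positions and the opcode value 0x06 (RC RDMA WRITE First) follow my reading of the InfiniBand transport header layout and should be treated as illustrative; the function names are hypothetical, and the Ethernet/IP/UDP headers, payload, and ICRC are left out.

```python
import struct

UDP_DST_PORT_ROCEV2 = 4791
OPCODE_RC_RDMA_WRITE_FIRST = 0x06  # assumed value: RC transport, RDMA WRITE First


def pack_bth(opcode: int, dest_qpn: int, psn: int,
             pkey: int = 0xFFFF, ack_req: bool = False) -> bytes:
    """Pack a 12-byte Base Transport Header."""
    flags = 0                                  # SE, MigReq, PadCnt, TVer all zero here
    qpn_word = dest_qpn & 0xFFFFFF             # top 8 bits are reserved
    psn_word = (int(ack_req) << 31) | (psn & 0xFFFFFF)
    return struct.pack("!BBHII", opcode, flags, pkey, qpn_word, psn_word)


def pack_reth(virtual_addr: int, rkey: int, dma_length: int) -> bytes:
    """Pack a 16-byte RDMA Extended Transport Header."""
    return struct.pack("!QII", virtual_addr, rkey, dma_length)


if __name__ == "__main__":
    bth = pack_bth(OPCODE_RC_RDMA_WRITE_FIRST, dest_qpn=17, psn=5000)
    reth = pack_reth(0x7F800000, 0x1234, 24 * 1024 * 1024)
    print(len(bth), bth.hex())
    print(len(reth), reth.hex())
```

The 32-bit length field is why a single RDMA WRITE can describe a 24MB transfer even though each packet carries at most 4096 bytes of it.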
🔄 NCCL — The Choreographer
Decides who sends what to whom. Never touches a packet.
The Stack — Who Lives Where
PyTorch / JAX
calls ncclAllReduce()
NCCL
who sends what to whom
libibverbs
posts WQEs to NIC
NIC Hardware
packets, ACK, retransmit
Switch Fabric
ECMP, ECN, PFC
NCCL never sees a packet. NIC never knows about AllReduce. Switch never knows about gradients.
Ring AllReduce — 4 GPUs, gradient tensor [A,B,C,D]
Phase Explanation
Phase 1: ReduceScatter
Each GPU sends one chunk to next GPU in ring.
Receiver adds (reduces) it to its own chunk.
After N-1 rounds: each GPU owns one fully summed chunk.
Phase 2: AllGather
Each GPU sends its complete chunk around the ring.
After N-1 rounds: every GPU has the complete summed tensor.
Every link used. No bottleneck.
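The two phases can be checked with a small pure-Python simulation. This is a sketch of the ring algorithm on plain lists, not NCCL's implementation; the chunk-to-step schedule below is one common choice and is an assumption, as is the function name `ring_allreduce`.

```python
def ring_allreduce(buffers):
    """buffers[g][c] is chunk c on GPU g. Returns the per-GPU results."""
    n = len(buffers)
    data = [list(b) for b in buffers]          # work on a copy
    assert all(len(b) == n for b in data), "one chunk per GPU in this sketch"

    # Phase 1: ReduceScatter, N-1 rounds.
    for step in range(n - 1):
        # Snapshot what every GPU sends before applying any receive.
        sends = [((g - step) % n, data[g][(g - step) % n]) for g in range(n)]
        for g, (chunk, value) in enumerate(sends):
            data[(g + 1) % n][chunk] += value  # receiver reduces (adds)

    # GPU g now owns the fully summed chunk (g + 1) % n.
    # Phase 2: AllGather, N-1 rounds.
    for step in range(n - 1):
        sends = [((g + 1 - step) % n, data[g][(g + 1 - step) % n]) for g in range(n)]
        for g, (chunk, value) in enumerate(sends):
            data[(g + 1) % n][chunk] = value   # receiver overwrites

    return data


if __name__ == "__main__":
    grads = [
        [1, 2, 3, 4],
        [10, 20, 30, 40],
        [100, 200, 300, 400],
        [1000, 2000, 3000, 4000],
    ]
    for row in ring_allreduce(grads):
        print(row)
```

In every round each GPU sends exactly one chunk to its right-hand neighbour, which is why all links stay busy at once.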
🔐 QP Setup — Establishing RDMA Connections
Happens once at NCCL init. After this: pure RDMA, zero TCP.
# Node01 side — libibverbs calls
QP Setup Sequence
1
ibv_alloc_pd()
Allocate Protection Domain — a security namespace for this process's RDMA operations
2
ibv_reg_mr(gpu_buffer)
Register GPU HBM memory with NIC. NIC pins the pages. Returns RKEY=0xABCD — a capability token
3
ibv_create_qp()
Create Queue Pair with its SQ and RQ, attached to a Completion Queue created beforehand with ibv_create_cq(). Returns QPN=42.
4
Exchange via TCP
Send to node02: QPN=42, RKEY=0xABCD, VA=0x7f000000. Receive back: QPN=17, RKEY=0x1234, VA=0x7f800000
5
ibv_modify_qp() → RTS
Transition QP state: INIT → RTR → RTS (Ready To Send). QP now knows remote QPN + IP. RDMA connection live.
✓ RKEY validates that the sender is authorized to write into that specific memory region. Security at the NIC level.
Result — Connection State
Node01 knows:
remote QPN = 17
remote IP = 10.0.0.2
remote VA = 0x7f800000
remote RKEY = 0x1234
Node02 knows:
remote QPN = 42
remote IP = 10.0.0.1
remote VA = 0x7f000000
remote RKEY = 0xABCD
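The exchange and the resulting connection state can be modeled in a few lines. This is a toy model for illustration only: it does not call libibverbs, and `ConnInfo`, `QueuePairModel`, and `exchange` are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum


class QPState(Enum):
    INIT = "INIT"
    RTR = "RTR"   # Ready To Receive
    RTS = "RTS"   # Ready To Send


@dataclass(frozen=True)
class ConnInfo:
    """What one side publishes to its peer over the TCP bootstrap channel."""
    qpn: int
    ip: str
    va: int
    rkey: int


class QueuePairModel:
    """Toy model of a QP's state transitions; not a libibverbs binding."""
    _NEXT = {QPState.INIT: QPState.RTR, QPState.RTR: QPState.RTS}

    def __init__(self, local: ConnInfo):
        self.local = local
        self.remote = None
        self.state = QPState.INIT

    def connect(self, remote: ConnInfo) -> None:
        """Record the peer's details, then walk INIT -> RTR -> RTS."""
        self.remote = remote
        while self.state in self._NEXT:
            self.state = self._NEXT[self.state]


def exchange(a: QueuePairModel, b: QueuePairModel) -> None:
    """Stand-in for the TCP exchange: each side learns the other's info."""
    a.connect(b.local)
    b.connect(a.local)


if __name__ == "__main__":
    node01 = QueuePairModel(ConnInfo(42, "10.0.0.1", 0x7F000000, 0xABCD))
    node02 = QueuePairModel(ConnInfo(17, "10.0.0.2", 0x7F800000, 0x1234))
    exchange(node01, node02)
    print(node01.state.value, hex(node01.remote.rkey))
```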
✈️ Life of a Packet — Node01 GPU7 → Node02 GPU0
24MB gradient chunk, 6144 packets
Path: NCCL → NIC (node01) → ToR Switch 1 → Spine Switch → ToR Switch 2 → NIC (node02) → GPU0 HBM
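The packet count follows directly from the chunk size and the 4096-byte payload. The sketch below segments one RDMA WRITE into packets and labels each with its PSN and opcode name; the starting PSN of 5000 matches the earlier example, and the generator name `segment_write` is hypothetical.

```python
MTU_PAYLOAD = 4096
PSN_MODULO = 1 << 24  # the PSN is a 24-bit field and wraps around


def segment_write(total_bytes: int, first_psn: int, mtu: int = MTU_PAYLOAD):
    """Yield (psn, opcode_name, payload_len) for each packet of one RDMA WRITE."""
    count = -(-total_bytes // mtu)  # ceiling division
    for i in range(count):
        if count == 1:
            op = "RDMA_WRITE_ONLY"
        elif i == 0:
            op = "RDMA_WRITE_FIRST"
        elif i == count - 1:
            op = "RDMA_WRITE_LAST"
        else:
            op = "RDMA_WRITE_MIDDLE"
        size = min(mtu, total_bytes - i * mtu)
        yield ((first_psn + i) % PSN_MODULO, op, size)


if __name__ == "__main__":
    packets = list(segment_write(24 * 1024 * 1024, first_psn=5000))
    print(len(packets), packets[0], packets[-1])
```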
🚦 Congestion Control — ECN & PFC
How the network signals and handles congestion without dropping packets
ECN — Explicit Congestion Notification (preferred)
Sender NIC → Switch (buf > 50%, marks ECN=CE bit) → Recv NIC → CNP packet (Congestion Notification) back to Sender NIC
1. Switch buffer fills >50% threshold
2. Switch marks ECN=CE bit in IP header — packet still forwarded, NOT dropped
3. Receiver NIC sees CE mark → sends CNP back to sender
4. Sender NIC runs DCQCN algorithm: cut rate multiplicatively (by up to half, scaled by its running congestion estimate), then increase slowly
✓ No packets dropped. No CPU involved. Rate adjusts in microseconds in NIC hardware.
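A simplified sketch of the sender-side rate logic in step 4. It follows the update rules as I understand them from the published DCQCN description (multiplicative decrease scaled by a congestion estimate alpha, then recovery toward a remembered target rate); the timers, byte counters, and additive-increase stage are omitted, and the class and method names are hypothetical. The size of each cut depends on alpha, so it can range from a few percent to one half.

```python
class DcqcnSender:
    """Toy DCQCN reaction point: rate cut on CNP, fast recovery afterwards."""

    def __init__(self, line_rate_gbps: float, g: float = 1 / 256):
        self.line_rate = line_rate_gbps
        self.rc = line_rate_gbps   # current sending rate
        self.rt = line_rate_gbps   # target rate to recover toward
        self.alpha = 1.0           # congestion estimate, between 0 and 1
        self.g = g                 # gain used when updating alpha

    def on_cnp(self) -> None:
        """A CNP arrived: remember the old rate, then cut."""
        self.rt = self.rc
        self.rc *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_quiet_period(self) -> None:
        """No CNP during the last update window: decay the estimate."""
        self.alpha *= 1 - self.g

    def on_increase_timer(self) -> None:
        """Fast recovery: move halfway back toward the target rate."""
        self.rc = min(self.line_rate, (self.rt + self.rc) / 2)


if __name__ == "__main__":
    nic = DcqcnSender(line_rate_gbps=400)
    nic.on_cnp()
    print(nic.rc)
    nic.on_increase_timer()
    print(nic.rc)
```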
PFC — Priority Flow Control (emergency brake)
When a switch buffer is about to overflow completely:

Switch → sends PAUSE frame upstream
"Stop sending on priority 3 for N pause quanta" (N ≤ 65,535; one quantum = 512 bit times, so at most about 0.34 ms at 100 Gb/s)

Upstream NIC freezes that traffic class.
Buffer drains. NIC resumes.
⚠ PFC storms: if misconfigured, pause propagates upstream and deadlocks the entire fabric. Must enable PFC only on RDMA priority class.
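How long a PAUSE frame actually stops traffic depends on link speed, because the pause time is expressed in quanta of 512 bit times in a 16-bit field. A small conversion sketch (the function name `pause_seconds` is hypothetical):

```python
QUANTUM_BITS = 512     # one pause quantum = 512 bit times
MAX_QUANTA = 0xFFFF    # the pause time is a 16-bit field


def pause_seconds(quanta: int, link_bps: float) -> float:
    """Convert a PFC pause time in quanta to seconds for a given link speed."""
    if not 0 <= quanta <= MAX_QUANTA:
        raise ValueError("pause time must fit in 16 bits")
    return quanta * QUANTUM_BITS / link_bps


if __name__ == "__main__":
    for gbps in (10, 100, 400):
        micros = pause_seconds(MAX_QUANTA, gbps * 1e9) * 1e6
        print(f"{gbps} Gb/s: max pause {micros:.1f} us")
```

Because a single frame pauses for well under a millisecond on fast links, a congested switch keeps refreshing the pause until its buffer drains, and sends a pause of zero quanta to resume early.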
Packet Loss — NAK & Retransmit
Packet PSN=7200 is dropped (rare):

Receiver: got 7199 ✓, got 7201 → GAP!
Receiver: sends NAK(7200) to sender
Sender: stops new sends, retransmits from 7200
(Go-Back-N: 7200, 7201, 7202...)
✗ One packet loss in AllReduce across 1000 GPUs stalls the entire collective. This is why lossless fabric is mandatory.
✅ Completion — NCCL Gets the Signal
Remote NIC writes to GPU memory, ACK travels back, NCCL moves to next ring step
Completion Sequence
1
RKEY validated
Remote NIC checks RKEY=0x1234 (carried in the first packet's RETH) against the registered memory region. Valid ✓. Proceeds to DMA.
2
DMA into GPU HBM
NIC writes each packet's payload directly into GPU0 HBM as it arrives, starting at virtual address 0x7f800000. PCIe DMA. CPU = sleeping.
3
All 6144 packets received
Remote NIC (node02) has received and written every packet of the 24MB chunk, PSNs 5000–11143.
4
ACK sent back
Remote NIC sends ACK(PSN=11143) back to node01 NIC. "All 6144 packets received and written."
5
CQE posted
Node01 NIC receives ACK, posts Completion Queue Element to NCCL's Completion Queue.
6
NCCL polls CQ
NCCL calls ibv_poll_cq(), gets CQE. Transfer complete. Moves to next ring step. No CPU interrupt — polling.
7
AllReduce complete
After 31+31 ring steps, every GPU has the full summed gradient tensor. PyTorch optimizer runs.
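The step count generalizes to any ring size. The sketch below computes the number of steps and the bytes each GPU sends; treating the 24MB transfer as one ring chunk (so the full tensor is 32 × 24MB) is an assumption made for illustration, not something the walkthrough states, and the function name `ring_allreduce_cost` is hypothetical.

```python
def ring_allreduce_cost(num_gpus: int, tensor_bytes: int) -> dict:
    """Step count and per-GPU traffic for a ring AllReduce."""
    steps_per_phase = num_gpus - 1          # ReduceScatter, then AllGather
    chunk_bytes = tensor_bytes / num_gpus   # one chunk moves per step
    return {
        "steps": 2 * steps_per_phase,
        "chunk_bytes": chunk_bytes,
        "bytes_sent_per_gpu": 2 * steps_per_phase * chunk_bytes,
    }


if __name__ == "__main__":
    MB = 1024 * 1024
    cost = ring_allreduce_cost(num_gpus=32, tensor_bytes=32 * 24 * MB)
    print(cost["steps"], cost["chunk_bytes"] / MB, cost["bytes_sent_per_gpu"] / MB)
```

Each GPU sends 2 × (N−1)/N times the tensor size, which approaches twice the tensor size as N grows, independent of how many GPUs join the ring.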
Timeline — One Training Step
t=0ms    Forward pass starts
t=200ms Forward pass done
t=201ms Backward (gradient compute)
t=400ms Gradients in GPU HBM
t=401ms NCCL posts RDMA WQEs
t=401ms NIC starts sending packets
t=403ms AllReduce complete ✓
t=403ms Optimizer step
t=405ms Next batch starts
AllReduce across 32 GPUs: ~2ms
During that 2ms: zero CPU
involvement in data path.
Who Did What — Final Recap
SLURM → allocated nodes & ranks
NCCL  → ring algorithm, WQE posting
NIC   → packets, PSN, ACK, retransmit
Switch → ECMP, ECN, PFC
GPU   → compute & HBM storage
🗺️ Full Picture — Who Owns What
Every layer does exactly one job
Layer  ·  Component  ·  Owns  ·  Does NOT own
The AllReduce itself took about 2 milliseconds.
32 GPUs  ·  4 servers  ·  Terabytes moved per second  ·  Zero CPU in the data path
Packet Journey Summary
Welcome. This is the full journey of a GPU training job — from a single Python command to packets flowing through real hardware.