✓ After rendezvous: TCP lines disappear. RDMA takes over.
🔗 RDMA — How Memory Moves Without the CPU
InfiniBand standardized RDMA → RoCEv2 brought it to Ethernet
InfiniBand Origins (1999)
IB is a completely separate network, with its own cables, switches, and NICs.
Key invention: Queue Pairs (QP)
Send Queue: app drops jobs (work requests) here
Recv Queue: app posts buffers for incoming messages
Completion Queue: NIC signals done
No CPU involved after setup.
RoCEv2 = IB Transport on Ethernet
Take IB's reliability engine → wrap it in UDP/IP/Ethernet.
Now RDMA is routable across spine-leaf switches. No special IB switches required.
Queue Pair — The Core of RDMA
IB Reliability Engine (in NIC silicon, not software)
1. PSN Assignment: Each packet gets a sequence number. Packet 1 = PSN 5000, packet 2 = PSN 5001, ...
2. Receiver Tracks PSN: Expects 5000, gets 5000 ✓, expects 5001, ...
3. Gap Detected: Gets 5002 but expected 5001 → sends NAK(5001)
4. Sender Retransmits: Go-Back-N from PSN 5001 onward
5. ACK = Cumulative: "All received up to PSN 11143". One ACK covers thousands of packets.
All PSN/ACK/NAK/retransmit logic lives in NIC hardware. Zero CPU. Zero kernel.
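The five steps above can be sketched as a small simulation. This is a toy model, not NIC firmware: the function name `simulate_go_back_n` and its parameters are made up for illustration, the NAK is assumed to reach the sender instantly, and PSN wrap-around (real PSNs are 24-bit) is ignored.

```python
def simulate_go_back_n(first_psn, count, drop_once):
    """Send `count` packets starting at `first_psn` over a lossy link.

    `drop_once` is a set of PSNs that are lost the first time they are sent.
    Returns (psns_put_on_wire, naks_sent, final_cumulative_ack).
    """
    dropped = set(drop_once)
    last_psn = first_psn + count - 1
    expected = first_psn        # receiver: next PSN it will accept
    next_to_send = first_psn    # sender: next PSN to transmit
    wire, naks = [], []
    nak_outstanding = False

    while expected <= last_psn:
        if next_to_send > last_psn:
            # Sender ran out of packets but data is still missing:
            # model a retransmit timeout by going back to the gap.
            next_to_send = expected
            nak_outstanding = False
            continue
        psn = next_to_send
        next_to_send += 1
        wire.append(psn)
        if psn in dropped:
            dropped.remove(psn)     # lost in the fabric (only once)
            continue
        if psn == expected:
            expected += 1           # in order: accept it
            nak_outstanding = False
        elif psn > expected and not nak_outstanding:
            naks.append(expected)   # gap: NAK the first missing PSN
            nak_outstanding = True
            next_to_send = expected # sender rewinds (Go-Back-N)
    return wire, naks, expected - 1


wire, naks, ack = simulate_go_back_n(5000, 4, {5001})
print(wire)  # [5000, 5001, 5002, 5001, 5002, 5003]
print(naks)  # [5001]
print(ack)   # 5003
```

Note that PSN 5002 crosses the wire twice even though it was never lost. That is the cost of Go-Back-N: the receiver keeps no out-of-order buffer, so everything after the gap is resent.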
📦 RoCEv2 Packet — Layer by Layer
IB transport wrapped inside UDP/IP/Ethernet
Packet Stack (built bottom-up)
Ethernet: src/dst MAC, L2 framing
IP Header: src=10.0.0.1 dst=10.0.0.2, ECN capable
UDP (:4791): dst port 4791 = always RoCEv2
BTH: OpCode | QPN=17 | PSN=5000
RETH: VA=0x7f800000 RKEY=0x1234 Len=24MB
Payload: 4096 bytes of gradient data
ICRC: Invariant CRC, integrity check
What Each Header Means
ETHERNET FRAME
Standard L2 framing. Rewritten at every routed (L3) hop: the router swaps in its own source MAC and the next hop's destination MAC. Invisible to RDMA logic.
IP HEADER
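Makes the packet routable: it can cross subnets and travel through spine switches. ECMP typically hashes on the IP addresses plus the UDP ports, so the sender varies the UDP source port per flow to spread traffic across paths. ECN bits live here.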
UDP — Port 4791
Just a thin wrapper. Port 4791 is IANA-reserved for RoCEv2. No TCP reliability here — that's in the BTH.
BTH — Base Transport Header (12 bytes)
OpCode: RDMA_WRITE_FIRST / MIDDLE / LAST / ONLY / ACK / NAK
QPN: Which Queue Pair this belongs to
PSN: Packet Sequence Number, the input to the reliability engine
RETH — RDMA Extended Transport Header
Virtual Address: Exactly where in remote GPU HBM to write
RKEY: Capability token, validated by the remote NIC before DMA
Length: Total transfer size (24MB in our example)
PAYLOAD + ICRC
Up to 4096 bytes of actual gradient data. ICRC is a CRC over the invariant fields — catches corruption that switches may introduce.
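To make the header sizes concrete, here is a minimal sketch that packs a BTH and a RETH with Python's `struct` module, using the example values from the packet stack (QPN=17, PSN=5000, VA=0x7f800000, RKEY=0x1234, 24MB). The helper names are hypothetical. The field layout and the opcode value 0x06 for an RC RDMA WRITE First follow my reading of the InfiniBand specification and should be checked against it; the flag bits and the default partition key are simply assumed.

```python
import struct

RDMA_WRITE_FIRST = 0x06  # assumed RC opcode value, verify against the IB spec


def pack_bth(opcode, dest_qpn, psn, pkey=0xFFFF, ack_req=False):
    """Pack a 12-byte Base Transport Header (network byte order)."""
    flags = 0  # SE, MigReq, PadCnt, TVer: all zero in this sketch
    qp_word = dest_qpn & 0xFFFFFF  # 8 reserved bits + 24-bit QPN
    psn_word = (int(ack_req) << 31) | (psn & 0xFFFFFF)  # A bit + 24-bit PSN
    return struct.pack("!BBHII", opcode, flags, pkey, qp_word, psn_word)


def pack_reth(virtual_addr, rkey, length):
    """Pack a 16-byte RETH: 64-bit VA, 32-bit RKEY, 32-bit DMA length."""
    return struct.pack("!QII", virtual_addr, rkey, length)


bth = pack_bth(RDMA_WRITE_FIRST, dest_qpn=17, psn=5000)
reth = pack_reth(0x7F800000, 0x1234, 24 * 1024 * 1024)
print(len(bth), bth.hex())    # 12 0600ffff0000001100001388
print(len(reth), reth.hex())  # 16 000000007f800000000012340180 0000 (without the space)
```

Both the QPN and the PSN are 24-bit fields, which is why PSNs wrap around at 2^24.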
🔄 NCCL — The Choreographer
Decides who sends what to whom. Never touches a packet.
The Stack — Who Lives Where
PyTorch / JAX: ends up calling ncclAllReduce() (e.g. through torch.distributed.all_reduce)
NCCL: who sends what to whom
libibverbs: posts WQEs to NIC
NIC Hardware: packets, ACK, retransmit
Switch Fabric: ECMP, ECN, PFC
NCCL never sees a packet. NIC never knows about AllReduce. Switch never knows about gradients.
Ring AllReduce — 4 GPUs, gradient tensor [A,B,C,D]
Phase Explanation
Phase 1: ReduceScatter
Each GPU sends one chunk to next GPU in ring.
Receiver adds (reduces) it to its own chunk.
After N-1 rounds: each GPU owns one fully summed chunk.
Phase 2: AllGather
Each GPU sends its complete chunk around the ring.
After N-1 rounds: every GPU has the complete summed tensor.
Every link used. No bottleneck.
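The two phases can be simulated in a few lines. This is a sketch of the ring algorithm as described above, not NCCL's implementation: the function name `ring_allreduce` is made up, each "chunk" is a single number, and the chunk-to-step schedule is one common choice.

```python
def ring_allreduce(data):
    """data[g][c] is GPU g's local value for chunk c (one chunk per GPU).

    Returns the per-GPU buffers after ReduceScatter + AllGather.
    """
    n = len(data)
    buf = [list(row) for row in data]

    # Phase 1: ReduceScatter. In step s, GPU g sends chunk (g - s) % n
    # to its ring neighbour, which adds it to its own copy.
    for step in range(n - 1):
        sends = [((g + 1) % n, (g - step) % n, buf[g][(g - step) % n])
                 for g in range(n)]          # snapshot: all sends are simultaneous
        for dst, chunk, value in sends:
            buf[dst][chunk] += value
    # Now GPU g holds the fully summed chunk (g + 1) % n.

    # Phase 2: AllGather. In step s, GPU g forwards chunk (g + 1 - s) % n,
    # and the receiver overwrites its stale copy.
    for step in range(n - 1):
        sends = [((g + 1) % n, (g + 1 - step) % n, buf[g][(g + 1 - step) % n])
                 for g in range(n)]
        for dst, chunk, value in sends:
            buf[dst][chunk] = value
    return buf


grads = [[1, 2, 3, 4],
         [10, 20, 30, 40],
         [100, 200, 300, 400],
         [1000, 2000, 3000, 4000]]
for row in ring_allreduce(grads):
    print(row)  # every GPU: [1111, 2222, 3333, 4444]
```

In every step each GPU sends exactly one chunk and receives exactly one chunk, so all ring links carry the same load.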
🔐 QP Setup — Establishing RDMA Connections
Happens once at NCCL init. After this: pure RDMA, zero TCP.
# Node01 side — libibverbs calls
QP Setup Sequence
1. ibv_alloc_pd(): Allocate a Protection Domain, a security namespace for this process's RDMA operations.
2. ibv_reg_mr(gpu_buffer): Register GPU HBM memory with the NIC. The pages are pinned so they cannot move. Returns RKEY=0xABCD, a capability token.
3. ibv_create_qp(): Create the Queue Pair. Returns QPN=42. The NIC sets up the SQ and RQ and attaches them to a CQ (created separately with ibv_create_cq()).
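What the setup steps buy you is a table inside the NIC that later decides whether an incoming RDMA write is allowed. The sketch below models only that bookkeeping. `ToyNic` and its methods are hypothetical stand-ins, not the libibverbs API, and the sequential RKEY/QPN numbering is an assumption made so the output matches the values above (a real device picks these itself).

```python
import itertools
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryRegion:
    addr: int
    length: int
    rkey: int


class ToyNic:
    """Hypothetical model of a NIC's registration tables (not libibverbs)."""

    def __init__(self):
        self.regions = {}                      # rkey -> MemoryRegion
        self._rkeys = itertools.count(0xABCD)  # assumed numbering
        self._qpns = itertools.count(42)       # assumed numbering

    def reg_mr(self, addr, length):
        """Register a buffer and hand back its capability token."""
        mr = MemoryRegion(addr, length, next(self._rkeys))
        self.regions[mr.rkey] = mr
        return mr

    def create_qp(self):
        """Allocate a queue pair number."""
        return next(self._qpns)

    def validate_write(self, rkey, addr, length):
        """Allow a remote write only if the rkey exists and the range fits."""
        mr = self.regions.get(rkey)
        return (mr is not None
                and mr.addr <= addr
                and addr + length <= mr.addr + mr.length)


nic = ToyNic()
mr = nic.reg_mr(0x7F800000, 24 * 1024 * 1024)
qpn = nic.create_qp()
print(hex(mr.rkey), qpn)                               # 0xabcd 42
print(nic.validate_write(mr.rkey, 0x7F800000, 4096))   # True
print(nic.validate_write(0x9999, 0x7F800000, 4096))    # False: unknown rkey
```

The peer needs the address, the RKEY, and the QPN before it can send anything, which is why this information has to be exchanged out of band once during setup.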
ECN + DCQCN: Congestion Control
1. Switch buffer fills past the ECN marking threshold (50% in this example)
2. Switch marks ECN=CE bit in IP header — packet still forwarded, NOT dropped
3. Receiver NIC sees CE mark → sends CNP back to sender
4. Sender NIC runs the DCQCN algorithm: cut the rate multiplicatively (the size of the cut depends on how much congestion it has seen recently), then increase slowly
✓ No packets dropped. No CPU involved. Rate adjusts in microseconds in NIC hardware.
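A minimal sketch of the sender-side reaction in step 4, assuming the update rules published for DCQCN: on each CNP the rate is cut by alpha/2 and the congestion estimate alpha grows; in quiet periods alpha decays and the rate climbs halfway back to the remembered target. The function names and the gain `g = 1/256` are assumptions, and the real algorithm uses separate timers and byte counters that are folded into one "quiet period" here.

```python
def on_cnp(rate, alpha, g=1 / 256):
    """React to a Congestion Notification Packet.

    Returns (new_rate, target_rate, new_alpha).
    """
    target = rate                     # remember where we were
    rate = rate * (1 - alpha / 2)     # multiplicative decrease
    alpha = (1 - g) * alpha + g       # congestion estimate moves toward 1
    return rate, target, alpha


def on_quiet_period(rate, target, alpha, g=1 / 256):
    """No CNP arrived for a while: decay alpha, recover toward the target.

    Returns (new_rate, new_alpha).
    """
    alpha = (1 - g) * alpha
    rate = (rate + target) / 2        # fast recovery: halve the remaining gap
    return rate, alpha


# With alpha = 0.4 a single CNP cuts the rate by one fifth.
rate, target, alpha = on_cnp(100.0, 0.4)
print(round(rate, 3), target)         # 80.0 100.0
rate, alpha = on_quiet_period(rate, target, alpha)
print(round(rate, 3))                 # 90.0
```

The cut is not a fixed fraction: with alpha near 1 (persistent congestion) one CNP halves the rate, while with a small alpha the same CNP barely slows the sender.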
PFC — Priority Flow Control (emergency brake)
When a switch buffer is about to overflow completely:
Switch → sends PFC PAUSE frame upstream
"Stop sending on priority 3 for up to 65535 pause quanta" (about 0.34 ms at 100 Gb/s)
Upstream NIC freezes that traffic class.
Buffer drains. NIC resumes.
⚠ PFC storms: if misconfigured, pause propagates upstream and deadlocks the entire fabric. Must enable PFC only on RDMA priority class.
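How long a pause lasts follows from the frame format: the pause time is a 16-bit count of quanta, and one quantum is the time needed to transmit 512 bits at the link speed. The helper below (a hypothetical name) just does that arithmetic.

```python
def pfc_pause_seconds(quanta, link_bps):
    """Duration requested by a PFC frame: quanta * 512 bit times."""
    if not 0 <= quanta <= 0xFFFF:
        raise ValueError("pause time is a 16-bit field")
    return quanta * 512 / link_bps


for gbps in (10, 100, 400):
    us = pfc_pause_seconds(0xFFFF, gbps * 1e9) * 1e6
    print(f"{gbps:>3} Gb/s: max pause {us:.1f} us")
```

Even the maximum value is well under a millisecond at 100 Gb/s, so a switch that stays congested keeps refreshing the pause. A pause of zero quanta tells the upstream device to resume immediately.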
Packet Loss — NAK & Retransmit
Packet PSN=7200 is dropped (rare):
Receiver: got 7199 ✓, got 7201 → GAP!
Receiver: sends NAK(7200) to sender
Sender: stops new sends, retransmits from 7200
(Go-Back-N: 7200, 7201, 7202...)
✗ One packet loss in AllReduce across 1000 GPUs stalls the entire collective. This is why lossless fabric is mandatory.
✅ Completion — NCCL Gets the Signal
Remote NIC writes to GPU memory, ACK travels back, NCCL moves to next ring step
Completion Sequence
1. All 6144 packets received: Remote NIC (node02) received every packet of the 24MB chunk, PSNs 5000–11143, in order.
2. RKEY validated: Remote NIC checks RKEY=0x1234 against the registered memory region when the first packet (the one carrying the RETH) arrives. Valid ✓. Proceeds to DMA.
3. DMA into GPU HBM: NIC writes each payload directly into GPU0 HBM as it arrives, filling 24MB starting at virtual address 0x7f800000. PCIe DMA. CPU = sleeping.
4. ACK sent back: Remote NIC sends ACK(PSN=11143) back to node01 NIC. "All 6144 packets received and written."
5. CQE posted: Node01 NIC receives the ACK and posts a Completion Queue Element to NCCL's Completion Queue.
6. NCCL polls CQ: NCCL calls ibv_poll_cq(), gets the CQE. Transfer complete. Moves to next ring step. No CPU interrupt, just polling.
7. AllReduce complete: After 31+31 ring steps, every GPU has the full summed gradient tensor. PyTorch optimizer runs.
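The numbers in this sequence follow from three small formulas: packets = chunk size / payload size (rounded up), last PSN = first PSN + packets - 1 (modulo 2^24, since PSNs are 24-bit), and ring steps = 2 × (N - 1). A quick check, using a hypothetical helper name:

```python
def transfer_plan(chunk_bytes, payload_bytes, first_psn, num_gpus):
    """Return (packet_count, last_psn, ring_steps) for one chunk transfer."""
    packets = -(-chunk_bytes // payload_bytes)        # ceiling division
    last_psn = (first_psn + packets - 1) % (1 << 24)  # PSNs wrap at 2^24
    ring_steps = 2 * (num_gpus - 1)                   # ReduceScatter + AllGather
    return packets, last_psn, ring_steps


print(transfer_plan(24 * 1024 * 1024, 4096, 5000, 32))  # (6144, 11143, 62)
```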
Timeline — One Training Step
t=0ms Forward pass starts
t=200ms Forward pass done
t=201ms Backward (gradient compute)
t=400ms Gradients in GPU HBM
t=401ms NCCL posts RDMA WQEs
t=401ms NIC starts sending packets
t=403ms AllReduce complete ✓
t=403ms Optimizer step
t=405ms Next batch starts
AllReduce across 32 GPUs: ~2ms
During that 2ms: zero CPU involvement in data path.