Life of a Training Job — AI Cluster Deep Dive

⚡ The Problem — Why Normal Networking Fails

Why TCP/IP is too slow for GPU training

The Old Way — TCP/IP Path

GPU HBM Memory

↓ copy to host RAM

CPU / Host RAM

↓ kernel + TCP stack

OS Kernel / Socket Buffer

↓ NIC sends

Network

↓ reverse path

Remote GPU HBM

            ✗ 4× memory copies  
            ✗ CPU always busy  
            ✗ 50–100μs latency
          

VS

The RDMA Way — Zero CPU

GPU HBM Memory

↓ DMA via PCIe (NIC reads directly)

NIC Hardware

↓ RoCEv2 packets

Network (switches)

↓ DMA into remote GPU memory

Remote GPU HBM

CPU 😴 sleeping

            ✓ 0 CPU copies  
            ✓ CPU stays idle  
            ✓ 2–5μs latency
          

Latency Comparison

TCP/IP: ~80μs

RDMA: ~3μs

At 10,000 GPUs

              Every wasted microsecond × millions of ops = hours of idle GPU time per day

              TCP → GPUs wait on network

              RDMA → GPUs always computing
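
The "microseconds × millions of ops" claim is plain arithmetic. Here is a rough sketch using the two latency figures from the comparison above. The message count per day is a made-up illustrative number, not a measurement, and the model assumes none of the latency is hidden behind compute.

```python
# Back-of-envelope: how much GPU wait time does per-message latency add up to?
TCP_LATENCY_US = 80.0   # ~80 us, from the latency comparison above
RDMA_LATENCY_US = 3.0   # ~3 us, from the latency comparison above

# ASSUMPTION (hypothetical): 100 million latency-bound messages per GPU per day.
MESSAGES_PER_DAY = 100_000_000


def idle_hours_per_day(latency_us: float, messages_per_day: int) -> float:
    """Hours per day one GPU waits if every message exposes `latency_us`."""
    total_seconds = latency_us * messages_per_day / 1e6
    return total_seconds / 3600.0


tcp_hours = idle_hours_per_day(TCP_LATENCY_US, MESSAGES_PER_DAY)
rdma_hours = idle_hours_per_day(RDMA_LATENCY_US, MESSAGES_PER_DAY)
print(f"TCP:  {tcp_hours:.2f} h idle per GPU per day")
print(f"RDMA: {rdma_hours:.2f} h idle per GPU per day")
```

Under these assumptions the gap is a couple of hours per GPU per day, multiplied by every GPU in the cluster.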

🚀 Step 1 — Job Submission (SLURM)

sbatch → ranks assigned → 32 processes launch


What SLURM does

1

Allocate nodes

Finds 4 idle servers with 8 GPUs each

2

Assign ranks

Each GPU gets a unique global ID (0–31)

3

Launch processes

32 copies of train.py start simultaneously

4

Set env vars

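MASTER_ADDR, WORLD_SIZE, RANK set for every process (typically derived by the launch script from SLURM variables such as SLURM_PROCID and SLURM_NTASKS)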
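
Steps 2 and 4 can be sketched as a small mapping from SLURM's per-task variables to the names the training script reads. SLURM_PROCID, SLURM_NTASKS and SLURM_LOCALID are real SLURM variables. The helper names, the `master_addr` argument, and the block placement of ranks onto nodes are assumptions for illustration.

```python
GPUS_PER_NODE = 8  # from the example: 4 nodes x 8 GPUs


def torch_env_from_slurm(env, master_addr, master_port=29500):
    """Translate SLURM task variables into the variables train.py expects.

    `env` is a dict like os.environ. `master_addr` is supplied by the caller
    (hypothetical: the hostname of the first allocated node).
    """
    return {
        "RANK": env["SLURM_PROCID"],        # global rank, 0..WORLD_SIZE-1
        "WORLD_SIZE": env["SLURM_NTASKS"],  # total number of processes
        "LOCAL_RANK": env["SLURM_LOCALID"], # index of the task on its node
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
    }


def node_and_gpu(rank, gpus_per_node=GPUS_PER_NODE):
    """ASSUMPTION: block placement, so ranks 0-7 sit on the first node."""
    return divmod(rank, gpus_per_node)


example = {"SLURM_PROCID": "15", "SLURM_NTASKS": "32", "SLURM_LOCALID": "7"}
print(torch_env_from_slurm(example, master_addr="node01"))
print(node_and_gpu(15))  # (node index, GPU index on that node)
```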

32 GPU Ranks — 4 nodes × 8 GPUs

node01

node02

node03

node04

Rendezvous — processes find each other

Master
node01:29500

TCP rendezvous

            ✓ After rendezvous: TCP carries no training data. RDMA takes over.
          

🔗 RDMA — How Memory Moves Without the CPU

InfiniBand invented RDMA → RoCEv2 brought it to Ethernet

InfiniBand Origins (1999)

            IB is a completely separate network — its own cables, switches, NICs.

            Key invention: Queue Pairs (QP)

            Send Queue — app drops jobs here

            Recv Queue — app posts buffers for incoming messages

            Completion Queue — NIC signals done

            No CPU involved after setup.

RoCEv2 = IB Transport on Ethernet

            Take IB's reliability engine → wrap it in UDP/IP/Ethernet.

            Now RDMA is routable across spine-leaf switches. No special IB switches required.

Queue Pair — The Core of RDMA

IB Reliability Engine (in NIC silicon, not software)

1

PSN Assignment

Each packet gets a sequence number. Packet 1=PSN 5000, packet 2=PSN 5001...

2

Receiver Tracks PSN

Expects 5000, gets 5000 ✓, expects 5001...

3

GAP Detected

Gets 5002 but expected 5001 → sends NAK(5001)

4

Sender Retransmits

Go-Back-N from PSN 5001 onward

5

ACK = Cumulative

"All received up to PSN 11143" — one ACK covers thousands of packets

All PSN/ACK/NAK/retransmit logic lives in NIC hardware. Zero CPU. Zero kernel.
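
The receiver half of the five steps above can be modelled in a few lines. This is a toy software model of logic that really lives in NIC hardware. The function name and event labels are invented for illustration, and duplicate (already received) packets are not treated specially.

```python
PSN_MODULUS = 1 << 24  # the PSN field is 24 bits wide, so it wraps around


def receive(expected_psn, arrivals):
    """Accept in-order packets, NAK once on a gap, discard until it is filled.

    Returns (events, next_expected_psn).
    """
    events = []
    nak_outstanding = False
    for psn in arrivals:
        if psn == expected_psn:
            events.append(("accept", psn))
            expected_psn = (expected_psn + 1) % PSN_MODULUS
            nak_outstanding = False
        elif not nak_outstanding:
            # Gap detected: ask the sender to go back to the missing PSN.
            events.append(("nak", expected_psn))
            nak_outstanding = True
        else:
            # Still waiting for the retransmission; drop out-of-order data.
            events.append(("discard", psn))
    return events, expected_psn


# 5001 is lost, then the sender goes back and resends from 5001.
events, next_psn = receive(5000, [5000, 5002, 5003, 5001, 5002, 5003])
for event in events:
    print(event)
```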

📦 RoCEv2 Packet — Layer by Layer

IB transport wrapped inside UDP/IP/Ethernet

Packet Stack (built bottom-up)

Ethernet src/dst MAC — L2 framing

IP Header src=10.0.0.1 dst=10.0.0.2 ECN capable

UDP :4791 dst port 4791 = always RoCEv2

BTH OpCode | QPN=17 | PSN=5000

RETH VA=0x7f800000 RKEY=0x1234 Len=24MB

Payload 4096 bytes of gradient data

ICRC Invariant CRC — integrity check

What Each Header Means

ETHERNET FRAME

Standard L2 framing. Gets rewritten at every hop (switch swaps MAC to next-hop MAC). Invisible to RDMA logic.

IP HEADER

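Makes the packet routable — can cross subnets, travel through spine switches. ECMP typically hashes on the 5-tuple (src/dst IP, protocol, src/dst UDP port), and RoCEv2 NICs vary the UDP source port per flow to spread traffic across paths. ECN bits live here.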

UDP — Port 4791

Just a thin wrapper. Port 4791 is IANA-reserved for RoCEv2. No TCP reliability here — that's in the BTH.

BTH — Base Transport Header (12 bytes)

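OpCode: RDMA_WRITE_FIRST / MIDDLE / LAST / ONLY / ACKNOWLEDGE (ACKs and NAKs share this opcode; a syndrome field in the following AETH header tells them apart)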
QPN: Which Queue Pair this belongs to
PSN: Packet Sequence Number — the reliability engine

RETH — RDMA Extended Transport Header

Virtual Address: Exactly where in remote GPU HBM to write
RKEY: Capability token — remote NIC validates this before DMA
Length: Total transfer size (24MB in our example)

PAYLOAD + ICRC

Up to 4096 bytes of actual gradient data. ICRC is a CRC over the invariant fields — catches corruption that switches may introduce.
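
To make the BTH and RETH fields concrete, here is a sketch that packs them with Python's `struct` module, using the values from the packet stack above. The bit layout and the opcode value follow my reading of the InfiniBand specification and should be checked against it before anyone relies on exact positions. The function names are invented, and the flag bits are simply left at zero.

```python
import struct

RDMA_WRITE_FIRST = 0x06  # ASSUMPTION: RC opcode value as listed in the IB spec
DEFAULT_PKEY = 0xFFFF    # default partition key


def pack_bth(opcode, dest_qpn, psn, pkey=DEFAULT_PKEY, ack_req=False):
    """12-byte Base Transport Header.

    Layout: opcode (8 bits) | flags (8) | P_Key (16) |
            reserved (8) + dest QPN (24) | AckReq (1) + reserved (7) + PSN (24)
    """
    flags = 0  # solicited event, migration, pad count, version: zero here
    qpn_word = dest_qpn & 0xFFFFFF
    psn_word = (int(ack_req) << 31) | (psn & 0xFFFFFF)
    return struct.pack("!BBHII", opcode, flags, pkey, qpn_word, psn_word)


def pack_reth(virtual_addr, rkey, dma_length):
    """16-byte RETH: 64-bit virtual address, 32-bit RKEY, 32-bit length."""
    return struct.pack("!QII", virtual_addr, rkey, dma_length)


bth = pack_bth(RDMA_WRITE_FIRST, dest_qpn=17, psn=5000)
reth = pack_reth(0x7F800000, 0x1234, 24 * 1024 * 1024)
print(len(bth), bth.hex())
print(len(reth), reth.hex())
```

In a multi-packet write, the RETH travels only in the first packet; the later packets carry just the BTH and payload.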

🔄 NCCL — The Choreographer

Decides who sends what to whom. Never touches a packet.

The Stack — Who Lives Where

PyTorch / JAX
calls ncclAllReduce()

NCCL
who sends what to whom

libibverbs
posts WQEs to NIC

NIC Hardware
packets, ACK, retransmit

Switch Fabric
ECMP, ECN, PFC

NCCL never sees a packet. NIC never knows about AllReduce. Switch never knows about gradients.

Ring AllReduce — 4 GPUs, gradient tensor [A,B,C,D]

Phase Explanation

Phase 1: ReduceScatter

                Each GPU sends one chunk to next GPU in ring.

                Receiver adds (reduces) it to its own chunk.

                After N-1 rounds: each GPU owns one fully summed chunk.

Phase 2: AllGather

                Each GPU sends its complete chunk around the ring.

                After N-1 rounds: every GPU has the complete summed tensor.

                Every link used. No bottleneck.
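
The two phases can be checked with a small pure-Python simulation. Each "GPU" is a list of N numbers standing in for N chunks. This is a sketch of the ring schedule only, not of how NCCL implements it, and the function name is invented.

```python
def ring_allreduce(tensors):
    """Simulate ring AllReduce; tensors[r] is rank r's list of N chunk values."""
    n = len(tensors)
    data = [list(t) for t in tensors]  # copy so the input is not mutated

    # Phase 1: ReduceScatter. In round s, rank r sends chunk (r - s) % n to
    # rank (r + 1) % n, which adds it to its own copy of that chunk.
    for s in range(n - 1):
        sends = [(r, (r - s) % n, data[r][(r - s) % n]) for r in range(n)]
        for r, chunk, value in sends:
            data[(r + 1) % n][chunk] += value
    # Rank r now owns the fully summed chunk (r + 1) % n.

    # Phase 2: AllGather. In round s, rank r forwards chunk (r + 1 - s) % n
    # and the receiver overwrites its copy with the finished value.
    for s in range(n - 1):
        sends = [(r, (r + 1 - s) % n, data[r][(r + 1 - s) % n]) for r in range(n)]
        for r, chunk, value in sends:
            data[(r + 1) % n][chunk] = value
    return data


gpus = [
    [1, 2, 3, 4],
    [10, 20, 30, 40],
    [100, 200, 300, 400],
    [1000, 2000, 3000, 4000],
]
for rank, result in enumerate(ring_allreduce(gpus)):
    print(rank, result)
```

All sends in a round are collected before any are applied, which mirrors the fact that every GPU sends and receives at the same time in each ring step.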

🔐 QP Setup — Establishing RDMA Connections

Happens once at NCCL init. After this: pure RDMA, zero TCP.


QP Setup Sequence

1

ibv_alloc_pd()

Allocate Protection Domain — a security namespace for this process's RDMA operations

2

ibv_reg_mr(gpu_buffer)

Register GPU HBM memory with NIC. NIC pins the pages. Returns RKEY=0xABCD — a capability token

3

ibv_create_qp()

Create Queue Pair, attached to a Completion Queue created beforehand with ibv_create_cq(). Returns QPN=42. NIC sets up the Send Queue and Receive Queue

4

Exchange via TCP

Send to node02: QPN=42, RKEY=0xABCD, VA=0x7f000000. Receive back: QPN=17, RKEY=0x1234, VA=0x7f800000

5

ibv_modify_qp() → RTS

Transition QP state: INIT → RTR → RTS (Ready To Send). QP now knows remote QPN + IP. RDMA connection live.

✓ RKEY validates that the sender is authorized to write into that specific memory region. Security at the NIC level.
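
The check the remote NIC performs can be pictured as a key match plus a bounds check. The sketch below is a software model of that idea with invented names; real NICs do this in hardware against their own memory translation tables.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryRegion:
    base_va: int               # start of the registered buffer
    length: int                # size of the buffer in bytes
    rkey: int                  # token handed to the remote peer
    remote_write: bool = True  # access right granted at registration time


def may_write(mr, rkey, va, length):
    """True only if the key matches and the write stays inside the region."""
    in_bounds = mr.base_va <= va and va + length <= mr.base_va + mr.length
    return mr.remote_write and rkey == mr.rkey and in_bounds


region = MemoryRegion(base_va=0x7F800000, length=24 * 1024 * 1024, rkey=0x1234)
print(may_write(region, 0x1234, 0x7F800000, 24 * 1024 * 1024))  # right key, in bounds
print(may_write(region, 0xABCD, 0x7F800000, 4096))              # wrong key
print(may_write(region, 0x1234, 0x7F800000, 25 * 1024 * 1024))  # runs past the end
```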

Result — Connection State

Node01 knows:
remote QPN = 17
remote IP = 10.0.0.2
remote VA = 0x7f800000
remote RKEY = 0x1234
Node02 knows:
remote QPN = 42
remote IP = 10.0.0.1
remote VA = 0x7f000000
remote RKEY = 0xABCD
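
The TCP exchange in step 4 only needs to carry three numbers each way. A minimal sketch of that record and of the state walk from step 5 follows. The wire format (two 32-bit fields and one 64-bit field) is a hypothetical choice for illustration, not what NCCL actually sends, and the class and function names are invented.

```python
import struct
from dataclasses import dataclass

WIRE_FORMAT = "!IIQ"  # ASSUMPTION: qpn (32 bit), rkey (32 bit), va (64 bit)


@dataclass(frozen=True)
class ConnInfo:
    qpn: int
    rkey: int
    va: int

    def to_bytes(self):
        return struct.pack(WIRE_FORMAT, self.qpn, self.rkey, self.va)

    @classmethod
    def from_bytes(cls, raw):
        return cls(*struct.unpack(WIRE_FORMAT, raw))


# State walk from step 5: each ibv_modify_qp() call moves one step forward.
NEXT_STATE = {"INIT": "RTR", "RTR": "RTS"}


def advance(state):
    return NEXT_STATE[state]


node01 = ConnInfo(qpn=42, rkey=0xABCD, va=0x7F000000)
node02 = ConnInfo(qpn=17, rkey=0x1234, va=0x7F800000)

# Each side sends its own record and decodes the peer's.
seen_by_node01 = ConnInfo.from_bytes(node02.to_bytes())
seen_by_node02 = ConnInfo.from_bytes(node01.to_bytes())
print(seen_by_node01, advance(advance("INIT")))
```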

✈️ Life of a Packet — Node01 GPU7 → Node02 GPU0

24MB gradient chunk, 6144 packets

Path: NCCL → NIC (node01) → ToR Switch 1 → Spine Switch → ToR Switch 2 → NIC (node02) → GPU0 HBM
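
The packet count follows from the chunk size and the 4096-byte payload. The sketch below segments one RDMA write into packets and labels each with the opcode name it would carry; the starting PSN of 5000 is the value used in the earlier examples, and the function name is invented.

```python
MTU_PAYLOAD = 4096       # payload bytes per packet, from the example
PSN_MODULUS = 1 << 24    # PSNs are 24 bits and wrap around


def segment(total_bytes, first_psn, payload=MTU_PAYLOAD):
    """Yield (psn, opcode_name, payload_len) for one RDMA write."""
    count = max(1, -(-total_bytes // payload))  # ceiling division
    for i in range(count):
        if count == 1:
            name = "RDMA_WRITE_ONLY"
        elif i == 0:
            name = "RDMA_WRITE_FIRST"
        elif i == count - 1:
            name = "RDMA_WRITE_LAST"
        else:
            name = "RDMA_WRITE_MIDDLE"
        size = min(payload, total_bytes - i * payload)
        yield (first_psn + i) % PSN_MODULUS, name, size


packets = list(segment(24 * 1024 * 1024, first_psn=5000))
print(len(packets))   # number of packets for the 24MB chunk
print(packets[0])     # first packet
print(packets[-1])    # last packet
```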

🚦 Congestion Control — ECN & PFC

How the network signals and handles congestion without dropping packets

ECN — Explicit Congestion Notification (preferred)

            1. Switch egress queue fills past the configured ECN marking threshold (50% in this example)

            2. Switch marks ECN=CE bit in IP header — packet still forwarded, NOT dropped

            3. Receiver NIC sees CE mark → sends CNP back to sender

            4. Sender NIC runs DCQCN algorithm: cut rate (by ~1/5 in this example; the actual factor depends on DCQCN's running congestion estimate), then increase slowly

✓ No packets dropped. No CPU involved. Rate adjusts in microseconds in NIC hardware.
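
The sender's reaction in step 4 has the shape "cut on CNP, otherwise creep back up". The sketch below is a deliberately simplified model of that shape, not the real DCQCN state machine. The cut of one fifth comes from the text above, while the 400 Gb/s line rate and the 5 Gb/s recovery step are hypothetical numbers.

```python
def react(rate_gbps, cnp_received, line_rate_gbps=400.0,
          cut_fraction=0.2, step_gbps=5.0):
    """One update interval of a simplified rate controller.

    ASSUMPTION: fixed cut and fixed additive step; real DCQCN derives the
    cut from a running congestion estimate and recovers in several stages.
    """
    if cnp_received:
        return rate_gbps * (1.0 - cut_fraction)
    return min(line_rate_gbps, rate_gbps + step_gbps)


def simulate(cnp_pattern, start_gbps=400.0):
    """Apply `react` once per interval and return the rate after each one."""
    rates = []
    rate = start_gbps
    for cnp in cnp_pattern:
        rate = react(rate, cnp)
        rates.append(rate)
    return rates


# Two congestion notifications, then a quiet period.
for rate in simulate([True, True, False, False, False]):
    print(f"{rate:.1f} Gb/s")
```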

PFC — Priority Flow Control (emergency brake)

            When a switch buffer is about to overflow completely:

            Switch → sends PAUSE frame upstream

            "Stop sending on priority 3 for up to 65,535 pause quanta" (each quantum is 512 bit times, so about 335μs at 100 Gb/s)

            Upstream NIC freezes that traffic class.

            Buffer drains. NIC resumes.

⚠ PFC storms: if misconfigured, pauses propagate upstream and can deadlock the entire fabric. Enable PFC only on the RDMA priority class.
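
PFC pause times are expressed in quanta of 512 bit times, carried in a 16-bit field per priority, so the wall-clock duration depends on link speed. A small sketch of the conversion, with an invented function name and a few example link speeds:

```python
PAUSE_QUANTUM_BITS = 512  # one pause quantum = 512 bit times
MAX_QUANTA = 0xFFFF       # the per-priority timer is a 16-bit field


def pause_seconds(quanta, link_bps):
    """Wall-clock length of a pause of `quanta` quanta on a link of `link_bps`."""
    return quanta * PAUSE_QUANTUM_BITS / link_bps


for gbps in (10, 100, 400):
    micros = pause_seconds(MAX_QUANTA, gbps * 1e9) * 1e6
    print(f"{gbps:>3} Gb/s: longest single pause is about {micros:.1f} us")
```

A switch that needs to hold traffic longer simply keeps refreshing the pause before it expires.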

Packet Loss — NAK & Retransmit

              Packet PSN=7200 is dropped (rare):

              Receiver: got 7199 ✓, got 7201 → GAP!

              Receiver: sends NAK(7200) to sender

              Sender: stops new sends, retransmits from 7200

              (Go-Back-N: 7200, 7201, 7202...)

✗ One packet loss in AllReduce across 1000 GPUs stalls the entire collective. This is why lossless fabric is mandatory.
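
The sender's side of Go-Back-N is what makes a single drop expensive: everything already in flight after the lost packet is sent again. A sketch of that cost, ignoring PSN wraparound. The function name and the in-flight count of 64 packets are hypothetical.

```python
PAYLOAD_BYTES = 4096  # per-packet payload, from the example


def go_back_n_resend(nak_psn, next_unsent_psn):
    """PSNs the sender retransmits after NAK(nak_psn).

    `next_unsent_psn` is the first PSN the sender had not yet put on the wire.
    """
    return list(range(nak_psn, next_unsent_psn))


# ASSUMPTION: 64 packets were in flight when NAK(7200) arrived.
resend = go_back_n_resend(7200, 7264)
wasted = (len(resend) - 1) * PAYLOAD_BYTES  # all but the lost one were sent twice
print(f"retransmit {len(resend)} packets, {wasted} bytes sent twice")
```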

✅ Completion — NCCL Gets the Signal

Remote NIC writes to GPU memory, ACK travels back, NCCL moves to next ring step

Completion Sequence

1

All 6144 packets received

Remote NIC (node02) received all packets for the 24MB chunk: PSNs 5000–11143, in order.

2

RKEY validated

Remote NIC checks RKEY=0x1234 against the registered memory region when the first packet (the one carrying the RETH) arrives. Valid ✓. Proceeds to DMA.

3

DMA into GPU HBM

NIC writes each payload directly into GPU0 HBM as packets arrive: 24MB in total, starting at virtual address 0x7f800000. PCIe DMA. CPU = sleeping.

4

ACK sent back

Remote NIC sends ACK(PSN=11143) back to node01 NIC. "All 6144 packets received and written."

5

CQE posted

Node01 NIC receives ACK, posts Completion Queue Element to NCCL's Completion Queue.

6

NCCL polls CQ

NCCL calls ibv_poll_cq(), gets CQE. Transfer complete. Moves to next ring step. No CPU interrupt — polling.

7

AllReduce complete

After 31+31 ring steps, every GPU has the full summed gradient tensor. PyTorch optimizer runs.
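
The "31+31" in step 7 is the ring step count for 32 GPUs, and the same formula gives the data each GPU sends. In the sketch below, the tensor size assumes the 24MB figure is one ring chunk, so the full tensor is 32 × 24MB; that reading is an assumption, as are the function names.

```python
MB = 1024 * 1024


def ring_steps(n_gpus):
    """ReduceScatter and AllGather each take N - 1 steps."""
    return 2 * (n_gpus - 1)


def bytes_sent_per_gpu(tensor_bytes, n_gpus):
    """Every step sends one chunk of size tensor_bytes / N."""
    return ring_steps(n_gpus) * tensor_bytes // n_gpus


N_GPUS = 32
TENSOR_BYTES = N_GPUS * 24 * MB  # ASSUMPTION: 24MB is one of 32 ring chunks

print(ring_steps(N_GPUS))
print(bytes_sent_per_gpu(TENSOR_BYTES, N_GPUS) // MB, "MB sent per GPU")
```

Per-GPU traffic approaches twice the tensor size as N grows, which is why the ring keeps every link busy without overloading any single one.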

Timeline — One Training Step

t=0ms    Forward pass starts
t=200ms  Forward pass done
t=201ms  Backward (gradient compute)
t=400ms  Gradients in GPU HBM
t=401ms  NCCL posts RDMA WQEs
t=401ms  NIC starts sending packets
t=403ms  AllReduce complete ✓
t=403ms  Optimizer step
t=405ms  Next batch starts

AllReduce across 32 GPUs: ~2ms
During that 2ms: zero CPU involvement in the data path.

Who Did What — Final Recap

            SLURM → allocated nodes & ranks

            NCCL  → ring algorithm, WQE posting

            NIC   → packets, PSN, ACK, retransmit

            Switch → ECMP, ECN, PFC

            GPU   → compute & HBM storage

🗺️ Full Picture — Who Owns What

Every layer does exactly one job

Layer	Component	Owns	Does NOT own

The AllReduce itself took about 2 milliseconds.
32 GPUs  ·  4 servers  ·  Terabytes moved per second  ·  Zero CPU in the data path

Packet Journey Summary