
What Training Is

:::info Bridge: what you already know
You've sized links, tuned queues, and diagnosed drops for years. You know that the network is either the bottleneck or it isn't. Training is the same problem with a new workload: N GPUs computing in parallel, and a synchronized burst of traffic after every step that determines whether those GPUs are busy or idle.
:::


The training loop in one diagram

Ignore the math. From the network's perspective, one training "step" looks like this:

┌─────────────────────────────────────────────────────────────────┐
│                        ONE TRAINING STEP                        │
│                                                                 │
│  ① FORWARD PASS         ② BACKWARD PASS       ③ ALLREDUCE       │
│  ─────────────────      ──────────────────    ─────────────     │
│  GPU runs data          GPU computes how      All N GPUs        │
│  through the model.     wrong it was and      synchronize       │
│  All N GPUs do this     in which direction    their results     │
│  simultaneously.        to fix it.            over network.     │
│                                                                 │
│  Network: IDLE          Network: IDLE         Network: BUSY     │
│  Duration: ~60–80%      Duration: ~15–30%     Duration: rest    │
│                                                                 │
│  ④ OPTIMIZER STEP                                               │
│  ─────────────────────                                          │
│  Each GPU updates its                                           │
│  parameters independently.                                      │
│  Network: IDLE                                                  │
└─────────────────────────────────────────────────────────────────┘
Repeat millions of times → trained model

The network is only involved during ③. But ③ is bulk-synchronous — it finishes when the slowest GPU-to-GPU path finishes. One slow link stalls all N GPUs every single step.


What a "gradient" is (network engineer translation)

You don't need to understand the calculus. You need to understand what moves on the wire.

A gradient is a vector of numbers — one number per model parameter. A 7B parameter model produces a 7-billion-element gradient vector after every backward pass. In float32, that's 28 GB per GPU per step.

:::tip The BGP analogy
A gradient is to training what a routing update is to BGP. Each GPU computed its own view of "how the model should change", based on the data it saw. To improve the model, everyone needs to agree on the average view. The network carries that consensus process.

BGP: each router knows its local topology → exchange → converge on shared RIB
Training: each GPU sees its local data → AllReduce → converge on shared gradient
:::

The AllReduce takes those 28 GB from every GPU, averages them, and gives every GPU the same result. That averaged gradient is what gets applied to update the model. Now all N GPUs have identical parameters and are ready for the next step.
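To make the averaging concrete, here is a minimal pure-Python sketch of what AllReduce computes. It models only the result, not how NCCL moves the data; the function name `allreduce_mean` and the toy three-element gradients are illustrative assumptions.

```python
def allreduce_mean(per_gpu_gradients):
    """Return the element-wise mean gradient, replicated once per GPU.

    Models the *result* of AllReduce only. Real implementations exchange
    chunks between peers instead of gathering everything in one place.
    """
    n_gpus = len(per_gpu_gradients)
    length = len(per_gpu_gradients[0])
    if any(len(g) != length for g in per_gpu_gradients):
        raise ValueError("all gradients must have the same length")
    mean = [sum(g[i] for g in per_gpu_gradients) / n_gpus for i in range(length)]
    # Every GPU ends up holding an identical copy of the averaged gradient.
    return [list(mean) for _ in range(n_gpus)]


# Toy example: 4 GPUs, 3 parameters each (hypothetical values).
local = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [2.0, 2.0, 2.0], [2.0, 2.0, 2.0]]
synced = allreduce_mean(local)
print(synced[0])
```

The input is one gradient per GPU and the output is the same averaged gradient on every GPU, which is why all N replicas stay identical after the optimizer step.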


The numbers that matter

Before going further, here's the scale that makes this a network problem:

| Model | Parameters | Gradient size (fp32) | Gradient size (bf16) |
|---|---|---|---|
| 7B (LLaMA-class) | 7 billion | 28 GB | 14 GB |
| 70B | 70 billion | 280 GB | 140 GB |
| 405B (LLaMA 3) | 405 billion | 1.6 TB | 810 GB |
| ~1.8T (GPT-4 estimate) | ~1.8 trillion | ~7.2 TB | ~3.6 TB |
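The gradient sizes in the table are just parameter count times bytes per element. A small sketch of that arithmetic follows; the helper names `gradient_bytes` and `human` are made up for this example, and decimal units (1 GB = 10^9 bytes) are assumed to match the table.

```python
# Bytes per gradient element for the two precisions used in the table.
BYTES_PER_ELEMENT = {"fp32": 4, "bf16": 2}


def gradient_bytes(num_params, dtype="fp32"):
    """One gradient value per parameter, so size = parameters x element width."""
    return num_params * BYTES_PER_ELEMENT[dtype]


def human(num_bytes):
    """Format a byte count in decimal units (1 GB = 10**9 bytes)."""
    for unit, scale in (("TB", 10**12), ("GB", 10**9), ("MB", 10**6)):
        if num_bytes >= scale:
            return f"{num_bytes / scale:g} {unit}"
    return f"{num_bytes} B"


for name, params in (("7B", 7 * 10**9), ("70B", 70 * 10**9), ("405B", 405 * 10**9)):
    fp32 = human(gradient_bytes(params, "fp32"))
    bf16 = human(gradient_bytes(params, "bf16"))
    print(f"{name}: fp32={fp32}, bf16={bf16}")
```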

And the cluster sizes where this runs:

| Job scale | GPUs | Nodes (8 GPU/node) | Simultaneous AllReduce flows |
|---|---|---|---|
| Small | 64 | 8 | 64 × 63 = 4,032 |
| Medium | 512 | 64 | 512 × 511 = 261,632 |
| Large | 4,096 | 512 | 4,096 × 4,095 ≈ 16.8M |
| Hyperscale | 32,768+ | 4,096+ | |

Those flows all launch simultaneously. Every step. That's the traffic burst the fabric has to absorb.
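The flow counts above are the all-pairs figure N × (N − 1). The sketch below reproduces them; `all_pairs_flows` is a hypothetical helper. Treat the result as an upper bound, since the number of flows actually in flight depends on the collective algorithm NCCL selects.

```python
def all_pairs_flows(num_gpus):
    """Directed GPU-to-GPU pairs when every GPU sends to every other GPU."""
    return num_gpus * (num_gpus - 1)


for gpus in (64, 512, 4096):
    print(f"{gpus:>5} GPUs -> {all_pairs_flows(gpus):,} flows")
```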


The synchronization problem

Training uses a Bulk Synchronous Parallel (BSP) execution model. In BSP:

  1. All N workers compute in parallel (forward + backward pass)
  2. All N workers synchronize — barrier
  3. All N workers update their state
  4. Repeat

The barrier in step 2 is the AllReduce. It's not done until every GPU's gradient contribution has reached every other GPU. The math:

step_time = compute_time + max(allreduce_time across all paths)

Not the average. Not the median. The maximum. The slowest path sets the pace for every GPU in the job.
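A short sketch of the formula shows why the maximum matters. The timings are invented for illustration: seven paths finish in 20 ms, one takes 80 ms, and compute takes 400 ms.

```python
from statistics import mean


def bsp_step_time_ms(compute_ms, allreduce_path_ms):
    """Step time under a barrier: compute plus the slowest AllReduce path."""
    return compute_ms + max(allreduce_path_ms)


# Hypothetical timings: seven healthy paths and one congested path.
paths_ms = [20.0] * 7 + [80.0]
actual = bsp_step_time_ms(400.0, paths_ms)
if_averaged = 400.0 + mean(paths_ms)  # what you would get if the barrier averaged

print(f"actual step time: {actual} ms")
print(f"if paths averaged: {if_averaged} ms")
```

The average path time barely moves when one path degrades, but the step time follows the slow path exactly. Averaged per-link dashboards can therefore look healthy while the job slows down.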

[Figure: eight GPUs in a ring (a collective). Seven are marked done in green; one is still working in red, with a congested link drawn as a dashed red line. Below, a timeline shows seven GPUs finishing quickly while the slow GPU's bar stretches much further; the step ends only when the slow one finishes.]

One slow link stalls every GPU in the job. Every step.

This is why a 0.1% packet loss rate that TCP handles gracefully is a training-job killer:

  • TCP: slow start, SACK, retransmit — the connection recovers in tens of milliseconds
  • RDMA AllReduce: PSN gap → NACK → retransmit → all other GPUs at the barrier wait → step time doubles

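At 0.1% loss, with many thousands of packets in flight per AllReduce, at least one path almost certainly hits a retransmit on every step. One 10ms retransmit stall on a 500ms step is 10 ÷ 500 = 2% overhead. Because every GPU waits at the barrier for the slowest path, that delay is paid by the whole job rather than by one flow, and repeated retransmits on the critical path are how a nominal 2% overhead ends up looking like a 20% throughput loss.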
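The two quantities behind that estimate can be sketched as follows. The packet count per step is a hypothetical placeholder (real AllReduces on multi-gigabyte gradients send far more), and losses are assumed independent, which real congestion drops are not.

```python
def p_any_loss(loss_rate, packets):
    """Probability that at least one of `packets` independent packets is lost."""
    return 1.0 - (1.0 - loss_rate) ** packets


def stall_overhead(stall_ms, step_ms):
    """Fraction of a step spent waiting on one retransmit stall."""
    return stall_ms / step_ms


PACKETS_PER_STEP = 10_000  # hypothetical; chosen only to show the trend

print(f"P(at least one loss per step) = {p_any_loss(0.001, PACKETS_PER_STEP):.5f}")
print(f"one 10 ms stall on a 500 ms step = {stall_overhead(10, 500):.1%}")
```

Even at this modest packet count, a loss-free step is the exception. The retransmit cost should be treated as a per-step tax rather than an occasional event.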


MFU — the number that tells you everything

MFU (Model FLOP Utilization) is the ratio of achieved compute to theoretical peak:

MFU = achieved_throughput_tokens_per_second / peak_tokens_per_second_at_100%_compute

Healthy numbers on H100 SXM:

  • 55–70% MFU — good. Network is not the bottleneck.
  • 40–55% MFU — investigate. Something is wrong.
  • < 40% MFU — something is definitely wrong.
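The ratio and the thresholds above translate directly into a few lines of code. This is a sketch: the function names are invented here, and the cut-offs are simply the ones listed for H100 SXM in this section.

```python
def mfu(achieved_tokens_per_s, peak_tokens_per_s):
    """Model FLOP Utilization as a fraction between 0 and 1."""
    if peak_tokens_per_s <= 0:
        raise ValueError("peak throughput must be positive")
    return achieved_tokens_per_s / peak_tokens_per_s


def classify_mfu(value):
    """Map an MFU fraction onto the bands listed in this section."""
    if value >= 0.55:
        return "good"
    if value >= 0.40:
        return "investigate"
    return "something is definitely wrong"


# Hypothetical throughput numbers, for illustration only.
for achieved in (580, 450, 300):
    value = mfu(achieved, 1000)
    print(f"MFU {value:.0%}: {classify_mfu(value)}")
```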

Low MFU has three causes: slow compute, slow network, or slow storage (data loading). Here's how to isolate them:

# Run nccl-tests to measure actual AllReduce bandwidth
# On all nodes simultaneously:
#   -b 1G -e 8G -f 2   test message sizes from 1GB to 8GB, doubling each time
#   -g 8               8 GPUs per node
./build/all_reduce_perf \
  -b 1G -e 8G -f 2 \
  -g 8 \
  --op sum

# Healthy output (8-node, 8×400G NICs):
# Size Time AlgBW BusBW
# 1073741824 42.78ms 25.09GB/s 44.04GB/s
#
# If BusBW << expected (8 NICs × 400Gbps ÷ 8 GPUs = ~400 Gbps ≈ 50 GB/s per GPU):
# → network is the bottleneck.

# Check which NICs NCCL actually found:
NCCL_DEBUG=INFO \
NCCL_DEBUG_SUBSYS=NET \
python -c "import torch.distributed; ..." 2>&1 | grep "NET/IB"

# Healthy: all 8 NICs listed
# NCCL INFO NET/IB : Using [0]mlx5_0 [1]mlx5_1 [2]mlx5_2 [3]mlx5_3
# [4]mlx5_4 [5]mlx5_5 [6]mlx5_6 [7]mlx5_7
#
# Unhealthy: fewer NICs listed — bandwidth cap at partial link count

What the network actually sees

During a training step on a 512-GPU, 7B-model job:

Step 1 — Forward pass (no network traffic):

GPU 0: [batch_0] → embedding → attention → FFN → attention → FFN ... → logits
GPU 1: [batch_1] → embedding → attention → FFN → attention → FFN ... → logits
... (all 512 GPUs running independently)
Network: idle

Step 2 — Backward pass (no network traffic):

GPU 0: logits → ∂loss/∂FFN_weights → ∂loss/∂attention_weights → ...
GPU 1: logits → ∂loss/∂FFN_weights → ...
... (all 512 GPUs computing gradients independently)
Network: idle

Step 3 — AllReduce (the network's moment):

All 512 GPUs simultaneously:
NCCL posts 261,632 QP send operations (512 × 511)
Each: RDMA WRITE of one chunk of the 28GB gradient to a peer

What hits the fabric:
512 × 28GB ÷ (ring steps) worth of RoCEv2 WRITE packets
All launched within microseconds of each other
Every GPU sending to every other GPU
Every ToR switch sees incast from all upstream nodes
PFC / DCQCN kicks in to prevent buffer overflow

Duration at 8×400G per node (3.2 Tb/s = 400 GB/s): 28GB × 2 ÷ 400GB/s ≈ 140ms
But with congestion, ECMP imbalance, or PFC events: several times longer

The gap between that ideal 140ms and what a congested fabric actually delivers is where the network engineer's job lives.
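The ideal duration can be sketched from first principles. This assumes a ring AllReduce, in which each participant sends and receives 2(N − 1)/N times the gradient size, and treats a node's eight 400G NICs as one aggregate pipe. Both are simplifications, and the function names are hypothetical. Note the division by 8: NIC speeds are in bits per second, gradient sizes in bytes.

```python
def ring_allreduce_bytes(gradient_bytes, n_ranks):
    """Bytes each participant sends in a ring AllReduce: 2(N-1)/N x size."""
    return 2 * (n_ranks - 1) / n_ranks * gradient_bytes


def ideal_transfer_seconds(bytes_to_move, link_gbps):
    """Time to move the bytes at full line rate, with no congestion."""
    bytes_per_second = link_gbps * 1e9 / 8  # bits/s -> bytes/s
    return bytes_to_move / bytes_per_second


GRADIENT_BYTES = 28e9          # 7B parameters in fp32
NODE_GBPS = 8 * 400            # eight 400G NICs per node

to_move = ring_allreduce_bytes(GRADIENT_BYTES, n_ranks=512)
ideal_s = ideal_transfer_seconds(to_move, NODE_GBPS)
print(f"{to_move / 1e9:.1f} GB per node, ideal {ideal_s * 1000:.0f} ms")
```

This is a floor, not a prediction. Anything the fabric adds (congestion, ECMP imbalance, PFC pauses) lands on top of it.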


What you should be able to do now

  • Explain what happens during a training step and when the network is involved
  • Describe why one slow path stalls all N GPUs (BSP barrier)
  • Calculate the expected gradient size for a model given its parameter count
  • Run nccl-tests and interpret the BusBW number vs expected
  • Identify whether low MFU is compute, network, or storage using the commands above

Where it breaks

1. One NIC negotiated at the wrong speed

A single NIC negotiated 100G instead of 400G. NCCL uses all 8 NICs in a round-robin ring. Every 8th AllReduce chunk flows through the slow NIC at 25% of expected bandwidth. MFU drops to ~75% of healthy.

Detection:

ethtool eth0 | grep Speed
# Should say: Speed: 400000Mb/s
# If it says: Speed: 100000Mb/s → link negotiation failure
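With eight NICs per node the same check is easier to script. The sketch below parses captured `ethtool` output rather than running the command, so it can be tested anywhere; the function names, interface names, and sample text are illustrative assumptions.

```python
import re

SPEED_RE = re.compile(r"Speed:\s*(\d+)Mb/s")


def parse_link_speed_mbps(ethtool_output):
    """Return the negotiated speed in Mb/s, or None if it is not reported."""
    match = SPEED_RE.search(ethtool_output)
    return int(match.group(1)) if match else None


def slow_links(outputs_by_iface, expected_mbps=400_000):
    """Names of interfaces whose negotiated speed is not the expected one."""
    return sorted(
        iface
        for iface, output in outputs_by_iface.items()
        if parse_link_speed_mbps(output) != expected_mbps
    )


# Hypothetical captured output for two interfaces.
captured = {
    "eth0": "Settings for eth0:\n\tSpeed: 400000Mb/s\n\tDuplex: Full\n",
    "eth1": "Settings for eth1:\n\tSpeed: 100000Mb/s\n\tDuplex: Full\n",
}
print(slow_links(captured))
```

An interface that reports no numeric speed at all (for example a link that is down) is also flagged, because `parse_link_speed_mbps` returns `None` for it.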

2. nvidia-peermem not loaded

Without this kernel module, NCCL cannot DMA from GPU memory directly to the NIC. It falls back to a CPU bounce buffer: GPU HBM → CPU RAM → NIC. Bandwidth drops by 4–5×.

Detection:

lsmod | grep nvidia_peermem
# No output → module not loaded → GPUDirect RDMA broken

# NCCL will log:
# NCCL WARN Cuda memory type: cudaMemoryTypeDevice not supported for GDR
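The `lsmod` check can be folded into a node health script. A minimal sketch that works on captured `lsmod` output follows; the function name and the sample text are assumptions, and the match is on the exact module name in the first column, which a bare `grep` does not guarantee.

```python
def module_loaded(lsmod_output, module="nvidia_peermem"):
    """True if `module` appears as a module name (first column) in lsmod output."""
    for line in lsmod_output.splitlines():
        fields = line.split()
        if fields and fields[0] == module:
            return True
    return False


# Hypothetical lsmod output from a healthy node.
sample = (
    "Module                  Size  Used by\n"
    "nvidia_peermem         16384  0\n"
    "nvidia              56000000  12 nvidia_peermem\n"
)
print(module_loaded(sample))
```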

3. NCCL not picking up all NICs

If NCCL_IB_HCA is wrong or the RDMA device plugin didn't mount all NICs into the pod, NCCL falls back to fewer paths. A node with 8 NICs running on 4 gets 50% of expected AllReduce bandwidth.

Detection:

NCCL_DEBUG=INFO python train.py 2>&1 | grep "NET/IB"
# Count the [N]mlx5_X entries — should match your NIC count
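Counting the entries by eye gets tedious across many nodes. The sketch below pulls the device names out of the `NET/IB` line. It assumes the `[index]device` layout shown earlier in this section; the exact log format can differ between NCCL versions, so treat the regular expression as a starting point.

```python
import re

DEVICE_RE = re.compile(r"\[\d+\](\w+)")


def nccl_ib_devices(log_text):
    """Unique device names listed on NCCL 'NET/IB' lines, in first-seen order."""
    found = []
    for line in log_text.splitlines():
        if "NET/IB" in line:
            found.extend(DEVICE_RE.findall(line))
    return list(dict.fromkeys(found))  # de-duplicate, keep order


# Hypothetical log excerpt: two ranks reporting the same four devices.
log = (
    "NCCL INFO NET/IB : Using [0]mlx5_0 [1]mlx5_1 [2]mlx5_2 [3]mlx5_3\n"
    "NCCL INFO NET/IB : Using [0]mlx5_0 [1]mlx5_1 [2]mlx5_2 [3]mlx5_3\n"
)
devices = nccl_ib_devices(log)
print(len(devices), devices)
```

Comparing `len(devices)` with the NIC count you expect per node turns this into a pass/fail check.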

4. PFC storm stalling one node's egress

One GPU node's NIC is misconfigured (DCQCN disabled, MTU mismatch, GID index wrong). Its gradient sends are either unpaced or repeatedly retried, depending on the fault. The ToR switch buffers fill. PFC PAUSE is sent upstream. The sender stops. AllReduce stalls for the duration of the pause. All 511 other GPUs wait.

Detection:

ethtool -S eth0 | grep -i pause
# tx_pause_ctrl_phy: 0 ← sending PAUSE frames (bad if non-zero)
# rx_pause_ctrl_phy: 12847 ← receiving PAUSE (this NIC is being paused)
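These counters are cumulative, so one reading only says that pauses happened at some point since the counters were reset. Comparing two snapshots taken around a training run is more useful. The sketch below parses captured `ethtool -S` output and computes the difference; the function names and sample values are illustrative.

```python
import re

PAUSE_LINE = re.compile(r"^\s*(\w*pause\w*):\s*(\d+)\s*$", re.IGNORECASE)


def pause_counters(ethtool_stats_output):
    """Extract every counter whose name contains 'pause' as {name: value}."""
    counters = {}
    for line in ethtool_stats_output.splitlines():
        match = PAUSE_LINE.match(line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


def pause_deltas(before, after):
    """Increase of each pause counter between two snapshots."""
    return {name: after[name] - before.get(name, 0) for name in after}


# Hypothetical snapshots taken before and after a few training steps.
snap_before = "     tx_pause_ctrl_phy: 0\n     rx_pause_ctrl_phy: 12847\n     rx_packets_phy: 100\n"
snap_after = "     tx_pause_ctrl_phy: 0\n     rx_pause_ctrl_phy: 13901\n     rx_packets_phy: 900\n"

deltas = pause_deltas(pause_counters(snap_before), pause_counters(snap_after))
print(deltas)
```

A counter that keeps climbing while the job runs points at an active problem; a large but static value may be left over from an earlier incident.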

One RCA: 58% MFU on a 512-GPU job

Symptoms:
A 512-GPU LLaMA-70B training job ran at 58% MFU for 11 hours. Expected was 64%. The team assumed the job was "just running slow" and left it.

What was actually happening:
NCCL logs showed one node (node-047) consistently appearing as the last to complete AllReduce — every step, 40–60ms slower than the median. nccl-tests run on node-047 showed 180 GB/s AllReduce BusBW instead of the expected 380 GB/s.

Root cause:
node-047 had been reimaged after a hardware swap. The NIC firmware config (mlxconfig) that sets ROCE_CC_PRIO_MASK_P1=255 (enabling DCQCN on all priorities) was not applied post-reimage. DCQCN was disabled on that node's NICs. During AllReduce, node-047 sent gradient data at full rate with no rate control. Its egress port at the ToR switch hit 95% utilization. The ToR sent PFC PAUSE frames. Node-047 stopped transmitting. By the time it resumed, all other nodes were waiting at the barrier.

Timeline:

T=0ms node-047 sends AllReduce data at full line rate
T=2ms ToR port buffer hits xoff threshold
T=2ms ToR sends PFC PAUSE to node-047
T=2ms node-047 NIC stops transmitting
T=34ms PFC pause quanta expire, node-047 resumes
T=34ms AllReduce for all 512 GPUs completes
(28ms slower than it should have been)
T=34ms All 512 GPUs move to optimizer step
T=34ms Repeat next step

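28ms extra per step × 8,000 steps per hour = 224 seconds of wall-clock time lost per hour. All 512 GPUs sit idle for those 224 seconds, which is roughly 32 GPU-hours wasted for every hour the job runs.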
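The cost arithmetic is worth keeping as a helper, because the same calculation applies to any per-step stall. A sketch using the figures from this incident follows; the function name is hypothetical.

```python
def stall_cost(extra_ms_per_step, steps_per_hour, num_gpus):
    """Return (wall-clock seconds lost per hour, GPU-hours wasted per hour)."""
    wall_seconds = extra_ms_per_step * steps_per_hour / 1000
    gpu_hours = wall_seconds * num_gpus / 3600
    return wall_seconds, gpu_hours


wall_s, gpu_h = stall_cost(extra_ms_per_step=28, steps_per_hour=8000, num_gpus=512)
print(f"{wall_s:.0f} s of wall-clock per hour, {gpu_h:.1f} GPU-hours per hour")
```

Multiplying by the GPU count is the step that is easy to forget: the stall is small per step, but every GPU in the job pays it.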

Fix:

# On every node, after any reimage:
mlxconfig -d /dev/mst/mt4125_pciconf0 set \
ROCE_CC_PRIO_MASK_P1=255 \
ROCE_CC_PRIO_MASK_P2=255

# Validate before submitting any job:
./build/all_reduce_perf -b 1G -e 8G -f 2 -g 8
# Run on every node. Flag any node where BusBW < 90% of median.
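The "flag any node below 90% of median" rule can be automated once the per-node BusBW figures are collected. A minimal sketch follows, assuming you already have one number per node; the node names and values are invented, apart from node-047's figures, which come from this incident.

```python
from statistics import median


def flag_slow_nodes(busbw_by_node, threshold=0.90):
    """Nodes whose BusBW is below `threshold` x the median across all nodes."""
    cutoff = threshold * median(busbw_by_node.values())
    return sorted(node for node, bw in busbw_by_node.items() if bw < cutoff)


# Hypothetical per-node results in GB/s.
results = {"node-001": 380.0, "node-002": 378.0, "node-003": 382.0, "node-047": 180.0}
print(flag_slow_nodes(results))
```

Using the median rather than a fixed target keeps the check meaningful across hardware generations, and a single bad node cannot drag the reference value down the way it would with a mean.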

Lesson: Validate every node individually before a large job. One bad node costs the entire cluster. Add nccl-tests to your compute node provisioning runbook.


What's next

The AllReduce operation above — the one the network has to carry — is implemented by NCCL, which runs on top of RDMA. Before you can understand what goes wrong and how to fix it, you need to understand what RDMA is, how it bypasses the kernel, and what a Queue Pair is.

Next → Collectives — the network's job

Or, if you want to jump ahead: RDMA → explains the mechanics of how the bytes actually move.