What Training Is

:::info Bridge: what you already know
You've sized links, tuned queues, and diagnosed drops for years. You know that the network is either the bottleneck or it isn't. Training is the same problem with a new workload: N GPUs computing in parallel, and a synchronized burst of traffic after every step that determines whether those GPUs are busy or idle.
:::

The training loop in one diagram

Ignore the math. From the network's perspective, one training "step" looks like this:

┌─────────────────────────────────────────────────────────────────┐
│                        ONE TRAINING STEP                        │
│                                                                 │
│  ① FORWARD PASS          ② BACKWARD PASS       ③ ALLREDUCE     │
│  ─────────────────        ──────────────────    ─────────────   │
│  GPU runs data            GPU computes how       All N GPUs     │
│  through the model.       wrong it was and       synchronize    │
│  All N GPUs do this       in which direction     their results  │
│  simultaneously.          to fix it.             over network.  │
│                                                                 │
│  Network: IDLE            Network: IDLE          Network: BUSY  │
│  Duration: ~60–80%        Duration: ~15–30%      Duration: rest │
│                                                                 │
│  ④ OPTIMIZER STEP                                               │
│  ─────────────────────                                          │
│  Each GPU updates its                                           │
│  parameters independently.                                      │
│  Network: IDLE                                                  │
└─────────────────────────────────────────────────────────────────┘
       Repeat millions of times → trained model

The network is only involved during ③. But ③ is bulk-synchronous — it finishes when the slowest GPU-to-GPU path finishes. One slow link stalls all N GPUs every single step.

What a "gradient" is (network engineer translation)

You don't need to understand the calculus. You need to understand what moves on the wire.

A gradient is a vector of numbers — one number per model parameter. A 7B parameter model produces a 7-billion-element gradient vector after every backward pass. In float32, that's 28 GB per GPU per step.

:::tip The BGP analogy
A gradient is to training what a routing update is to BGP. Each GPU computed its own view of "how the model should change", based on the data it saw. To improve the model, everyone needs to agree on the average view. The network carries that consensus process.

BGP: each router knows its local topology → exchange → converge on shared RIB
Training: each GPU sees its local data → AllReduce → converge on shared gradient
:::

The AllReduce takes those 28 GB from every GPU, averages them, and gives every GPU the same result. That averaged gradient is what gets applied to update the model. Now all N GPUs have identical parameters and are ready for the next step.
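What "average them and give every GPU the same result" means can be shown on toy data. The sketch below is plain Python on three made-up 4-element gradients; it illustrates only the input/output contract of an averaging AllReduce, not how NCCL moves the bytes. The function name `allreduce_mean` is hypothetical.

```python
from typing import List

def allreduce_mean(per_gpu_gradients: List[List[float]]) -> List[List[float]]:
    """Average gradients element-wise and hand every GPU the same result."""
    num_gpus = len(per_gpu_gradients)
    # zip(*...) walks the vectors in lockstep: one tuple per parameter position.
    averaged = [sum(values) / num_gpus for values in zip(*per_gpu_gradients)]
    # Every GPU receives its own copy of the identical averaged vector.
    return [list(averaged) for _ in range(num_gpus)]

# Three toy "GPUs", each holding a gradient computed from its own data shard.
local_gradients = [
    [1.0, 2.0, 3.0, 4.0],
    [3.0, 2.0, 1.0, 0.0],
    [2.0, 2.0, 2.0, 2.0],
]
for gpu_id, gradient in enumerate(allreduce_mean(local_gradients)):
    print(gpu_id, gradient)
```

Each GPU starts with a different vector and ends with the same one, which is why the parameters stay identical across the job after the optimizer step.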

The numbers that matter

Before going further, here's the scale that makes this a network problem:

Model	Parameters	Gradient size (fp32)	Gradient size (bf16)
7B (LLaMA-class)	7 billion	28 GB	14 GB
70B	70 billion	280 GB	140 GB
405B (LLaMA 3)	405 billion	1.6 TB	810 GB
~1.8T (GPT-4 estimate)	~1.8 trillion	~7.2 TB	~3.6 TB
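
The gradient sizes in the table are parameters × bytes per element. A minimal sketch of that arithmetic, assuming decimal units (1 GB = 10^9 bytes) as the table does; the function name `gradient_size_gb` is illustrative, not part of any library.

```python
# Bytes per element for the two precisions used in the table.
BYTES_PER_ELEMENT = {"fp32": 4, "bf16": 2}

def gradient_size_gb(num_params: float, dtype: str = "fp32") -> float:
    """Gradient size in decimal gigabytes: one element per model parameter."""
    return num_params * BYTES_PER_ELEMENT[dtype] / 1e9

for name, params in [("7B", 7e9), ("70B", 70e9), ("405B", 405e9)]:
    print(name, gradient_size_gb(params, "fp32"), gradient_size_gb(params, "bf16"))
```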

And the cluster sizes where this runs:

Job scale	GPUs	Nodes (8 GPU/node)	Simultaneous AllReduce flows
Small	64	8	64 × 63 = 4,032
Medium	512	64	512 × 511 = 261,632
Large	4,096	512	4,096 × 4,095 ≈ 16.8M
Hyperscale	32,768+	4,096+	—

Those flows all launch simultaneously. Every step. That's the traffic burst the fabric has to absorb.
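
The flow counts in the table are the full-mesh figure, N × (N − 1): every GPU paired with every other GPU, counted in both directions. The sketch below reproduces that column and the node count; treat it as an upper bound, since the number of connections a job actually opens depends on the collective algorithm in use. Both function names are hypothetical.

```python
def full_mesh_flows(num_gpus: int) -> int:
    """Ordered (sender, receiver) pairs among num_gpus GPUs: N x (N - 1)."""
    return num_gpus * (num_gpus - 1)

def nodes_needed(num_gpus: int, gpus_per_node: int = 8) -> int:
    """Number of nodes required, rounding up to a whole node."""
    return -(-num_gpus // gpus_per_node)

for gpus in (64, 512, 4096):
    print(gpus, nodes_needed(gpus), f"{full_mesh_flows(gpus):,}")
```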

The synchronization problem

Training uses a Bulk Synchronous Parallel (BSP) execution model. In BSP:

1. All N workers compute in parallel (forward + backward pass)
2. All N workers synchronize (barrier)
3. All N workers update their state
4. Repeat

The barrier in step 2 is the AllReduce. It's not done until every GPU's gradient contribution has reached every other GPU. The math:

step_time = compute_time + max(allreduce_time across all paths)

Not the average. Not the median. The maximum. The slowest path sets the pace for every GPU in the job.
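
A small sketch of the step-time formula makes the point concrete. The timings are made-up numbers chosen only to show the effect: seven paths finish in 20 ms, one takes 80 ms, and the step pays for the slow one even though the mean barely moves.

```python
from typing import Sequence

def step_time_ms(compute_ms: float, allreduce_ms_per_path: Sequence[float]) -> float:
    """Step time under BSP: compute plus the slowest AllReduce path."""
    return compute_ms + max(allreduce_ms_per_path)

compute_ms = 400.0                      # hypothetical forward + backward time
healthy_paths = [20.0] * 8              # every path finishes in 20 ms
one_slow_path = [20.0] * 7 + [80.0]     # a single congested path

print(step_time_ms(compute_ms, healthy_paths))    # all paths healthy
print(step_time_ms(compute_ms, one_slow_path))    # one slow path sets the pace
print(sum(one_slow_path) / len(one_slow_path))    # the mean hides the problem
```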

Figure: eight GPUs in a ring (a collective). Seven are marked done in green; one is still working in red, with a congested link drawn as a dashed red line. Below, a timeline shows seven GPUs finishing quickly while the slow GPU's bar stretches much further; the step ends only when the slow one finishes.

One slow link stalls every GPU in the job. Every step.

This is why a 0.1% packet loss rate that TCP handles gracefully is a training-job killer:

TCP: slow start, SACK, retransmit — the connection recovers in tens of milliseconds
RDMA AllReduce: PSN gap → NACK → retransmit → all other GPUs at the barrier wait → step time doubles

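The arithmetic: one 10 ms retransmit stall in a 500 ms step is 10 ÷ 500 = 2% overhead for that step. At 0.1% loss, a step that moves many thousands of packets almost always hits at least one loss on some path, and every GPU waits at the barrier for it, so several stalls can land in the same step. That is how a 2% per-stall overhead ends up looking like a 20% throughput loss.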
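
The two quantities behind that claim can be sketched separately: the cost of one stall relative to the step, and the chance that a step sees at least one loss at all. The second function assumes packet losses are independent, which is a simplification, and the packet count per step is an illustrative number rather than a measurement.

```python
def stall_overhead(stall_ms: float, step_ms: float) -> float:
    """Fraction of a step consumed by a single retransmit stall."""
    return stall_ms / step_ms

def p_step_sees_loss(loss_rate: float, packets_per_step: int) -> float:
    """Probability that at least one packet is lost, assuming independent losses."""
    return 1.0 - (1.0 - loss_rate) ** packets_per_step

print(stall_overhead(10.0, 500.0))          # one 10 ms stall in a 500 ms step
print(p_step_sees_loss(0.001, 10_000))      # 0.1% loss over a hypothetical 10,000 packets
```

With these assumptions, a loss-free step is the exception rather than the rule, which is what turns a small loss rate into a per-step penalty.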

MFU — the number that tells you everything

MFU (Model FLOPs Utilization) is the ratio of achieved training throughput to the throughput the hardware would deliver at its theoretical peak:

MFU = achieved_throughput_tokens_per_second
      ─────────────────────────────────────
      peak_tokens_per_second_at_100%_compute

Healthy numbers on H100 SXM:

55–70% MFU — good. Network is not the bottleneck.
40–55% MFU — investigate. Something is wrong.
< 40% MFU — something is definitely wrong.
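
As a sketch, the ratio and the three bands above translate into a few lines of Python. The thresholds are the ones quoted in this section for H100 SXM; the throughput figures in the example are invented, and the function names are hypothetical.

```python
def mfu(achieved_tokens_per_s: float, peak_tokens_per_s: float) -> float:
    """Achieved throughput as a fraction of the theoretical peak."""
    return achieved_tokens_per_s / peak_tokens_per_s

def classify_mfu(value: float) -> str:
    """Map an MFU fraction onto the bands quoted in this section."""
    if value >= 0.55:
        return "good"
    if value >= 0.40:
        return "investigate"
    return "something is definitely wrong"

# Hypothetical job: 5,800 tokens/s achieved against a 10,000 tokens/s peak.
value = mfu(5_800, 10_000)
print(f"{value:.0%}", classify_mfu(value))
```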

Low MFU has three common causes: slow compute, slow network, or slow storage (data loading). The checks below isolate them in turn: network first, then compute, then storage / data loading.

# Run nccl-tests to measure actual AllReduce bandwidth
# On all nodes simultaneously:
#   -b 1G -e 8G -f 2   test message sizes from 1GB to 8GB, doubling each time
#   -g 8               8 GPUs per node
./build/all_reduce_perf \
  -b 1G -e 8G -f 2 \
  -g 8 \
  --op sum

# Healthy output (8-node, 8×400G NICs):
#   Size      Time       AlgBW     BusBW
#   1073741824  42.78ms  25.09GB/s  44.04GB/s
#
# If BusBW << expected (8 NICs × 400 Gb/s / 8 GPUs = ~400 Gb/s ≈ 50 GB/s per GPU):
# → network is the bottleneck.

# Check which NICs NCCL actually found:
NCCL_DEBUG=INFO \
NCCL_DEBUG_SUBSYS=NET \
python -c "import torch.distributed; ..." 2>&1 | grep "NET/IB"

# Healthy: all 8 NICs listed
# NCCL INFO NET/IB : Using [0]mlx5_0 [1]mlx5_1 [2]mlx5_2 [3]mlx5_3
#                         [4]mlx5_4 [5]mlx5_5 [6]mlx5_6 [7]mlx5_7
#
# Unhealthy: fewer NICs listed — bandwidth cap at partial link count
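
To compare a BusBW reading against link speed, it helps to know how the number is derived. For AllReduce, nccl-tests defines bus bandwidth as algorithm bandwidth × 2(n − 1)/n, where n is the number of ranks. The sketch below applies that definition and converts a link speed from Gb/s to GB/s; the rank count of 8 is an assumption for illustration, and the function names are hypothetical.

```python
def allreduce_algbw_gb_s(size_bytes: float, time_s: float) -> float:
    """Algorithm bandwidth: message size divided by elapsed time, in GB/s."""
    return size_bytes / time_s / 1e9

def allreduce_busbw_gb_s(size_bytes: float, time_s: float, n_ranks: int) -> float:
    """Bus bandwidth for AllReduce: AlgBW x 2(n - 1)/n."""
    return allreduce_algbw_gb_s(size_bytes, time_s) * 2 * (n_ranks - 1) / n_ranks

def gbit_to_gbyte_per_s(gbit_per_s: float) -> float:
    """Link speeds are quoted in bits; nccl-tests reports bytes."""
    return gbit_per_s / 8

# A 1 GiB message that took 42.78 ms, assuming 8 ranks.
print(allreduce_algbw_gb_s(1073741824, 42.78e-3))
print(allreduce_busbw_gb_s(1073741824, 42.78e-3, n_ranks=8))
print(gbit_to_gbyte_per_s(400))   # one 400G NIC expressed in GB/s
```

Keeping the bits-to-bytes conversion explicit avoids the most common misreading of these reports: comparing a GB/s figure directly against a Gb/s link rate.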

# Check GPU utilization during a training step
watch -n 0.5 nvidia-smi

# What healthy looks like:
# +-------------------------------------------------------------------+
# | GPU  Name          Pwr:Usage/Cap |    Memory-Usage | GPU-Util ... |
# |   0  H100 SXM5     350W / 700W  | 79400MiB/81920MiB | 98% ...   |
# |   1  H100 SXM5     345W / 700W  | 79400MiB/81920MiB | 97% ...   |
#
# If GPU-Util drops to 20-40% periodically → the GPU is idle waiting,
# most often for the AllReduce → suspect the network and confirm with the profiler.

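# Measure compute vs communication breakdown:
# PyTorch profiler shows per-op timing.
# model, input, and optimizer come from your training script.
from torch.profiler import profile, ProfilerActivity
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    model(input).backward()
    optimizer.step()
print(prof.key_averages().table(row_limit=20))
# Look for: ncclAllReduce time vs forward/backward time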

# Check if GPUs idle at step boundaries (data loading gaps):
# Profiler will show gaps between training steps.

# Quick check: pin_memory and num_workers
# DataLoader(dataset, num_workers=4, pin_memory=True)

# Check DALI or other prefetch solutions if CPU preprocessing
# is the bottleneck.

# Storage bandwidth check:
iostat -x 1  # on compute nodes
# If await > 10ms or %util > 90% on storage paths → data loading bottleneck

What the network actually sees

During a training step on a 512-GPU, 7B-model job:

Step 1 — Forward pass (no network traffic):

GPU 0: [batch_0] → embedding → attention → FFN → attention → FFN ... → logits
GPU 1: [batch_1] → embedding → attention → FFN → attention → FFN ... → logits
...                  (all 512 GPUs running independently)
Network: idle

Step 2 — Backward pass (no network traffic):

GPU 0: logits → ∂loss/∂FFN_weights → ∂loss/∂attention_weights → ...
GPU 1: logits → ∂loss/∂FFN_weights → ...
...                  (all 512 GPUs computing gradients independently)
Network: idle

Step 3 — AllReduce (the network's moment):

All 512 GPUs simultaneously:
  NCCL posts 261,632 QP send operations (512 × 511)
  Each: RDMA WRITE of one chunk of the 28GB gradient to a peer

What hits the fabric:
  512 × 28GB ÷ (ring steps) worth of RoCEv2 WRITE packets
  All launched within microseconds of each other
  Every GPU sending to every other GPU
  Every ToR switch sees incast from all upstream nodes
  PFC / DCQCN kicks in to prevent buffer overflow

Duration at 8×400G per node (3.2Tb/s = 400GB/s): 28GB × 2 ÷ 400GB/s ≈ 140ms
But with congestion, ECMP imbalance, or PFC events: several times longer

The gap between that ideal figure and the congested one is where the network engineer's job lives.
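
The duration estimate is the standard bandwidth bound for a ring AllReduce: each participant moves about 2(N − 1)/N times the gradient size, divided by the available bandwidth. The sketch below keeps units explicit, since link speeds are quoted in bits per second and gradient sizes in bytes. It ignores latency, protocol overhead, and any overlap with the backward pass, so treat the result as a floor rather than a prediction; the function names are hypothetical.

```python
def gbit_to_bytes_per_s(gbit_per_s: float) -> float:
    """Convert a link speed in Gb/s to bytes per second."""
    return gbit_per_s * 1e9 / 8

def ring_allreduce_seconds(size_bytes: float, n_ranks: int,
                           bandwidth_bytes_per_s: float) -> float:
    """Bandwidth-only lower bound for a ring AllReduce."""
    traffic_bytes = 2 * (n_ranks - 1) / n_ranks * size_bytes
    return traffic_bytes / bandwidth_bytes_per_s

# 8 NICs x 400 Gb/s = 3200 Gb/s per node, which is 400 GB/s.
node_bandwidth = gbit_to_bytes_per_s(8 * 400)
print(node_bandwidth)
print(ring_allreduce_seconds(28e9, n_ranks=512, bandwidth_bytes_per_s=node_bandwidth))
```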

What you should be able to do now

Explain what happens during a training step and when the network is involved
Describe why one slow path stalls all N GPUs (BSP barrier)
Calculate the expected gradient size for a model given its parameter count
Run nccl-tests and interpret the BusBW number vs expected
Identify whether low MFU is compute, network, or storage using the commands above

Where it breaks

1. One NIC running at wrong link speed

A single NIC negotiated 100G instead of 400G. NCCL uses all 8 NICs in a round-robin ring. Every 8th AllReduce chunk flows through the slow NIC at 25% of expected bandwidth. MFU drops to ~75% of healthy.

Detection:

ethtool eth0 | grep Speed
# Should say: Speed: 400000Mb/s
# If it says: Speed: 100000Mb/s → link negotiation failure
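
Checking one NIC by eye does not scale to hundreds of nodes. A minimal sketch of the same check in Python, parsing text in the `Speed: <n>Mb/s` form shown above; how the `ethtool` output is collected from each node is left out, and the sample strings and function names are illustrative.

```python
import re
from typing import Optional

def parse_link_speed_mbps(ethtool_output: str) -> Optional[int]:
    """Return the negotiated speed in Mb/s, or None if no numeric speed is reported."""
    match = re.search(r"Speed:\s*(\d+)Mb/s", ethtool_output)
    return int(match.group(1)) if match else None

def is_degraded(ethtool_output: str, expected_mbps: int = 400_000) -> bool:
    """True if the link is slower than expected or reports no speed at all."""
    speed = parse_link_speed_mbps(ethtool_output)
    return speed is None or speed < expected_mbps

healthy = "Settings for eth0:\n\tSpeed: 400000Mb/s\n\tDuplex: Full\n"
slow = "Settings for eth0:\n\tSpeed: 100000Mb/s\n\tDuplex: Full\n"
print(is_degraded(healthy), is_degraded(slow))
```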

2. `nvidia-peermem` not loaded

Without this kernel module, NCCL cannot DMA from GPU memory directly to the NIC. It falls back to a CPU bounce buffer: GPU HBM → CPU RAM → NIC. Bandwidth drops by 4–5×.

Detection:

lsmod | grep nvidia_peermem
# No output → module not loaded → GPUDirect RDMA broken

# NCCL will log:
# NCCL WARN Cuda memory type: cudaMemoryTypeDevice not supported for GDR

3. NCCL not picking up all NICs

If NCCL_IB_HCA is wrong or the RDMA device plugin didn't mount all NICs into the pod, NCCL falls back to fewer paths. A node with 8 NICs running on 4 gets 50% of expected AllReduce bandwidth.

Detection:

NCCL_DEBUG=INFO python train.py 2>&1 | grep "NET/IB"
# Count the [N]mlx5_X entries — should match your NIC count
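
Counting the entries can be scripted. The sketch below assumes the `[N]mlx5_X` naming shown in this section's sample log line; devices with other names would need a different pattern, and the function name is hypothetical.

```python
import re

def count_nccl_ib_devices(log_text: str) -> int:
    """Count distinct mlx5 devices listed in NCCL's NET/IB log output."""
    return len(set(re.findall(r"\[\d+\](mlx5_\d+)", log_text)))

sample_log = (
    "NCCL INFO NET/IB : Using [0]mlx5_0 [1]mlx5_1 [2]mlx5_2 [3]mlx5_3 "
    "[4]mlx5_4 [5]mlx5_5 [6]mlx5_6 [7]mlx5_7"
)
expected_nics = 8
found = count_nccl_ib_devices(sample_log)
print(found, "OK" if found == expected_nics else "MISSING NICs")
```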

4. PFC storm stalling one node's egress

One GPU node's NIC is misconfigured (DCQCN disabled, MTU mismatch, GID index wrong). Its gradient traffic is not properly rate-controlled or classified, so the ToR switch buffers fill. The switch sends PFC PAUSE back toward the sender. The sender stops. AllReduce stalls for the duration of the pause. All 511 other GPUs wait.

Detection:

ethtool -S eth0 | grep -i pause
# tx_pause_ctrl_phy: 0       ← sending PAUSE frames (bad if non-zero)
# rx_pause_ctrl_phy: 12847   ← receiving PAUSE (this NIC is being paused)
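
A single counter reading only says that pauses happened at some point since the counters were last reset. What matters is whether the counters are growing while the job runs. The sketch below diffs two snapshots of `ethtool -S` text; the counter names are the ones shown above, and the sample values and function names are made up for illustration.

```python
from typing import Dict

def parse_ethtool_stats(text: str) -> Dict[str, int]:
    """Parse 'name: value' lines from ethtool -S output into a dict."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        if sep and value.isdigit():   # skips headers such as 'NIC statistics:'
            stats[name.strip()] = int(value)
    return stats

def pause_deltas(before: str, after: str) -> Dict[str, int]:
    """Growth of every pause-related counter between two snapshots."""
    old, new = parse_ethtool_stats(before), parse_ethtool_stats(after)
    return {name: new[name] - old.get(name, 0) for name in new if "pause" in name}

snapshot_1 = "NIC statistics:\n     tx_pause_ctrl_phy: 0\n     rx_pause_ctrl_phy: 12847\n"
snapshot_2 = "NIC statistics:\n     tx_pause_ctrl_phy: 0\n     rx_pause_ctrl_phy: 13900\n"
print(pause_deltas(snapshot_1, snapshot_2))
```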

One RCA — 60% MFU on a 512-GPU job

Symptoms:
A 512-GPU LLaMA-70B training job ran at 58% MFU for 11 hours. Expected was 64%. The team assumed the job was "just running slow" and left it.

What was actually happening:
NCCL logs showed one node (node-047) consistently appearing as the last to complete AllReduce — every step, 40–60ms slower than the median. nccl-tests run on node-047 showed 180 GB/s AllReduce BusBW instead of the expected 380 GB/s.

Root cause:
node-047 had been reimaged after a hardware swap. The NIC firmware config (mlxconfig) that sets ROCE_CC_PRIO_MASK_P1=255 (enabling DCQCN on all priorities) was not applied post-reimage. DCQCN was disabled on that node's NICs. During AllReduce, node-047 sent gradient data at full rate with no rate control. Its egress port at the ToR switch hit 95% utilization. The ToR sent PFC PAUSE frames. Node-047 stopped transmitting. By the time it resumed, all other nodes were waiting at the barrier.

Timeline:

T=0ms    node-047 sends AllReduce data at full line rate
T=2ms    ToR port buffer hits xoff threshold
T=2ms    ToR sends PFC PAUSE to node-047
T=2ms    node-047 NIC stops transmitting
T=34ms   PFC pause quanta expire, node-047 resumes
T=34ms   AllReduce for all 512 GPUs completes
         (28ms slower than it should have been)
T=34ms   All 512 GPUs move to optimizer step
T=34ms   Repeat next step

28ms extra per step × 8,000 steps per hour = 224 seconds of stall per wall-clock hour. All 512 GPUs sit idle for those 224 seconds, which is roughly 32 GPU-hours lost every hour.
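
The cost arithmetic is worth keeping as a reusable sketch, since it converts a per-step delay into the GPU-hours figure that budget owners respond to. The inputs are the numbers from this incident; the function name is hypothetical.

```python
from typing import Tuple

def stall_cost(extra_ms_per_step: float, steps_per_hour: int,
               num_gpus: int) -> Tuple[float, float]:
    """Return (wall-clock seconds lost per hour, GPU-hours lost per hour)."""
    wall_seconds = extra_ms_per_step * steps_per_hour / 1000
    gpu_hours = wall_seconds * num_gpus / 3600
    return wall_seconds, gpu_hours

wall_seconds, gpu_hours = stall_cost(28, 8_000, 512)
print(f"{wall_seconds:.0f} s of stall per hour, {gpu_hours:.1f} GPU-hours lost per hour")
```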

Fix:

# On every node, after any reimage:
mlxconfig -d /dev/mst/mt4125_pciconf0 set \
  ROCE_CC_PRIO_MASK_P1=255 \
  ROCE_CC_PRIO_MASK_P2=255

# Validate before submitting any job:
./build/all_reduce_perf -b 1G -e 8G -f 2 -g 8
# Run on every node. Flag any node where BusBW < 90% of median.
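
The "below 90% of median" rule is easy to automate once each node's BusBW has been collected. A minimal sketch, assuming the results are already in a dict keyed by node name; the values below are invented apart from node-047's 180 GB/s reading, and the function name is hypothetical.

```python
from statistics import median
from typing import Dict, List

def flag_slow_nodes(busbw_gb_s: Dict[str, float], threshold: float = 0.90) -> List[str]:
    """Return nodes whose BusBW is below threshold x the median across nodes."""
    cutoff = threshold * median(busbw_gb_s.values())
    return sorted(node for node, bw in busbw_gb_s.items() if bw < cutoff)

results = {
    "node-045": 381.0,
    "node-046": 379.5,
    "node-047": 180.0,   # the node from this incident
    "node-048": 380.2,
}
print(flag_slow_nodes(results))
```

Comparing against the median rather than a fixed number keeps the check valid across clusters with different NIC counts or link speeds.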

Lesson: Validate every node individually before a large job. One bad node costs the entire cluster. Add nccl-tests to your compute node provisioning runbook.

What's next

The AllReduce operation above — the one the network has to carry — is implemented by NCCL, which runs on top of RDMA. Before you can understand what goes wrong and how to fix it, you need to understand what RDMA is, how it bypasses the kernel, and what a Queue Pair is.

Next → Collectives — the network's job

Or, if you want to jump ahead, RDMA → explains the mechanics of how the bytes actually move.
