When Training Slows
When MFU drops, the fabric is the loudest suspect — but rarely the culprit if PFC and ECN counters are clean.
The hardest production incidents are the ones where the network is fine and something else is broken. The reflex you need is "rule the fabric out fast, then look elsewhere." Here's the order.
MFU in one sentence
Model FLOP Utilization = fraction of theoretical peak GPU compute actually doing useful work.
ML engineers compute it; you read it:
| MFU | Health | What you do |
|---|---|---|
| 65–72% | Excellent | Record as baseline |
| 55–65% | Good | Watch trends, no action |
| 45–55% | Investigate | Diagnosis ladder below |
| 35–45% | Significant issue | Network OR storage hurting you |
| < 35% | Severe | Something fundamental is wrong |
Rule of thumb: below 60% on a healthy cluster, something is wrong.
(MoE models often run lower MFU by design — sparse compute. Adjust expectations.)
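For context on where the number comes from, here is a minimal sketch of how MFU is commonly estimated for a dense transformer. It assumes the widely used approximation of about 6 FLOPs per parameter per trained token (forward plus backward, attention terms ignored). The function names, the `peak_flops_per_gpu` argument, and the treatment of values above 72% are illustrative assumptions; the health bands are the ones from the table above.

```python
def mfu(params, tokens_per_sec, num_gpus, peak_flops_per_gpu):
    """Estimate Model FLOP Utilization for dense transformer training.

    Assumes ~6 FLOPs per parameter per token (forward + backward).
    peak_flops_per_gpu is the vendor's theoretical peak for the dtype in use.
    """
    achieved = 6.0 * params * tokens_per_sec
    peak = num_gpus * peak_flops_per_gpu
    return achieved / peak


def health(mfu_value):
    """Map an MFU fraction (0..1) onto the bands from the table."""
    if mfu_value >= 0.65:
        return "Excellent"  # the table tops out at 72%; higher is treated the same here
    if mfu_value >= 0.55:
        return "Good"
    if mfu_value >= 0.45:
        return "Investigate"
    if mfu_value >= 0.35:
        return "Significant issue"
    return "Severe"


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to make the arithmetic easy to follow.
    value = mfu(params=1e9, tokens_per_sec=1e5, num_gpus=1, peak_flops_per_gpu=1e15)
    print(f"MFU = {value:.0%} -> {health(value)}")
```

The ML team's own calculation may count FLOPs differently (for example, including attention), so compare trends against the baseline they recorded rather than against this estimate.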
The diagnosis ladder
Work top-down. Each step rules out a class of cause before going deeper.
1. Is it compute or communication?
2. If communication: is it NCCL config, or the fabric?
3. If the fabric: which component? (NIC, ToR, spine, ECMP)
4. If a NIC: which node?
Don't skip steps. Most "the fabric is slow" tickets end at step 1.
Step 1 — Compute or communication?
The fastest signal you can read in five seconds:
watch -n 0.1 "nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader"
- Steady at 97–98% → compute-bound. Your network is fine. Look at the model.
- Drops to ~10% periodically → communication is the bottleneck. Duration of low utilization = duration of AllReduce wait.
That's it. One command separates "your problem" from "not your problem."
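If you would rather log the signal than eyeball it, the same check can be sketched in Python. This is a rough illustration, not a tuned detector: the `low`, `high`, and `stall_fraction` thresholds are assumptions, and `sample_gpu_util` simply wraps the `nvidia-smi` query shown above with `nounits` added so the output parses as integers.

```python
import subprocess


def sample_gpu_util():
    """Return current utilization (%) for each GPU, as reported by nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [int(value) for value in out.split()]


def classify(samples, low=20, high=90, stall_fraction=0.05):
    """Classify a series of utilization samples (%) taken from one GPU.

    low, high and stall_fraction are illustrative thresholds, not tuned values.
    Returns (label, fraction_of_samples_at_or_below_low).
    """
    if not samples:
        raise ValueError("no samples")
    stalled = sum(1 for s in samples if s <= low) / len(samples)
    if stalled >= stall_fraction:
        return "communication-bound", stalled
    if min(samples) >= high:
        return "compute-bound", stalled
    return "inconclusive", stalled
```

Collect samples for one GPU over a few training steps (for example, call `sample_gpu_util()[0]` in a loop with a short sleep) and pass the list to `classify`. The stalled fraction is a rough proxy for the share of time spent waiting on AllReduce.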
Step 2 — NCCL config or fabric?
Check NCCL FIRST before blaming the fabric. The most common cause of 50% AllReduce bandwidth is misconfiguration, not silicon.
Are all 8 NICs visible to NCCL?
NCCL_DEBUG=INFO python train.py 2>&1 | grep "NET/IB"
# Should show all 8:
# NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE ... [7]mlx5_7:1/RoCE
# Only 4 → NCCL_IB_HCA env var is wrong → bandwidth capped at 50%
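The NIC-count check above can be automated in a pre-flight script. The sketch below assumes the log entries look like the `[0]mlx5_0:1/RoCE` pattern shown in the sample output; the regular expression and function names are illustrative, and other NCCL versions may format the line differently.

```python
import re

# Matches entries such as "[0]mlx5_0:1/RoCE" in the "NET/IB : Using" line.
NIC_ENTRY = re.compile(r"\[(\d+)\]([\w.-]+):(\d+)/(RoCE|IB)")


def nics_seen_by_nccl(log_text):
    """Return the sorted HCA names found in NCCL's 'NET/IB : Using' lines."""
    devices = set()
    for line in log_text.splitlines():
        if "NET/IB" in line and "Using" in line:
            devices.update(m.group(2) for m in NIC_ENTRY.finditer(line))
    return sorted(devices)


def check_nic_count(log_text, expected=8):
    """Return (ok, devices); ok is False when NCCL sees fewer HCAs than expected."""
    devices = nics_seen_by_nccl(log_text)
    return len(devices) >= expected, devices
```

Feed it the captured `NCCL_DEBUG=INFO` output from one rank; a result such as `(False, [...4 names...])` points straight at `NCCL_IB_HCA`.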
Is GPUDirect RDMA active?
NCCL_DEBUG=INFO python ... 2>&1 | grep -i "gdr\|peermem"
# Must NOT see: "Cuda memory type ... not supported for GDR"
# If seen: modprobe nvidia_peermem
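A small sketch of the same two GDR checks in Python, for use in a health-check script. It assumes a Linux host where loaded modules are listed in `/proc/modules`, and it looks for the warning text quoted above; both helper names are hypothetical.

```python
def module_loaded(name, modules_file="/proc/modules"):
    """Return True if a kernel module with exactly this name is loaded."""
    with open(modules_file) as f:
        return any(line.split()[0] == name for line in f if line.strip())


def gdr_warning_present(log_text):
    """Return True if the NCCL log contains the 'not supported for GDR' warning."""
    return any("not supported for GDR" in line for line in log_text.splitlines())
```

`module_loaded("nvidia_peermem")` should be true and `gdr_warning_present(log)` false before you move on to the benchmark.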
Then run the canonical benchmark:
mpirun -np 512 --hostfile hosts ./build/all_reduce_perf -b 1G -e 8G -f 2
# Expected for 8×400G NICs: ~380 GB/s busBW
# 200 GB/s → NIC or GDR problem (back to NCCL config)
# 380 GB/s with slow training → it's not the fabric
If nccl-tests hits 380 GB/s but training is still slow, the network is fine. Go check the model, the data loader, the host. Stop here.
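The arithmetic behind those numbers is worth having at hand. Eight 400 Gb/s NICs give 400 GB/s of line rate per node, so ~380 GB/s is about 95% of the ceiling. nccl-tests reports bus bandwidth for AllReduce as algorithm bandwidth times 2(n-1)/n. The sketch below encodes both; the `verdict` cut-offs (0.9 and 0.6 of expected) are illustrative assumptions, not values from nccl-tests.

```python
def line_rate_gbytes(num_nics=8, gbits_per_nic=400):
    """Aggregate NIC line rate per node in GB/s (8 bits per byte)."""
    return num_nics * gbits_per_nic / 8


def allreduce_bus_bw(size_bytes, seconds, n_ranks):
    """Bus bandwidth in GB/s as nccl-tests defines it for AllReduce.

    busBW = algBW * 2 * (n - 1) / n, where algBW = size / time.
    """
    alg_bw = size_bytes / seconds / 1e9
    return alg_bw * 2 * (n_ranks - 1) / n_ranks


def verdict(bus_bw, expected=380.0):
    """Rough reading of measured busBW against the expected figure.

    The 0.9 and 0.6 cut-offs are illustrative assumptions.
    """
    ratio = bus_bw / expected
    if ratio >= 0.9:
        return "fabric OK"
    if ratio <= 0.6:
        return "check NIC visibility / GDR"
    return "degraded: keep digging"
```

For example, `verdict(200)` lands in the NIC/GDR bucket, while a reading like 270 GB/s falls in between and deserves a closer look at the NCCL log.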
Step 3 — Which fabric component?
NCCL is fine and nccl-tests is slow. Now you're looking at the fabric. Four checks, in order:
Link speeds — any NIC below 400G?
for n in 0 1 2 3 4 5 6 7; do ethtool eth$n | grep Speed; done
# Speed: 100000Mb/s on any NIC → that rail runs at 25% of line rate and drags the node's collective bandwidth down
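To run the link-speed check across many nodes, the `ethtool` output can be parsed rather than read by eye. A minimal sketch, assuming the `Speed: <n>Mb/s` line format shown above; the function names and the 400000 Mb/s default are illustrative.

```python
import re

SPEED_RE = re.compile(r"Speed:\s*(\d+)Mb/s")


def parse_speed_mbps(ethtool_output):
    """Extract link speed in Mb/s from `ethtool <iface>` output, or None."""
    m = SPEED_RE.search(ethtool_output)
    return int(m.group(1)) if m else None


def slow_links(speeds_by_iface, expected_mbps=400_000):
    """Return interfaces whose negotiated speed is below expected, or unknown."""
    return sorted(
        iface for iface, mbps in speeds_by_iface.items()
        if mbps is None or mbps < expected_mbps
    )
```

Build `speeds_by_iface` by calling `parse_speed_mbps` on the output for each interface; an empty list from `slow_links` means every link negotiated full speed.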
PFC events — sustained or spiking?
ethtool -S eth0 | grep -E "pause|pfc"
# tx_pause_ctrl_phy sustained in thousands/sec = PFC storm or DCQCN misconfig
# A few/sec is normal backpressure
ECN/CNP — is congestion getting marked?
ethtool -S eth0 | grep -E "cnp|ecn"
# np_cnp_sent climbing → this NIC sits in a congested path
# rp_cnp_handled climbing → this NIC is being asked to slow down
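Both the PFC and the CNP checks hinge on rates, not absolute values, so take two snapshots and divide by the interval. The sketch below assumes the counters appear as `name: value` lines, as in `ethtool -S` output; the 1000/sec threshold mirrors the "thousands/sec" guidance above, and the helper names are hypothetical.

```python
import re

COUNTER_RE = re.compile(r"^\s*([\w.-]+):\s*(\d+)\s*$")


def parse_counters(ethtool_s_output):
    """Parse `ethtool -S <iface>` output into {counter_name: value}."""
    counters = {}
    for line in ethtool_s_output.splitlines():
        m = COUNTER_RE.match(line)
        if m:
            counters[m.group(1)] = int(m.group(2))
    return counters


def rates(before, after, interval_s):
    """Per-second rate for every counter present in both snapshots."""
    return {
        name: (after[name] - before[name]) / interval_s
        for name in after
        if name in before
    }


def pfc_verdict(pause_per_sec, storm_threshold=1000):
    """A few pauses/sec is normal; sustained thousands/sec is not."""
    if pause_per_sec >= storm_threshold:
        return "PFC storm or DCQCN misconfig"
    return "normal backpressure"
```

Take the snapshots ten or more seconds apart so that a single burst does not dominate the rate.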
Spine balance — ECMP working?
On each spine, run: show interfaces counters rates
If spine-1 sits at 90% and spine-4 at 30%, suspect hash polarization or elephant-flow collisions.
Mitigation: set NCCL_IB_QPS_PER_CONNECTION=4 (more queue pairs give ECMP more entropy to hash on),
or enable adaptive routing if your fabric supports it.
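Once the per-spine utilization numbers are collected, the imbalance check is simple arithmetic. A minimal sketch: the input is a hypothetical mapping from spine name to uplink utilization in percent, and the 20-point `max_spread` is an illustrative threshold rather than a vendor recommendation.

```python
def ecmp_imbalance(util_by_spine):
    """Return (spread, hottest, coldest) for a {spine: utilization %} mapping.

    spread is the difference in percentage points between the busiest
    and the least busy spine.
    """
    if len(util_by_spine) < 2:
        raise ValueError("need at least two spines")
    hot = max(util_by_spine, key=util_by_spine.get)
    cold = min(util_by_spine, key=util_by_spine.get)
    return util_by_spine[hot] - util_by_spine[cold], hot, cold


def is_polarized(util_by_spine, max_spread=20.0):
    """True when the spread exceeds max_spread (an illustrative threshold)."""
    spread, _, _ = ecmp_imbalance(util_by_spine)
    return spread > max_spread
```

With the example from the text (spine-1 at 90%, spine-4 at 30%), the spread is 60 points and `is_polarized` returns true.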
Step 4 — Which node?
Fabric looks globally healthy but training shows one rank consistently slow. Add per-rank timing:
# In the training script
import time
import numpy as np
import torch
import torch.distributed as dist

dist.barrier()                       # line the ranks up before timing
start = time.perf_counter()
dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
torch.cuda.synchronize()             # wait for the collective to finish on this GPU
elapsed = time.perf_counter() - start

timings = [None] * dist.get_world_size()
dist.all_gather_object(timings, elapsed)
if dist.get_rank() == 0:
    print(f"slowest rank: {np.argmax(timings)} at "
          f"{max(timings):.3f}s vs median {np.median(timings):.3f}s")
When one rank is 2–5× slower than median, that node has the problem. Check its NICs, its ToR port, its GPU temps. Drain it if it can't be fixed live.
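The "2–5× slower than median" rule can be applied to the gathered timings directly. A minimal sketch using only the standard library; the `factor=2.0` default follows the lower bound in the text, and the function name is hypothetical.

```python
import statistics


def stragglers(timings, factor=2.0):
    """Return [(rank, seconds, ratio_to_median)] for ranks slower than factor x median."""
    median = statistics.median(timings)
    if median <= 0:
        raise ValueError("median timing must be positive")
    return [
        (rank, t, t / median)
        for rank, t in enumerate(timings)
        if t > factor * median
    ]
```

A single iteration is noisy, so it is safer to feed this per-rank averages over several steps and act only on ranks that show up repeatedly.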
One real RCA — 58% MFU, clean fabric, one env var
A 128-GPU cluster came up at 58% MFU. Expected: 65%. nccl-tests showed 270 GB/s vs expected 380.
What looked fine on the first pass:
- GDR active (lsmod | grep nvidia_peermem confirmed)
- All 8 NICs visible to NCCL
- PFC counters clean (< 10 pauses/sec)
- Spine utilization balanced
What was missed:
NCCL_DEBUG=INFO python ... 2>&1 | grep "GID"
# NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE GID index 0 (type: RoCE v1)
# ^^^^^^^^^^^^^^^^^^^^^^^^
GID index 0 is RoCE v1 (non-routable, link-local). On this cluster RoCE v2 was index 3; the mapping can differ between hosts, so confirm it with show_gids. The deployment runbook left the default. Traffic worked within the subnet, but DCQCN signals weren't carried correctly, so PFC fired more often than expected, just not enough to show in pause counters.
Fix:
export NCCL_IB_GID_INDEX=3
Result: MFU went from 58% to 64%. One env var, six MFU points, a week of training saved.
Lesson: Validate GID index in your pre-flight checklist. show_gids | grep "RoCE v2" belongs in every runbook.
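A pre-flight check along these lines can be scripted. The sketch below assumes the usual Linux sysfs layout, where each GID table entry's type is exposed at `<dev>/ports/<port>/gid_attrs/types/<index>`. The helper names are hypothetical, and the fallback to index 0 when `NCCL_IB_GID_INDEX` is unset simply mirrors the incident above, so check what your NCCL version actually defaults to.

```python
import os


def gid_type(dev, port, index, root="/sys/class/infiniband"):
    """Read the type string (e.g. 'RoCE v2') for one GID table entry from sysfs.

    Assumes the layout <root>/<dev>/ports/<port>/gid_attrs/types/<index>.
    """
    path = os.path.join(root, dev, "ports", str(port),
                        "gid_attrs", "types", str(index))
    with open(path) as f:
        return f.read().strip()


def gid_is_rocev2(dev, port=1, index=None, root="/sys/class/infiniband"):
    """Check whether the GID index NCCL is configured to use is RoCE v2.

    When index is None, NCCL_IB_GID_INDEX is read from the environment,
    falling back to 0 (an assumption matching the incident above).
    """
    if index is None:
        index = int(os.environ.get("NCCL_IB_GID_INDEX", "0"))
    return "v2" in gid_type(dev, port, index, root)
```

Run `gid_is_rocev2` for every HCA on every node before the job starts, and fail the launch if any of them returns false.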
The 60-second triage card
When the page fires:
□ nvidia-smi GPU util dropping to ~10%?
→ AllReduce bottleneck → step 2
→ Steady at 97% → not the network
□ nccl-tests busBW < 200 GB/s?
→ NIC or GDR problem → back to step 2 (NCCL config)
→ ~380 GB/s with slow training → not the network
□ All 8 NICs visible to NCCL? (NCCL_DEBUG=INFO | grep NET/IB)
□ GPUDirect RDMA on? (lsmod | grep nvidia_peermem)
□ All NICs at 400G? (ethtool | grep Speed)
□ PFC counters under 1000/sec? (ethtool -S | grep pause)
□ Spine utilization balanced? (show interfaces counters)
□ One rank consistently slowest? (per-rank timing)
All green and MFU still bad? Hand it off to the ML team. The problem is in the model, the data loader, or the host, not the network.
What you should remember
- MFU < 60% on a healthy cluster = something is wrong.
- Diagnosis ladder, top-down. Compute or comm → NCCL or fabric → which component → which node.
- Most "fabric is slow" tickets end at step 1. GPU util steady at 97% = your network is fine, look elsewhere.
- nccl-tests is the canonical baseline. ~380 GB/s on 8 × 400G NICs. Less = NIC or NCCL config.
- GID index 3 = RoCE v2. Validate it in pre-flight. The dumbest, most common config bug.
Next: GPU & Server Hardware. The machine on the other end of every AllReduce, and why GPUs reshaped the network around them.