When Training Slows

When MFU drops, the fabric is the loudest suspect — but rarely the culprit if PFC and ECN counters are clean.

The hardest production incidents are the ones where the network is fine and something else is broken. The reflex you need is "rule the fabric out fast, then look elsewhere." Here's the order.

After this page, you'll be able to

Read MFU as your fabric readout — recognize that below 60% on a healthy cluster something is wrong, and adjust the bar for MoE models that run sparse by design.
Walk the four-step diagnosis ladder — compute or communication (nvidia-smi util at 97% vs. dropping to 10%), NCCL config or fabric, which fabric component, which node — without skipping steps.
Run the canonical checks — nccl-tests all_reduce_perf for ~380 GB/s busBW on 8×400G, NCCL_DEBUG=INFO | grep NET/IB for all 8 NICs, and ethtool -S for link speed, pause/pfc, and cnp/ecn counters.
Catch the RoCE v1 trap — spot GID index 0 in NCCL logs, set NCCL_IB_GID_INDEX=3 for RoCE v2, and put show_gids | grep "RoCE v2" in the pre-flight checklist.

MFU in one sentence

Model FLOP Utilization = fraction of theoretical peak GPU compute actually doing useful work.

ML engineers compute it; you read it:

MFU	Health	What you do
65–72%	Excellent	Record as baseline
55–65%	Good	Watch trends, no action
45–55%	Investigate	Diagnosis ladder below
35–45%	Significant issue	Network OR storage hurting you
< 35%	Severe	Something fundamental is wrong

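Rule of thumb: below 60% on a healthy cluster, something is wrong, even though 55–60% still sits in the table's Good band. Compare against your recorded baseline, not only the band.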

(MoE models often run lower MFU by design — sparse compute. Adjust expectations.)
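To make the number concrete, here is a minimal sketch of how an MFU figure might be produced and then read against the table. It assumes the common approximation of about 6 FLOPs per parameter per token for dense transformer training; this page does not say how your ML team computes MFU, so treat the formula as an assumption. The function names and the lower-inclusive band boundaries are illustrative choices.

```python
def estimate_mfu(n_params, tokens_per_sec, n_gpus, peak_flops_per_gpu):
    """Approximate MFU for dense transformer training.

    Assumes ~6 FLOPs per parameter per token (forward + backward),
    a common approximation that ignores attention FLOPs.
    """
    achieved = 6.0 * n_params * tokens_per_sec
    return achieved / (n_gpus * peak_flops_per_gpu)


# Lower bound of each band from the table, as a fraction. Treating each
# boundary as lower-inclusive is an assumption; the table does not say.
BANDS = [
    (0.65, "Excellent"),
    (0.55, "Good"),
    (0.45, "Investigate"),
    (0.35, "Significant issue"),
    (0.00, "Severe"),
]


def mfu_health(mfu):
    """Map an MFU fraction onto the health column of the table."""
    for lower, label in BANDS:
        if mfu >= lower:
            return label
    return "Severe"


if __name__ == "__main__":
    mfu = estimate_mfu(n_params=1e9, tokens_per_sec=1e5,
                       n_gpus=1, peak_flops_per_gpu=1e15)
    print(f"MFU {mfu:.0%}: {mfu_health(mfu)}")
```

For MoE models the parameter count should be the active parameters per token, which is one reason their headline MFU reads lower.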

The diagnosis ladder

Work top-down. Each step rules out a class of cause before going deeper.

1. Is it compute or communication?
2. If communication — is it NCCL config, or the fabric?
3. If the fabric — which component? (NIC, ToR, spine, ECMP)
4. If a NIC — which node?

Don't skip steps. Most "the fabric is slow" tickets end at step 1.

Step 1 — Compute or communication?

The fastest signal you can read in five seconds:

watch -n 0.1 "nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader"
# AMD equivalent:
watch -n 0.1 "rocm-smi --showuse"

Steady at 97–98% → compute-bound. Your network is fine. Look at the model.
Drops to ~10% periodically → communication is the bottleneck. Duration of low utilization = duration of AllReduce wait.

That's it. One command separates "your problem" from "not your problem."
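If you would rather log the signal than watch it, the following sketch parses the same nvidia-smi output and classifies a series of samples for one GPU. The `busy` and `idle` thresholds are assumptions chosen to bracket the 97% and ~10% figures above; they are not tuned values.

```python
def parse_util(text):
    """Parse `nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader`
    output (one line per GPU, e.g. "97 %") into a list of ints."""
    values = []
    for line in text.splitlines():
        line = line.strip().rstrip("%").strip()
        if line.isdigit():
            values.append(int(line))
    return values


def classify(samples, busy=90, idle=20):
    """Classify utilization samples collected over time for one GPU.

    `busy` and `idle` are illustrative thresholds, not values from this page.
    """
    if not samples:
        return "no data"
    dips = sum(1 for s in samples if s <= idle)
    if dips > 0:
        share = dips / len(samples)
        return f"communication-bound ({share:.0%} of samples idle)"
    if min(samples) >= busy:
        return "compute-bound"
    return "inconclusive"
```

The idle share is a rough proxy for how much of each step is spent waiting on AllReduce, provided the sampling interval is much shorter than a training step.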

Step 2 — NCCL config or fabric?

Check NCCL FIRST before blaming the fabric. The most common cause of 50% AllReduce bandwidth is misconfiguration, not silicon.

Are all 8 NICs visible to NCCL?

NCCL_DEBUG=INFO python train.py 2>&1 | grep "NET/IB"
# Should show all 8:
# NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE ... [7]mlx5_7:1/RoCE
# Only 4 → NCCL_IB_HCA env var is wrong → bandwidth capped at 50%
# On AMD: NCCL_DEBUG=INFO works unchanged — RCCL reuses every NCCL_* var.
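Counting devices by eye is error-prone on a long log line. As a sketch, the snippet below extracts the device entries from the first `NET/IB : Using` line and compares the count to the expected 8. The regular expression is written against the entry shape shown above (`[0]mlx5_0:1/RoCE`); if your NCCL version formats the line differently, it will need adjusting.

```python
import re

# Matches entries such as "[0]mlx5_0:1/RoCE" in the NET/IB line.
ENTRY = re.compile(r"\[(\d+)\]([\w.-]+):(\d+)/(\w+)")


def nccl_ib_devices(log_text):
    """Return device names from the first "NET/IB : Using" line in the log."""
    for line in log_text.splitlines():
        if "NET/IB" in line and "Using" in line:
            return [m.group(2) for m in ENTRY.finditer(line)]
    return []


def check_nics(log_text, expected=8):
    """Return (ok, devices); ok is True when exactly `expected` devices appear."""
    found = nccl_ib_devices(log_text)
    return len(found) == expected, found
```

Feed it the captured output of the `NCCL_DEBUG=INFO` run; a result of `(False, [...4 names...])` points at `NCCL_IB_HCA`.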

Is GPUDirect RDMA active?

NCCL_DEBUG=INFO python ... 2>&1 | grep -i "gdr\|peermem"
# Must NOT see: "Cuda memory type ... not supported for GDR"
# If seen: modprobe nvidia_peermem
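The two GDR signals (kernel module loaded, no GDR complaint in the NCCL log) can be checked together. This is a small sketch that takes both inputs as text so it can run anywhere; the function name is hypothetical, and the log match relies only on the "not supported for GDR" fragment quoted above.

```python
def gdr_problems(nccl_log, proc_modules):
    """Return a list of GPUDirect RDMA problems found in the two inputs.

    nccl_log:     captured NCCL_DEBUG=INFO output
    proc_modules: contents of /proc/modules (the file lsmod reads)
    """
    problems = []
    loaded = {line.split()[0] for line in proc_modules.splitlines() if line.strip()}
    if "nvidia_peermem" not in loaded:
        problems.append("nvidia_peermem not loaded (try: modprobe nvidia_peermem)")
    if any("not supported for GDR" in line for line in nccl_log.splitlines()):
        problems.append("NCCL reports a memory type not supported for GDR")
    return problems
```

On a live host, pass `open("/proc/modules").read()` as the second argument.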

Then run the canonical benchmark:

mpirun -np 512 --hostfile hosts ./build/all_reduce_perf -b 1G -e 8G -f 2
# Expected for 8×400G NICs: ~380 GB/s busBW
# 200 GB/s → NIC or GDR problem (back to NCCL config)
# 380 GB/s with slow training → it's not the fabric

If nccl-tests hits 380 GB/s but training is still slow, the network is fine. Go check the model, the data loader, the host. Stop here.
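The 380 GB/s figure follows from simple arithmetic: 8 × 400 Gb/s is 3200 Gb/s, or 400 GB/s of line rate, so ~380 GB/s is roughly 95% of it, and ~200 GB/s is what you would expect with half the NICs in use. The sketch below works through that and also shows the AllReduce busBW definition that nccl-tests documents (algbw × 2(n−1)/n). The `tolerance` and the 55% "about half" cutoff in `verdict` are illustrative assumptions.

```python
def line_rate_gbytes(n_nics=8, gbits_per_nic=400):
    """Aggregate NIC line rate in GB/s (8 bits per byte)."""
    return n_nics * gbits_per_nic / 8


def allreduce_busbw(size_bytes, seconds, n_ranks):
    """busBW as nccl-tests defines it for AllReduce:
    algbw * 2 * (n - 1) / n, with algbw = size / time. Result in GB/s."""
    algbw = size_bytes / seconds / 1e9
    return algbw * 2 * (n_ranks - 1) / n_ranks


def verdict(busbw, expected=380.0, tolerance=0.10):
    """Interpret a measured busBW. Thresholds are illustrative assumptions."""
    if busbw >= expected * (1 - tolerance):
        return "fabric OK: look at model, data loader, host"
    if busbw <= expected * 0.55:
        return "about half of expected: recheck NIC visibility and GDR"
    return "below expected: continue to fabric checks"
```

A reading such as 270 GB/s lands in the middle bucket: too high for a missing-NIC problem, too low to call the fabric clean.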

Step 3 — Which fabric component?

NCCL is fine and nccl-tests is slow. Now you're looking at the fabric. Four checks, in order:

Link speeds — any NIC below 400G?

for n in 0 1 2 3 4 5 6 7; do ethtool eth$n | grep Speed; done
# Speed: 100000Mb/s on any NIC → that rail runs at 25% of line rate (node aggregate ~9% lower) and can pace the whole collective
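To scan many nodes, the speed check can be scripted. The following sketch parses the `Speed:` line from `ethtool <iface>` output and reports which interfaces are below 400G, along with the fraction of aggregate line rate the node still has. Interface names and function names are placeholders.

```python
import re

SPEED = re.compile(r"Speed:\s*(\d+)Mb/s")


def parse_speed_mbps(ethtool_output):
    """Extract the link speed in Mb/s from `ethtool <iface>` output, or None."""
    m = SPEED.search(ethtool_output)
    return int(m.group(1)) if m else None


def slow_links(speeds_mbps, expected_mbps=400_000):
    """speeds_mbps: {iface: Mb/s or None}.
    Returns (interfaces below expected speed, fraction of aggregate line rate left)."""
    slow = {name: s for name, s in speeds_mbps.items()
            if s is None or s < expected_mbps}
    total = sum(s or 0 for s in speeds_mbps.values())
    return slow, total / (expected_mbps * len(speeds_mbps))
```

One NIC of eight at 100G leaves about 91% of the aggregate, but the slow rail itself runs at a quarter of line rate, which is usually what the collective feels.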

PFC events — sustained or spiking?

ethtool -S eth0 | grep -E "pause|pfc"
# tx_pause_ctrl_phy sustained in thousands/sec = PFC storm or DCQCN misconfig
# A few/sec is normal backpressure
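`ethtool -S` reports cumulative counters, so "thousands/sec" only means something as a difference between two snapshots. A minimal sketch of that calculation follows; the 1000/sec threshold is borrowed from the triage card on this page, and the function names are illustrative.

```python
def parse_counters(ethtool_s_output):
    """Parse `ethtool -S <iface>` lines of the form "name: value" into a dict."""
    counters = {}
    for line in ethtool_s_output.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        if sep and value.isdigit():
            counters[name.strip()] = int(value)
    return counters


def rate_per_sec(before, after, interval_s, name="tx_pause_ctrl_phy"):
    """Per-second rate of one counter between two snapshots."""
    return (after[name] - before[name]) / interval_s


def pfc_verdict(rate, storm_threshold=1000):
    """Classify a pause-frame rate using the triage card's 1000/sec bar."""
    if rate >= storm_threshold:
        return "PFC storm or DCQCN misconfig"
    return "normal backpressure"
```

Take the two snapshots a few seconds apart under load; a rate computed on an idle fabric tells you nothing.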

ECN/CNP — is congestion getting marked?

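ethtool -S eth0 | grep -E "cnp|ecn"
# np_cnp_sent climbing → this NIC sits in a congested path
# rp_cnp_handled climbing → this NIC is being asked to slow down
# If nothing matches, check /sys/class/infiniband/<dev>/ports/1/hw_counters/,
# where mlx5 drivers commonly expose these counters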

Spine balance — ECMP working?

On each spine: show interfaces counters rates
If spine-1 at 90% and spine-4 at 30%: hash polarization or elephant collision
Mitigation: NCCL_IB_QPS_PER_CONNECTION=4 (more entropy)
Or enable adaptive routing if your fabric supports it.
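Once you have per-spine utilization figures, the balance check is a one-liner. The sketch below flags a spread wider than a chosen threshold; the 20-point `max_spread` default is an assumption for illustration, not a value from this page, and the spine names are placeholders.

```python
def spine_imbalance(util_pct, max_spread=20.0):
    """util_pct: {spine_name: uplink utilization in percent}.

    Returns (spread, busiest, quietest, suspicious), where suspicious is
    True when the spread exceeds max_spread percentage points.
    """
    busiest = max(util_pct, key=util_pct.get)
    quietest = min(util_pct, key=util_pct.get)
    spread = util_pct[busiest] - util_pct[quietest]
    return spread, busiest, quietest, spread > max_spread
```

With the 90% versus 30% example above, the spread is 60 points and the result is flagged as suspicious.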

Step 4 — Which node?

Fabric looks globally healthy but training shows one rank consistently slow. Add per-rank timing:

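# In the training script
import time
import numpy as np
import torch
import torch.distributed as dist

# Line up all ranks and drain pending GPU work so only the AllReduce is timed
dist.barrier(); torch.cuda.synchronize(); start = time.perf_counter()
dist.all_reduce(tensor, op=dist.ReduceOp.SUM)  # tensor: an existing CUDA tensor
torch.cuda.synchronize(); elapsed = time.perf_counter() - start

# Collect every rank's timing and report the outlier from rank 0
timings = [None] * dist.get_world_size()
dist.all_gather_object(timings, elapsed)
if dist.get_rank() == 0:
    print(f"slowest rank: {int(np.argmax(timings))} at "
          f"{max(timings):.3f}s vs median {np.median(timings):.3f}s")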

When one rank is 2–5× slower than median, that node has the problem. Check its NICs, its ToR port, its GPU temps. Drain it if it can't be fixed live.
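The 2–5× rule can be applied to the gathered timings directly. This sketch lists every rank at or above a chosen multiple of the median and maps it to a node index. The rank-to-node mapping assumes ranks are packed 8 per node in order, which is common but not guaranteed; check your launcher's layout.

```python
from statistics import median


def find_stragglers(timings, ratio=2.0):
    """timings: per-rank seconds, indexed by rank.

    Returns [(rank, seconds, multiple_of_median)] for ranks whose time is
    at least `ratio` times the median.
    """
    med = median(timings)
    if med <= 0:
        return []
    return [(rank, t, t / med) for rank, t in enumerate(timings) if t >= ratio * med]


def node_of(rank, gpus_per_node=8):
    """Map a global rank to a node index, assuming ranks are packed
    `gpus_per_node` per node in order (an assumption about the launcher)."""
    return rank // gpus_per_node
```

Run it over several iterations before draining anything: a node that is slowest once is noise, a node that is slowest every time is a finding.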

One real RCA — 58% MFU, clean fabric, one env var

A 128-GPU cluster came up at 58% MFU. Expected: 65%. nccl-tests showed 270 GB/s vs expected 380.

What looked fine on the first pass:

GDR active (lsmod | grep nvidia_peermem confirmed)
All 8 NICs visible to NCCL
PFC counters clean (< 10 pauses/sec)
Spine utilization balanced

What was missed:

NCCL_DEBUG=INFO python ... 2>&1 | grep "GID"
# NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE GID index 0 (type: RoCE v1)
#                                            ^^^^^^^^^^^^^^^^^^^^^^^^

GID index 0 is RoCE v1 (non-routable, link-local). On these hosts RoCE v2 was index 3; the index depends on each host's GID table, so confirm it rather than assume. The deployment runbook left the default. Traffic worked within the subnet but DCQCN signals weren't carried correctly, so PFC fired more often than expected, just not enough to show in pause counters.

Fix:

export NCCL_IB_GID_INDEX=3

Result: MFU went from 58% to 64%. One env var, six MFU points, a week of training saved.

Lesson: Validate GID index in your pre-flight checklist. show_gids | grep "RoCE v2" belongs in every runbook.
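For a scripted pre-flight check, the same information can be read from sysfs, where the kernel exposes each GID and its type per device and port. The sketch below lists the RoCE v2 entries and picks the IPv4-mapped one. The function names are hypothetical, and the `sysfs_root` parameter exists only so the code can be exercised against a test directory.

```python
import os


def rocev2_gid_indices(dev, port=1, sysfs_root="/sys/class/infiniband"):
    """List (index, gid) on dev/port whose type is "RoCE v2" and GID is non-zero."""
    base = os.path.join(sysfs_root, dev, "ports", str(port))
    types_dir = os.path.join(base, "gid_attrs", "types")
    found = []
    for name in sorted(os.listdir(types_dir), key=int):
        try:
            with open(os.path.join(types_dir, name)) as f:
                gid_type = f.read().strip()
            with open(os.path.join(base, "gids", name)) as f:
                gid = f.read().strip()
        except OSError:
            continue  # unpopulated table entries can fail to read
        # Skip all-zero GIDs: any character other than "0" or ":" means in use
        if gid_type == "RoCE v2" and set(gid) - set("0:"):
            found.append((int(name), gid))
    return found


def ipv4_rocev2_index(entries):
    """Pick the first RoCE v2 entry whose GID is an IPv4-mapped address."""
    for index, gid in entries:
        if gid.startswith("0000:0000:0000:0000:0000:ffff:"):
            return index
    return None
```

Comparing the returned index with the `NCCL_IB_GID_INDEX` value in the job environment turns the lesson above into an automated gate.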

The 60-second triage card

When the page fires:

□ nvidia-smi GPU util dropping to ~10%?  (AMD: rocm-smi --showuse)
  → AllReduce bottleneck → step 2
  → Steady at 97% → not the network

□ nccl-tests busBW below ~380 GB/s?
  → ~200 GB/s → NIC or GDR problem → NCCL config checks (step 2)
  → NCCL config clean, still slow → step 3
  → ~380 GB/s with slow training → not the network

□ All 8 NICs visible to NCCL? (NCCL_DEBUG=INFO | grep NET/IB)
□ GPUDirect RDMA on? (lsmod | grep nvidia_peermem)
□ All NICs at 400G? (ethtool | grep Speed)
□ PFC counters under 1000/sec? (ethtool -S | grep pause)
□ Spine utilization balanced? (show interfaces counters)
□ One rank consistently slowest? (per-rank timing)

All green and MFU still bad? Hand it off to the ML team. The problem is in the model, the data loader, or the host, not the fabric.
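The card can also be sketched as a single function that takes the observations you gathered and returns whatever is not green. The dictionary keys are hypothetical names, and the 10% margin on busBW is an assumption; the other thresholds come from the card.

```python
def triage(obs, expected_busbw=380.0):
    """obs: dict of observations gathered by hand or by script.

    Returns a list of findings; an empty list means every box is green.
    The 10% busBW margin is an illustrative assumption.
    """
    if not obs["gpu_util_dips"]:
        return ["GPU util steady: not the network"]
    checks = [
        (obs["nics_visible"] < 8, "fewer than 8 NICs visible to NCCL"),
        (not obs["gdr_on"], "GPUDirect RDMA off"),
        (obs["busbw_gbs"] < 0.9 * expected_busbw, "nccl-tests busBW below baseline"),
        (obs["min_link_gbps"] < 400, "a NIC linked below 400G"),
        (obs["pfc_per_sec"] >= 1000, "PFC pauses at or above 1000/sec"),
        (not obs["spines_balanced"], "spine utilization unbalanced"),
        (obs["slow_rank"] is not None, f"rank {obs['slow_rank']} consistently slowest"),
    ]
    return [message for failed, message in checks if failed]
```

Fed the first-pass observations from the RCA above (everything green except 270 GB/s busBW), it returns a single finding, which is the cue to keep digging in NCCL config instead of closing the ticket.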

💡 What you should remember

#		Concept	Why it matters
1	📉	MFU is your fabric readout.	Below 60% on a healthy cluster = something is wrong.
2	🪜	Diagnosis ladder, top-down.	Compute or comm → NCCL or fabric → which component → which node.
3	🧠	Most "fabric is slow" tickets end at step 1.	GPU util steady at 97% = your network is fine, look elsewhere.
4	📊	`nccl-tests` is the canonical baseline.	~380 GB/s on 8 × 400G NICs. Less = NIC or NCCL config.
5	🏷️	GID index 3 = RoCE v2.	Validate it in pre-flight. The dumbest, most common config bug.

Next: RoCE v2 Operator Cheatsheet → the one-page command reference for the box: NIC inventory, SR-IOV, GID, DCQCN, perftest, NCCL vars, and the 2-minute health check.
