
MFU and Diagnosis

:::info Bridge: what you already know
When a link goes to 60% utilization without explanation, you don't blame the transceiver and move on; you check interface counters, trace the path, look at the queue. MFU is training's equivalent of interface utilization. When it drops, you work through a checklist. This page is that checklist.
:::


What MFU is

MFU (Model FLOPs Utilization) is the fraction of the GPUs' theoretical peak compute that goes into the FLOPs the model actually requires.

MFU = (tokens_per_second × FLOPs_per_token) / (peak_FLOPs_per_GPU × GPU_count)

For a transformer model:
FLOPs_per_token ≈ 6 × model_parameters (a standard approximation: ~2 FLOPs per parameter in the forward pass, ~4 in the backward pass; attention FLOPs are ignored)

Example: 7B model, 512 H100s, measuring 7,500,000 tokens/second throughput
FLOPs_per_token = 6 × 7,000,000,000 = 42 × 10^9
Achieved FLOPs = 7,500,000 × 42 × 10^9 = 3.15 × 10^17 = 315 PFLOP/s
Peak per H100 = 989 × 10^12 FLOP/s (bf16, no sparsity; 1,979 × 10^12 is the with-sparsity figure)
Peak total = 989 × 10^12 × 512 ≈ 506 PFLOP/s
MFU = 315 / 506 ≈ 0.62 = 62%
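The same arithmetic can be wrapped in a small helper so it runs against any job. This is a minimal sketch: the function name `mfu` is hypothetical, and the inputs (7B parameters, 512 GPUs, 7.5M tokens/s, 989 TFLOP/s dense bf16 peak per GPU) are illustrative assumptions, not measurements.

```python
def mfu(tokens_per_second: float, params: float,
        peak_flops_per_gpu: float, gpu_count: int) -> float:
    """Return Model FLOPs Utilization as a fraction between 0 and 1."""
    flops_per_token = 6 * params              # ~2N forward + ~4N backward
    achieved = tokens_per_second * flops_per_token
    peak = peak_flops_per_gpu * gpu_count
    return achieved / peak


# Illustrative inputs (assumed, not measured)
example = mfu(tokens_per_second=7.5e6, params=7e9,
              peak_flops_per_gpu=989e12, gpu_count=512)
print(f"MFU = {example:.1%}")
```

Use the dense peak for the precision you actually train in; plugging in the with-sparsity figure halves the reported MFU.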

Reference numbers for H100 SXM:

| MFU | Assessment | Likely cause |
| --- | --- | --- |
| 65–72% | Excellent | Network and compute are healthy |
| 55–65% | Good | Minor overhead; acceptable |
| 45–55% | Investigate | Something is wrong; worth diagnosing |
| 35–45% | Significant issue | Network or storage bottleneck likely |
| < 35% | Severe | Job is wasting most of your budget |

Note: MFU targets shift with model architecture. MoE models often run at lower MFU by design (sparse compute); for them, count only the parameters active per token when computing FLOPs_per_token.
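For dashboards or alerting, the reference bands can be encoded directly. A minimal sketch, assuming the H100 SXM bands listed above and assigning boundary values to the higher band (that tie-break is an assumption); `assess_mfu` is a hypothetical name.

```python
def assess_mfu(mfu: float) -> str:
    """Map an MFU fraction (0..1) to the assessment bands for H100 SXM."""
    if mfu >= 0.65:
        return "Excellent"
    if mfu >= 0.55:
        return "Good"
    if mfu >= 0.45:
        return "Investigate"
    if mfu >= 0.35:
        return "Significant issue"
    return "Severe"


for value in (0.70, 0.62, 0.50, 0.40, 0.30):
    print(f"{value:.0%}: {assess_mfu(value)}")
```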


The diagnosis ladder

When MFU is low, work through this in order. Each step rules out a cause before you go deeper.

Step 1: Is it compute or communication?
Step 2: If communication — is it NCCL, or the fabric?
Step 3: If the fabric — which component? (NIC, ToR, spine, PFC, ECMP)
Step 4: If a NIC — which node? Which interface?

Step 1 — Is it compute or communication?

# Method 1: PyTorch profiler breakdown
# Run for a few steps with the profiler enabled.
# model, batch, labels, criterion, and optimizer come from your training script.

import torch.profiler

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    record_shapes=True,
    with_stack=True,
) as prof:
    for step in range(10):
        optimizer.zero_grad()
        output = model(batch)
        loss = criterion(output, labels)
        loss.backward()
        optimizer.step()

print(prof.key_averages().table(
    sort_by="cuda_time_total", row_limit=20
))

# Look for:
# ncclAllReduceRingLLKernel ← AllReduce time (newer NCCL builds name these ncclDevKernel_AllReduce_*)
# volta_sgemm / ampere_h16 ← matrix multiply (compute); kernel names vary by GPU architecture
# If NCCL > 30% of total time: communication is likely the bottleneck

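The 30% rule from Method 1 can be applied mechanically once the per-kernel totals are read off the profiler table. A minimal sketch, assuming you have copied kernel names and total CUDA times into a dict; `nccl_share` and the sample values are hypothetical.

```python
def nccl_share(kernel_times_us: dict[str, float]) -> float:
    """Fraction of total kernel time spent in kernels whose name contains 'nccl'."""
    total = sum(kernel_times_us.values())
    if total == 0:
        return 0.0
    comm = sum(t for name, t in kernel_times_us.items()
               if "nccl" in name.lower())
    return comm / total


# Hypothetical totals in microseconds, as read off the profiler table
sample = {
    "ncclAllReduceRingLLKernel": 400.0,
    "ampere_h16_gemm": 500.0,
    "elementwise_kernel": 100.0,
}
share = nccl_share(sample)
verdict = "communication-bound" if share > 0.30 else "compute-bound"
print(f"NCCL share: {share:.0%} ({verdict})")
```

When communication overlaps with compute, summed kernel time overstates the cost of NCCL, so treat the ratio as an upper bound.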
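# Method 2: GPU idle time observation
watch -n 0.1 "nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader"

# Healthy: 97 98 97 98 97 98 (high and stable, compute bound)
# Unhealthy: 97 12 98 11 97 13 (drops to ~10% periodically, waiting for AllReduce)

# The periodic drop to low utilization is the AllReduce wait.
# Duration of drop ≈ AllReduce duration.
# If AllReduce takes 200ms, expect roughly 200ms at ~10% GPU utilization every step.
# Caveat: utilization.gpu is averaged over a sampling window and counts any running
# kernel (NCCL kernels included), so short waits can be smoothed out or hidden.
# Treat this as a coarse signal and confirm with the profiler.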
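If the utilization samples are logged to a file instead of watched by eye, the periodic-dip pattern can be quantified. A minimal sketch, assuming one `97 %`-style value per line as printed by `--format=csv,noheader`; `parse_util`, `dip_fraction`, and the 30% threshold are hypothetical choices.

```python
def parse_util(line: str) -> int:
    """Parse one nvidia-smi utilization sample such as '97 %' or '97'."""
    return int(line.replace("%", "").strip())


def dip_fraction(samples: list[int], low_threshold: int = 30) -> float:
    """Fraction of samples that fall below low_threshold percent."""
    if not samples:
        return 0.0
    return sum(1 for s in samples if s < low_threshold) / len(samples)


healthy = [97, 98, 97, 98, 97, 98]
unhealthy = [97, 12, 98, 11, 97, 13]
print("healthy dips:  ", dip_fraction(healthy))
print("unhealthy dips:", dip_fraction(unhealthy))
```

A dip fraction well above zero that repeats at the step interval points toward waiting on collectives; a flat low value points elsewhere, such as input starvation.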

Step 2 — Is it NCCL configuration or the fabric?

If AllReduce is the bottleneck, first check whether NCCL is configured correctly before blaming the fabric.

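# Check 1: Are all NICs visible to NCCL?
# Run under your usual launcher (torchrun, srun, mpirun) so rank and world size are set.
NCCL_DEBUG=INFO \
NCCL_DEBUG_SUBSYS=NET \
python -c "
import torch
import torch.distributed as dist
dist.init_process_group('nccl')
dist.barrier()  # NCCL initializes lazily; a collective forces the NET lines to be logged
dist.destroy_process_group()
" 2>&1 | grep "NET/IB"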

# Expected output (8 NICs):
# NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE
# [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE
# [4]mlx5_4:1/RoCE [5]mlx5_5:1/RoCE
# [6]mlx5_6:1/RoCE [7]mlx5_7:1/RoCE
#
# If only 4 NICs shown: NCCL_IB_HCA is wrong, or RDMA device plugin
# didn't mount all devices → AllReduce bandwidth is capped at 50%

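# Check 2: Is GDR (GPUDirect RDMA) active?
NCCL_DEBUG=INFO python ... 2>&1 | grep -i "gdr\|peermem\|cuda memory"
# Good sign: channel lines that end in /GDRDMA (for example "via NET/IB/0/GDRDMA")
# Should NOT see: "Cuda memory type: cudaMemoryTypeDevice not supported for GDR"
# If you see that, or no GDRDMA lines at all: check that the nvidia-peermem module is loaded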

# Check 3: NCCL channel count (more channels = more parallelism)
NCCL_DEBUG=INFO python ... 2>&1 | grep "Channel"
# Should see 8–16 channels for an 8-NIC node
# Fewer channels = less parallelism = lower bandwidth utilization
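Check 1 above is easy to automate by counting the `[N]mlx5_X` entries in the captured log. A minimal sketch, assuming the log follows the `[index]name:port/transport` layout shown in the expected output; `nccl_nics` is a hypothetical helper.

```python
import re


def nccl_nics(log_text: str) -> list[str]:
    """Return HCA names listed as [N]name:port/transport in NCCL NET/IB output."""
    return re.findall(r"\[\d+\](\w+):\d+/\w+", log_text)


sample_log = """
NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE
[2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE
[4]mlx5_4:1/RoCE [5]mlx5_5:1/RoCE
[6]mlx5_6:1/RoCE [7]mlx5_7:1/RoCE
"""
nics = nccl_nics(sample_log)
print(f"{len(nics)} NICs visible: {nics}")
```

Comparing `len(nics)` against the expected NIC count per node turns the visual check into a pre-flight assertion.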
# Baseline measurement: run this before every production job.
cd /opt/nccl-tests # or wherever nccl-tests is installed

# Run on ALL nodes simultaneously (use pdsh, mpirun, or srun).
# Comments cannot follow a trailing backslash, so the flags are explained here:
# -np 512: total GPU count       -b 1G / -e 8G: start at 1GB, end at 8GB
# -f 2: double each step         -g 1: 1 GPU per process
mpirun -np 512 \
    --hostfile hostfile \
    -x NCCL_IB_HCA=mlx5_0,mlx5_1,... \
    ./build/all_reduce_perf \
    -b 1G \
    -e 8G \
    -f 2 \
    -g 1 \
    --op sum

# Key column: "busbw" (bus bandwidth in GB/s)
# Line rate for 8×400G NICs: 8 × 400 Gb/s = 3,200 Gb/s = 400 GB/s
# Expected: ~380–400 GB/s
# If actual is 200 GB/s: NIC problem or GDR not working
# If actual is 380 GB/s but training is slow: compute or storage bottleneck
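The expected figure comes from the NIC line rate, so it helps to compute it rather than memorize it. A minimal sketch, assuming bus bandwidth is compared against the node's aggregate NIC line rate; the function names are hypothetical.

```python
def line_rate_gbytes_per_s(nic_count: int, nic_gbits_per_s: float) -> float:
    """Aggregate NIC line rate of one node in GB/s (8 bits per byte)."""
    return nic_count * nic_gbits_per_s / 8


def bus_bw_ratio(measured_gbytes_per_s: float, nic_count: int = 8,
                 nic_gbits_per_s: float = 400) -> float:
    """Measured bus bandwidth as a fraction of the aggregate line rate."""
    return measured_gbytes_per_s / line_rate_gbytes_per_s(nic_count, nic_gbits_per_s)


print("line rate:", line_rate_gbytes_per_s(8, 400), "GB/s")
for measured in (380, 270, 200):
    print(f"{measured} GB/s -> {bus_bw_ratio(measured):.0%} of line rate")
```

A ratio near 50% is the signature of half the NICs missing or GDR being off; ratios in between usually mean fabric congestion.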

Step 3 — Isolate which fabric component

If nccl-tests shows lower-than-expected bandwidth, you're now looking at the fabric.

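# Every NIC on every node (interface names eth0..eth7 assumed; adjust to your naming):
for i in $(seq 0 7); do
    speed=$(ethtool eth$i 2>/dev/null | grep Speed)
    echo "eth$i: $speed"
done

# Expected: Speed: 400000Mb/s
# If any shows 100000Mb/s: the link negotiated down
# → Check cable, transceiver, switch port config
# → A single NIC at 100G removes ~9% of that node's aggregate line rate (300 of 3,200 Gb/s),
#   and the AllReduce hit can be larger because the slowest channel gates the collective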
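Across hundreds of nodes the speed check is easier to run as a script that flags outliers. A minimal sketch, assuming `ethtool` prints a `Speed: <N>Mb/s` line and that 400000 Mb/s is the expected rate; the helper names and sample data are hypothetical.

```python
import re
from typing import Optional


def parse_speed_mbps(ethtool_output: str) -> Optional[int]:
    """Extract the link speed in Mb/s from ethtool output, or None if unknown."""
    match = re.search(r"Speed:\s*(\d+)Mb/s", ethtool_output)
    return int(match.group(1)) if match else None


def slow_links(speeds: dict[str, Optional[int]],
               expected_mbps: int = 400_000) -> list[str]:
    """Interfaces that are down, unknown, or below the expected speed."""
    return sorted(name for name, speed in speeds.items()
                  if speed is None or speed < expected_mbps)


# Hypothetical per-interface ethtool output
outputs = {
    "eth0": "\tSpeed: 400000Mb/s\n",
    "eth1": "\tSpeed: 100000Mb/s\n",
    "eth2": "\tSpeed: Unknown!\n",
}
speeds = {name: parse_speed_mbps(text) for name, text in outputs.items()}
print("Links needing attention:", slow_links(speeds))
```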

Check for PFC events

# On the node's NICs:
ethtool -S eth0 | grep -E "pause|pfc"

# Key counters (Mellanox ConnectX):
# tx_pause_ctrl_phy: N ← this NIC is SENDING pause frames upstream
# (this NIC's RX is congested)
# rx_pause_ctrl_phy: N ← this NIC is RECEIVING pause frames
# (the switch is telling this NIC to stop)
# Per-priority PFC counts appear as rx_prio<N>_pause / tx_prio<N>_pause;
# watch the priority that carries RoCE traffic.

# Nonzero values during a training job are normal (mild backpressure)
# Sustained high values (thousands per second) = PFC storm or DCQCN misconfiguration

# On the switch (Arista EOS):
show interfaces ethernet 1/1 pfc statistics
# Tx Pause: N ← switch sent pause to this server NIC
# Rx Pause: N ← switch received pause from this server NIC
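The counters are cumulative, so "thousands per second" only becomes visible as a rate between two snapshots. A minimal sketch, assuming `ethtool -S` prints `name: value` lines; `parse_counters`, `pause_rate`, and the snapshot values are hypothetical.

```python
import re


def parse_counters(ethtool_stats: str) -> dict[str, int]:
    """Parse 'name: value' lines from `ethtool -S` output into a dict."""
    counters = {}
    for line in ethtool_stats.splitlines():
        match = re.match(r"\s*([\w.]+):\s*(\d+)\s*$", line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


def pause_rate(before: dict[str, int], after: dict[str, int],
               interval_s: float, name: str = "rx_pause_ctrl_phy") -> float:
    """Pause frames per second between two snapshots of one counter."""
    return (after[name] - before[name]) / interval_s


# Hypothetical snapshots taken 10 seconds apart
snap_a = parse_counters("NIC statistics:\n     rx_pause_ctrl_phy: 1000\n")
snap_b = parse_counters("NIC statistics:\n     rx_pause_ctrl_phy: 61000\n")
rate = pause_rate(snap_a, snap_b, interval_s=10.0)
print(f"rx pause rate: {rate:.0f}/s")
```

Sampling during a training step rather than between jobs matters, since the counters stay flat while the fabric is idle.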

Check for ECN / DCQCN events

# On the NIC:
ethtool -S eth0 | grep -E "ecn|cnp|roce_cc"
# If that prints nothing, the RoCE congestion counters are usually under sysfs instead:
# /sys/class/infiniband/mlx5_0/ports/1/hw_counters/

# Key Mellanox counters:
# np_ecn_marked_roce_packets: N ← packets arriving with ECN CE mark
# np_cnp_sent: N ← CNPs (Congestion Notification Packets) this NIC sent
# rp_cnp_ignored: N ← CNPs received but ignored (should stay at 0 when congestion control is enabled)
# rp_cnp_handled: N ← CNPs received and rate was reduced

# High np_cnp_sent: this NIC is frequently receiving CE-marked packets
# → congestion in the fabric toward this NIC
# High rp_cnp_handled: this NIC is frequently slowing down
# → this NIC is sending into a congested path (or is misidentified as the source)
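For fleet-wide collection it can be simpler to read the counters as files. A minimal sketch, assuming the mlx5 driver exposes them under `/sys/class/infiniband/<dev>/ports/<port>/hw_counters/` (verify that layout on your nodes); `read_cc_counters` is a hypothetical helper, and `root` is a parameter so the function can be pointed at a test directory.

```python
from pathlib import Path

CC_COUNTERS = (
    "np_ecn_marked_roce_packets",
    "np_cnp_sent",
    "rp_cnp_handled",
    "rp_cnp_ignored",
)


def read_cc_counters(dev: str, port: int = 1,
                     root: str = "/sys/class/infiniband") -> dict[str, int]:
    """Read the congestion-control counters that exist for one device port."""
    base = Path(root) / dev / "ports" / str(port) / "hw_counters"
    values = {}
    for name in CC_COUNTERS:
        counter_file = base / name
        if counter_file.is_file():
            values[name] = int(counter_file.read_text().strip())
    return values


if __name__ == "__main__":
    # Prints an empty dict on machines without this device
    print(read_cc_counters("mlx5_0"))
```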

Check ECMP balance (per-spine utilization)

# On each spine switch (Arista EOS):
show interfaces counters rates | grep Ethernet

# Compare TX rates across all ports connected to leaf switches.
# If Spine1 shows 90% utilization and Spine3 shows 10%:
# → ECMP hash polarization or elephant flow collision
# → Enable adaptive routing if available
# → Increase QPs per connection: NCCL_IB_QPS_PER_CONNECTION=4
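Eyeballing a column of rates works for four spines but not forty. A minimal sketch of an imbalance score, assuming you have collected one utilization percentage per spine uplink; `ecmp_imbalance` and the 1.5 alert threshold are hypothetical choices.

```python
from statistics import mean


def ecmp_imbalance(util_pct: list[float]) -> float:
    """Ratio of the busiest uplink to the mean; 1.0 means perfectly balanced."""
    average = mean(util_pct)
    return max(util_pct) / average if average > 0 else 1.0


# Hypothetical per-spine TX utilization in percent
spines = [90, 50, 10, 50]
score = ecmp_imbalance(spines)
print(f"imbalance = {score:.2f}", "(unbalanced)" if score > 1.5 else "(ok)")
```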

Step 4 — Identify the bad node

When the fabric looks healthy but one rank is always slow, find the bad node:

# NCCL rank timing: add to your training script.
import time

import numpy as np
import torch
import torch.distributed as dist

# Payload to reduce: 256 MiB of fp32 is an arbitrary choice; size it like your gradient buckets.
# Assumes torch.cuda.set_device(local_rank) has already been called.
tensor = torch.ones(64 * 1024 * 1024, device="cuda")

dist.barrier()
torch.cuda.synchronize()  # exclude work queued before the measurement
start = time.perf_counter()
dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

# Gather timing from all ranks (every rank receives the full list):
timings = [None] * dist.get_world_size()
dist.all_gather_object(timings, elapsed)

if dist.get_rank() == 0:
    arr = np.array(timings)
    print(f"AllReduce time: mean={arr.mean():.3f}s, "
          f"max={arr.max():.3f}s, "
          f"p99={np.percentile(arr, 99):.3f}s")
    slow_rank = int(arr.argmax())
    print(f"Slowest rank: {slow_rank} ({arr.max():.3f}s vs mean {arr.mean():.3f}s)")

# A single sample is noisy; repeat over many iterations.
# If one rank is consistently 2–5× slower than median:
# That rank's node has the problem. Go check its NICs and switch port.
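"Consistently" is the important word, so the gathered timings are worth keeping per iteration and comparing against the median. A minimal sketch, assuming one list of per-rank timings per iteration as produced by the gather above; `consistently_slow`, the 2× factor, and the 80% cutoff are hypothetical choices.

```python
from statistics import median


def consistently_slow(timings_by_iter: list[list[float]],
                      factor: float = 2.0,
                      min_fraction: float = 0.8) -> list[int]:
    """Ranks slower than factor × median in at least min_fraction of iterations."""
    world_size = len(timings_by_iter[0])
    hits = [0] * world_size
    for timings in timings_by_iter:
        threshold = factor * median(timings)
        for rank, seconds in enumerate(timings):
            if seconds > threshold:
                hits[rank] += 1
    iterations = len(timings_by_iter)
    return [rank for rank in range(world_size)
            if hits[rank] / iterations >= min_fraction]


# Hypothetical timings: 3 iterations × 4 ranks, rank 2 is the straggler
history = [
    [0.10, 0.11, 0.30, 0.10],
    [0.10, 0.10, 0.28, 0.11],
    [0.12, 0.10, 0.35, 0.10],
]
print("Suspect ranks:", consistently_slow(history))
```

Using the median rather than the mean keeps one extreme straggler from raising its own threshold.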

The checklist — print this

When MFU is below target:

□ GPU utilization during training? (nvidia-smi)
→ Drops to ~10% periodically? → AllReduce is the bottleneck
→ Stuck near 100%? → Compute-bound; the network is not the limiter
→ Low without a periodic pattern? → GPUs are starved; check data loading / storage

□ nccl-tests bandwidth on all nodes?
→ Expected for 8×400G NICs: ~380 GB/s bus bandwidth
→ Below 200 GB/s: NIC or GDR problem

□ All 8 NICs visible to NCCL? (NCCL_DEBUG=INFO | grep NET/IB)
→ Count the [N]mlx5_X entries

□ GPUDirect RDMA enabled? (NCCL_DEBUG=INFO | grep GDR)
→ lsmod | grep nvidia_peermem

□ NIC link speeds? (ethtool ethX | grep Speed)
→ All 400000Mb/s?

□ PFC pause counters? (ethtool -S ethX | grep pause)
→ Sustained thousands/sec = investigate

□ ECN/DCQCN events? (ethtool -S ethX | grep cnp)
→ High np_cnp_sent = this NIC is in congested path

□ One rank consistently slowest? (add rank timing to script)
→ Find the bad node, check its port on the ToR

□ Spine utilization balanced? (switch show interfaces counters)
→ Unbalanced → ECMP problem → more QPs or adaptive routing

What you should be able to do now

  • Calculate MFU for a training job given tokens/second and model size
  • Determine whether low MFU is compute, communication, or storage
  • Run nccl-tests and interpret the result vs expected bandwidth
  • Read PFC and DCQCN counters from a Mellanox NIC
  • Find the slowest rank in a distributed training job

Where it breaks — the most common MFU killers

| Symptom | Root cause | Fix |
| --- | --- | --- |
| MFU 50%, GPU util drops to 10% periodically | AllReduce is slow | Work through diagnosis ladder above |
| nccl-tests shows 50% expected BW | GDR not working | modprobe nvidia-peermem, verify with NCCL_DEBUG |
| nccl-tests shows 50% expected BW (GDR fine) | Only 4 of 8 NICs in use | Fix NCCL_IB_HCA env var or RDMA device plugin |
| One node is always the slowest rank | NIC misconfigured (wrong speed, DCQCN off) | Compare mlxconfig query output across nodes, ethtool speed check |
| MFU varies 30% step to step | ECMP imbalance (elephant collision) | More QPs per connection, enable adaptive routing |
| MFU fine then drops after 6 hours | GPU temperature throttling | Check nvidia-smi -q -d TEMPERATURE, verify cooling |
| MFU fine, job crashes every 2–3 hours | NCCL timeout (hardware transient) | Tune NCCL_IB_TIMEOUT and the process-group timeout, check switch error logs |

One RCA — 58% MFU tracked to one mismatched GID index

Symptoms: A newly-deployed 128-GPU cluster ran at 58% MFU. Expected was 65%. nccl-tests showed 270 GB/s bus bandwidth vs expected 380 GB/s.

Investigation:

# nccl-tests pointed to network
# GDR checked out (nvidia-peermem loaded, NCCL confirmed GDR active)
# All 8 NICs present in NCCL_DEBUG output
# PFC counters: clean (< 10 pauses/second)
# Spine utilization: balanced

# What was missed initially:
NCCL_DEBUG=INFO python ... 2>&1 | grep "GID"
# NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE GID index 0 (type: RoCE v1)
# ← GID index 0 is RoCEv1 (not routable across subnets)
# ← Traffic was working (same subnet) but not optimal

Root cause: The deployment runbook set NCCL_IB_GID_INDEX=0 (default) instead of NCCL_IB_GID_INDEX=3 (RoCEv2). The RoCEv1 GID worked within a subnet, but its path selection and ECMP behavior differ from RoCEv2. Specifically, RoCEv1 frames carry no IP header, so switches cannot ECN-mark them and DCQCN never engages. Congestion was left to PFC alone, which fired more often than expected, just not often enough to look "obvious" in the pause counters.

Fix:

# On all nodes, in the job submission script:
export NCCL_IB_GID_INDEX=3

# Verify the GID type (the RoCE v2 index can differ per node, so check rather than assume 3):
show_gids | grep mlx5
# Expected line:
# DEV PORT INDEX GID IPv4 VER DEV
# mlx5_0 1 3 0000:0000:0000:0000:0000:ffff:c0a8:010a 192.168.1.10 v2 eth0

Result: MFU went from 58% to 64% after changing one environment variable.

Lesson: Always validate the GID index in your provisioning runbook. Add show_gids | grep v2 to your pre-flight checklist and confirm the index it reports matches NCCL_IB_GID_INDEX.
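The pre-flight check can also run without parsing command output. A minimal sketch, assuming the kernel exposes each GID's type as a text file under `/sys/class/infiniband/<dev>/ports/<port>/gid_attrs/types/<index>` containing a string such as `RoCE v2` (verify on your nodes); `gid_type` and `is_rocev2` are hypothetical helpers.

```python
from pathlib import Path
from typing import Optional


def gid_type(dev: str, index: int, port: int = 1,
             root: str = "/sys/class/infiniband") -> Optional[str]:
    """Return the GID type string for one index, or None if it cannot be read."""
    path = (Path(root) / dev / "ports" / str(port)
            / "gid_attrs" / "types" / str(index))
    try:
        return path.read_text().strip()
    except OSError:
        # Missing device or unpopulated GID index
        return None


def is_rocev2(dev: str, index: int, port: int = 1,
              root: str = "/sys/class/infiniband") -> bool:
    """True when the GID at this index is reported as RoCE v2."""
    kind = gid_type(dev, index, port, root)
    return kind is not None and "v2" in kind


if __name__ == "__main__":
    # Prints False on machines without this device
    print("mlx5_0 index 3 is RoCE v2:", is_rocev2("mlx5_0", 3))
```

Running this for every HCA with the index from NCCL_IB_GID_INDEX catches the mismatch before a job starts.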


Wrapping up this section

You now have a complete model for what AI training is from the network's perspective:

  • What happens in a training step and when the network is involved
  • What collectives are and the traffic pattern each creates
  • How parallelism strategies map to traffic matrices
  • How to measure network contribution to training slowness

The next section goes into the layer that carries all of this — RDMA: kernel bypass, verbs, queue pairs, and memory regions.
