Common Failure Modes
After watching enough AI fabrics in production, the failures cluster into a small number of patterns. This page is the catalog. Each entry: what it looks like, what's underneath, where to look first.
Before the catalog — watch one of these failures get diagnosed live. A PFC storm starts on leaf-3; we fan out to find the worst-affected switch, drill into the port, LLDP-jump to the host behind it, find a CPU-bound trainer overflowing RDMA buffers, drain the node, watch the storm flatline:
- Recognize the seven failure patterns — PFC storm, hash polarization, slow-GPU job killer, NCCL timeout, throughput-half, intermittent packet loss, and cascading PFC, by symptom alone.
- Trace a PFC storm to its origin — find the first port to pause in the window, drill to the downstream server / VF / pod, and know that killing the offending pod recovers the fabric in seconds.
- Diagnose throughput-half in three commands —
lsmod | grep nvidia_peermem,NCCL_DEBUG=INFOfor "P2P/IPC disabled", andnvidia-smi topo -m, because it's almost always GPUDirect RDMA not being active. - Separate network faults from downstream symptoms — recognize that NCCL timeouts and slow ranks usually point to a crashed rank, NUMA mismatch, or flaky optic rather than the fabric itself.
Start here — symptom → cause triage
You're paged. You have one symptom and no time to read seven prose entries. Find the row that matches the first thing you observed, run the one quick check, and jump to the entry it points to.
| First symptom you see | Quick check (1 command) | Most likely mode | Jump to |
|---|---|---|---|
| Training throughput drops ~50%, GPUs at 97% util, no errors or PFC | lsmod | grep nvidia_peermem (AMD: check amd_peer_mem) — module missing? | Throughput-half / GPUDirect RDMA not active | throughput-half |
| PFC PAUSE counters climbing fast on one port, then spreading | Find the first port to pause in the window (sort by timestamp) | PFC storm | PFC storm |
| PFC pauses across many ports at once, all same priority, spreading from one origin | Trace propagation backward in time — who paused whom first? | Cascading PFC | cascading PFC |
| ECMP member ports wildly uneven — one at 95%, siblings at 30% | Per-link utilization on the ECMP group; sample a few flows' 5-tuples | Hash polarization | hash polarization |
Job dies with NCCL_TIMEOUT minutes-to-hours in, peers log "peer disconnected" | Logs of the rank that timed out first; did another rank crash / OOM? | NCCL timeout | NCCL timeout |
| One rank consistently 10–30% slower, others wait at the collective; cluster otherwise fine | Which rank, which server? GPU temps, NIC counters, dmesg on that node | Slow-GPU job killer | slow-GPU job killer |
out_of_buffer_discards creeping up over hours, no PFC, occasional NCCL retransmits | Which port increments it? CRC counters, optical levels below threshold? | Intermittent packet loss | intermittent packet loss |
The rest of this page is the long form of each row — symptom, what's underneath, where to look first, and how to mitigate.
1. PFC storm
Symptoms:
- PFC pause counters climbing fast across many ports
- Training throughput drops to a fraction of expected
- May see PFC watchdog firing (dropping paused traffic)
- Application sees RDMA timeouts on long-running operations
What's underneath:
A burst of congestion somewhere — a slow link, a misconfigured ECMP, a server with a stuck queue — causes a switch to PAUSE its upstream. That switch's upstream gets paused. Pauses propagate backward through the fabric. Eventually a large fraction of the fabric is paused, even though the original hot spot affects only one port.
First look:
- Find the first port that started pausing in the storm window (search by timestamp).
- Look at that port's downstream: is one server stuck, or one VF / pod misbehaving?
- Check whether one GPU job is hot-spotting one rail.
Mitigation:
- PFC watchdog (already enabled? timeout reasonable?) limits the duration.
- Aggressive ECN tuning prevents PFC from firing in the first place.
- If a single misbehaving pod is the root cause, kill it. The fabric recovers in seconds.
2. Hash polarization
Symptoms:
- One ECMP member port at 95% utilization, others at 30%
- Persistent — same ports overloaded step after step
- AllReduce time variance is high (some flows fast, some slow)
- PFC pauses concentrate on the overloaded ports
What's underneath:
The ECMP hash maps too many elephant flows to the same egress port. Could be:
- All flows have nearly identical 5-tuples (single QP, low entropy)
- Hash function genuinely collides for this specific traffic pattern
- A switch firmware bug in hash randomization
First look:
- Per-port utilization on the overloaded ECMP group.
- Sample a few flows: are they all the same source/dest pair, or scattered?
- Compare against another ECMP group: only this one bad, or systemic?
Mitigation:
- Increase
NCCL_IB_QPS_PER_CONNECTIONto get more 5-tuple variation - Tune ECMP hash to include UDP src port (NIC sets it varied per QP)
- Use a higher-radix switch to spread across more paths
- Longer-term: adaptive routing (Spectrum-X) or packet-sprayed transport (UEC, MRC)
3. The "slow GPU" job killer
Symptoms:
- One rank in the job consistently slower than others
- AllReduce time degraded by 10-30%, persistently
- Other jobs sharing the cluster unaffected
- Per-rank telemetry shows it specifically
What's underneath:
Could be many things:
- The slow GPU is in a different NUMA from its NIC (no GPUDirect)
- One NIC has a bad cable, optic, or port (slow but not dead)
- One server's PFC config is wrong (still drops occasionally)
- A pod scheduled poorly (CPU contention with another container)
- Thermal throttling on the GPU itself
First look:
- Which rank is slow? Which server is it on?
- On that server: GPU temps, NIC counters, PCIe error logs,
dmesg - Compare to a healthy server in the same job: what's different?
Mitigation:
- If hardware (cable / optic / NIC): replace.
- If config: fix the inconsistency (use Ansible / Salt to enforce config across the fleet).
- If scheduling: add topology hints to k8s; use Topology Manager.
4. NCCL timeout
Symptoms:
- Job dies with "NCCL_TIMEOUT" or "NCCL error 6: invalid usage"
- Happens minutes-to-hours into a training run, not at startup
- One pod logs a timeout; others log "communication peer disconnected"
What's underneath:
NCCL waited NCCL_IB_TIMEOUT seconds (default ~25) for a response and didn't get one. Could be:
- Actual network outage (link down, BGP flap, switch reboot)
- One rank crashed mid-job; others wait, time out
- A long PFC pause storm exceeded the timeout
- A specific QP / VF got into a bad state and isn't responding
First look:
- Logs of the rank that timed out first
- What happened on the network at that timestamp: BGP, link state, PFC?
- Did another rank crash / get OOM-killed?
Mitigation:
- If network: fix the underlying issue.
- If rank crash: increase memory or fix the application bug.
- Increase
NCCL_IB_TIMEOUTfor jobs known to have long pauses (but: this masks real issues; tune carefully).
5. The mysterious throughput-half problem
Symptoms:
- Job throughput is exactly half what's expected
- All ranks affected uniformly
- No errors, no timeouts, no PFC, no ECN
What's underneath:
99% of the time: GPUDirect RDMA isn't active. The data is bouncing through CPU DRAM, halving effective PCIe bandwidth.
First look:
lsmod | grep nvidia_peermem— is the module loaded?NCCL_DEBUG=INFO: search for "Using internal P2P plugin" vs "P2P/IPC disabled"nvidia-smi topo -m— does the topology suggest peer-to-peer is possible?
Mitigation:
- Load
nvidia_peermemmodule - Verify NIC and GPU are on the same PCIe root complex
- Update NVIDIA driver / CUDA / NCCL versions to match
6. The intermittent packet loss
Symptoms:
out_of_buffer_discardsincreasing slowly over hours- No PFC storms, no obvious congestion
- Training survives but with occasional NCCL retransmits
What's underneath:
A single port somewhere is dropping at low rate. Could be:
- A flaky optic (sometimes degrades over weeks)
- A switch buffer profile undersized for occasional microbursts
- A bad cable causing rare CRC errors
First look:
- Which port is incrementing the counter? Switch or NIC side?
- CRC error counters, optical levels (light below threshold?)
- Buffer occupancy peaks on that port — any spikes?
Mitigation:
- Swap the optic/cable. Most common fix.
- If buffer profile is the issue: increase headroom on that priority.
7. The cascading PFC
Symptoms:
- PFC pauses observed across many ports simultaneously
- All on the same priority class
- Spreading outward from a single starting point
What's underneath:
A real hot spot somewhere upstream — could be a job's elephant flow polarizing onto one port, or a single GPU's queue spilling.
First look:
- Trace the PFC propagation backward: who paused whom, in time order?
- The first port to pause is the origin. Look at what's downstream of it.
- Is one job's traffic responsible? (Check pod / VF mapping for that port.)
Mitigation:
- Identify the offending job/pod
- Migrate it to a different rail if persistent
- Tune ECN so DCQCN reacts faster (and PFC fires less often)
💡 What you should remember
| # | Concept | Why it matters | |
|---|---|---|---|
| 1 | 🔥 | PFC storms are the most common fabric-level failure. | They propagate fast and the root is often one bad port. |
| 2 | 🔀 | Hash polarization is chronic, not acute. | It accumulates lost throughput rather than crashing the job. |
| 3 | 🐌 | The "slow GPU" pattern accounts for huge production tax | config drift, bad cables, NUMA mismatches. |
| 4 | ⏱️ | NCCL timeouts | are rarely the network's fault; usually a downstream symptom of something else. |
| 5 | ⚡ | Throughput-half is almost always GPUDirect not active. | Check nvidia_peermem first. |
| 6 | 🔌 | Intermittent packet loss | usually a flaky optic. Swap it. |
| 7 | 🌊 | Cascading PFC | find the first port to pause; that's the root. |
Next: Incident Response Playbooks → — what to do when one of these fires at 3 AM.