Common Failure Modes
Watch enough AI fabrics in production and the failures cluster into a small number of patterns. This page is the catalog. Each entry: what it looks like, what's underneath, where to look first.
1. PFC storm
Symptoms:
- PFC pause counters climbing fast across many ports
- Training throughput drops to a fraction of expected
- May see PFC watchdog firing (dropping paused traffic)
- Application sees RDMA timeouts on long-running operations
What's underneath:
A burst of congestion somewhere — a slow link, a misconfigured ECMP, a server with a stuck queue — causes a switch to PAUSE its upstream. That switch's upstream gets paused. Pauses propagate backward through the fabric. Eventually a large fraction of the fabric is paused, even though the original hot spot affects only one port.
First look:
- Find the first port that started pausing in the storm window (search by timestamp).
- Look at that port's downstream: is one server stuck, or one VF / pod misbehaving?
- Check whether one GPU job is hot-spotting one rail.
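A quick way to localize the first pausing port from the host side, if switch telemetry is slow to arrive: poll the per-priority pause counters on each server's NIC and timestamp them. A minimal sketch, assuming ConnectX-style NICs whose ethtool -S output includes counters matching prioN_pause (counter names vary by driver) and an RDMA-facing interface name passed as the first argument:

```bash
#!/usr/bin/env bash
# Sketch: timestamped polling of per-priority PFC pause counters on one host.
# Assumptions: ConnectX-style NIC exposing rx_prioN_pause / tx_prioN_pause via
# ethtool -S (names differ by driver/firmware); adjust IFACE and INTERVAL.
IFACE="${1:-eth0}"
INTERVAL="${2:-1}"

while true; do
    ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)
    # Print only the pause-related counters so deltas are easy to diff later.
    ethtool -S "$IFACE" 2>/dev/null \
        | grep -iE 'prio[0-9]+_pause' \
        | sed "s/^/$ts $IFACE /"
    sleep "$INTERVAL"
done
```

A host whose tx pause counters start climbing first (the NIC telling its switch to stop sending) is a good candidate for the stuck receiver behind the origin port.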
Mitigation:
- PFC watchdog (already enabled? timeout reasonable?) limits the duration.
- Aggressive ECN tuning prevents PFC from firing in the first place.
- If a single misbehaving pod is the root cause, kill it. The fabric recovers in seconds.
2. Hash polarization
Symptoms:
- One ECMP member port at 95% utilization, others at 30%
- Persistent — same ports overloaded step after step
- AllReduce time variance is high (some flows fast, some slow)
- PFC pauses concentrate on the overloaded ports
What's underneath:
The ECMP hash maps too many elephant flows to the same egress port. Could be:
- All flows have nearly identical 5-tuples (single QP, low entropy)
- Hash function genuinely collides for this specific traffic pattern
- A switch firmware bug in hash randomization
First look:
- Per-port utilization on the overloaded ECMP group.
- Sample a few flows: are they all the same source/dest pair, or scattered?
- Compare against another ECMP group: only this one bad, or systemic?
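To check the entropy question directly, sample the RoCEv2 traffic and count distinct UDP source ports. A rough sketch, assuming RoCEv2 (UDP destination port 4791), tcpdump available on the host, and the RDMA-facing interface name as the first argument:

```bash
#!/usr/bin/env bash
# Sketch: sample RoCEv2 traffic and count how many distinct UDP source ports
# carry it. ECMP entropy usually comes from the UDP source port, which the NIC
# varies per QP; a handful of ports carrying most packets means low entropy
# and polarization risk.
IFACE="${1:-eth0}"

sudo timeout 10 tcpdump -i "$IFACE" -nn -c 5000 'udp dst port 4791' 2>/dev/null \
    | awk '{print $3}' \
    | awk -F. '{print $NF}' \
    | sort | uniq -c | sort -rn | head -20
# Column 3 of tcpdump's output is "src_ip.src_port"; the second awk keeps only the port.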
Mitigation:
- Increase NCCL_IB_QPS_PER_CONNECTION to get more 5-tuple variation (see the sketch after this list)
- Tune the ECMP hash to include the UDP src port (the NIC sets it varied per QP)
- Use a higher-radix switch to spread across more paths
- Longer-term: adaptive routing (Spectrum-X) or packet-sprayed transport (UEC, MRC)
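The NCCL-side knob from the first item, as it would appear in a job's environment. The value is illustrative only, and the switch-side half of the fix (including the UDP source port in the ECMP hash) is vendor-specific CLI that isn't shown here:

```bash
# Sketch: spread each NCCL connection over several queue pairs so the RoCEv2
# UDP source port varies more and the ECMP hash has more to work with.
# 4 is a starting point, not a recommendation; measure AllReduce variance
# before and after.
export NCCL_IB_QPS_PER_CONNECTION=4
export NCCL_DEBUG=INFO   # confirm in the init logs that the setting took effect
```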
3. The "slow GPU" job killer
Symptoms:
- One rank in the job consistently slower than others
- AllReduce time degraded by 10-30%, persistently
- Other jobs sharing the cluster unaffected
- Per-rank telemetry shows it specifically
What's underneath:
Could be many things:
- The slow GPU is on a different NUMA node from its NIC (no GPUDirect)
- One NIC has a bad cable, optic, or port (slow but not dead)
- One server's PFC config is wrong (still drops occasionally)
- A pod scheduled poorly (CPU contention with another container)
- Thermal throttling on the GPU itself
First look:
- Which rank is slow? Which server is it on?
- On that server: GPU temps, NIC counters, PCIe error logs, dmesg (see the sketch after this list)
- Compare to a healthy server in the same job: what's different?
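A one-shot triage script makes the healthy-vs-slow comparison mechanical: run it on both hosts and diff the output. A sketch, assuming NVIDIA tooling is installed and the RDMA-facing interface name is the first argument:

```bash
#!/usr/bin/env bash
# Sketch: quick host triage for a suspected slow rank.
IFACE="${1:-eth0}"

echo "== GPU clocks / temps / throttle reasons =="
# Field names vary slightly across driver versions.
nvidia-smi --query-gpu=index,temperature.gpu,clocks.sm,clocks_throttle_reasons.active \
    --format=csv

echo "== GPU <-> NIC topology (PIX/PXB good; NODE/SYS means crossing the CPU) =="
nvidia-smi topo -m

echo "== NIC error / pause / discard counters (non-zero only) =="
ethtool -S "$IFACE" | grep -iE 'err|discard|pause|drop' | grep -v ': 0$'

echo "== Recent PCIe / AER / link messages =="
dmesg -T | grep -iE 'pcie|aer|link.*(down|reset)' | tail -20
```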
Mitigation:
- If hardware (cable / optic / NIC): replace.
- If config: fix the inconsistency (use Ansible / Salt to enforce config across the fleet).
- If scheduling: add topology hints to k8s; use Topology Manager.
4. NCCL timeout
Symptoms:
- Job dies with "NCCL_TIMEOUT" or "NCCL error 6: invalid usage"
- Happens minutes-to-hours into a training run, not at startup
- One pod logs a timeout; others log "communication peer disconnected"
What's underneath:
NCCL waited out its IB transport timeout (set via NCCL_IB_TIMEOUT) without getting a response. Could be:
- Actual network outage (link down, BGP flap, switch reboot)
- One rank crashed mid-job; others wait, time out
- A long PFC pause storm exceeded the timeout
- A specific QP / VF got into a bad state and isn't responding
First look:
- Logs of the rank that timed out first
- What happened on the network at that timestamp: BGP, link state, PFC?
- Did another rank crash / get OOM-killed?
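A sketch of the first step, assuming one log file per rank under a hypothetical logs/ directory (adapt the glob and pattern to whatever your launcher writes):

```bash
# Sketch: find which rank reported the timeout first. The earliest timestamp
# among these lines is the rank to start with.
grep -H -m1 -iE 'NCCL.*(timeout|timed out)' logs/rank-*.log

# Cross-check the same minute elsewhere: switch syslog / BGP / link-state
# events, PFC counters, and whether any peer rank was OOM-killed:
dmesg -T | grep -i 'out of memory' | tail -5
```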
Mitigation:
- If network: fix the underlying issue.
- If rank crash: increase memory or fix the application bug.
- Increase NCCL_IB_TIMEOUT for jobs known to have long pauses (but: this masks real issues; tune carefully — see the sketch below)
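What that looks like in practice. Note that NCCL_IB_TIMEOUT is an exponent rather than a seconds value (the verbs-level timeout is roughly 4.096 µs × 2^N), so small increments go a long way. The value below is illustrative, not a recommendation:

```bash
# Sketch: raise the IB transport timeout for a job known to sit through long
# pauses. Bumping the exponent by one doubles the wait, and a higher value
# mostly hides whatever is actually stalling the fabric.
export NCCL_IB_TIMEOUT=22
# NCCL_IB_RETRY_CNT is the companion knob: how many retries before the error surfaces.
```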
5. The mysterious throughput-half problem
Symptoms:
- Job throughput is exactly half what's expected
- All ranks affected uniformly
- No errors, no timeouts, no PFC, no ECN
What's underneath:
99% of the time: GPUDirect RDMA isn't active. The data is bouncing through CPU DRAM, halving effective PCIe bandwidth.
First look:
- lsmod | grep nvidia_peermem — is the module loaded?
- NCCL_DEBUG=INFO: search for "Using internal P2P plugin" vs "P2P/IPC disabled"
- nvidia-smi topo -m — does the topology suggest peer-to-peer is possible?
Mitigation:
- Load the nvidia_peermem module
- Verify the NIC and GPU are on the same PCIe root complex
- Update NVIDIA driver / CUDA / NCCL versions to match
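Wrapped up as a check-and-fix sketch, assuming a driver recent enough to ship nvidia_peermem (older stacks used the separate nv_peer_mem package instead):

```bash
#!/usr/bin/env bash
# Sketch: confirm and enable the GPUDirect RDMA path.
set -e

# 1. Is the peer-memory module loaded? Load it if not.
lsmod | grep -q nvidia_peermem || sudo modprobe nvidia_peermem

# 2. Does the topology allow peer-to-peer? GPU <-> NIC pairs showing PIX/PXB
#    share a PCIe switch / root complex; pairs crossing the CPU (NODE/SYS)
#    will not get the full benefit.
nvidia-smi topo -m

# 3. Re-run the job with NCCL_DEBUG=INFO and check the init lines for whether
#    the NET/IB transport reports GPUDirect RDMA ("GDRDMA") being used.
```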
6. The intermittent packet loss
Symptoms:
- out_of_buffer_discards increasing slowly over hours
- No PFC storms, no obvious congestion
- Training survives but with occasional NCCL retransmits
What's underneath:
A single port somewhere is dropping at low rate. Could be:
- A flaky optic (sometimes degrades over weeks)
- A switch buffer profile undersized for occasional microbursts
- A bad cable causing rare CRC errors
First look:
- Which port is incrementing the counter? Switch or NIC side?
- CRC error counters, optical levels (light below threshold?)
- Buffer occupancy peaks on that port — any spikes?
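The host-side half of that first look, as a sketch. Counter names vary by NIC driver, not every optic exposes DOM data, and the interface name is an assumption:

```bash
#!/usr/bin/env bash
# Sketch: host-side checks for a suspect port.
IFACE="${1:-eth0}"

echo "== Error / discard counters (non-zero only) =="
ethtool -S "$IFACE" | grep -iE 'crc|symbol|discard|phy.*err' | grep -v ': 0$'

echo "== Optics DOM data: rx/tx power, temperature, alarm thresholds =="
# ethtool -m reads the transceiver EEPROM; not every optic/NIC exposes it.
ethtool -m "$IFACE" | grep -iE 'power|temperature|alarm|warning' || true
```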
Mitigation:
- Swap the optic/cable. Most common fix.
- If buffer profile is the issue: increase headroom on that priority.
7. The cascading PFC
Symptoms:
- PFC pauses observed across many ports simultaneously
- All on the same priority class
- Spreading outward from a single starting point
What's underneath:
A real hot spot somewhere upstream — could be a job's elephant flow polarizing onto one port, or a single GPU's queue spilling.
First look:
- Trace the PFC propagation backward: who paused whom, in time order?
- The first port to pause is the origin. Look at what's downstream of it.
- Is one job's traffic responsible? (Check pod / VF mapping for that port.)
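If PFC counters are already exported somewhere central, ordering them by timestamp is enough to find the origin. A sketch, assuming a hypothetical CSV export with one row per sample of the form timestamp,switch,port,priority,pause_delta and sortable (ISO) timestamps:

```bash
# Sketch: the first row whose pause delta goes non-zero is the origin port;
# everything after it is propagation. Adapt the column layout to whatever
# your telemetry pipeline (gNMI, SNMP, vendor streaming) actually emits.
awk -F, '$5 > 0' pfc_samples.csv | sort -t, -k1 | head -20
```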
Mitigation:
- Identify the offending job/pod
- Migrate it to a different rail if persistent
- Tune ECN so DCQCN reacts faster (and PFC fires less often)
What you should remember
- PFC storms are the most common fabric-level failure. They propagate fast and the root is often one bad port.
- Hash polarization is chronic, not acute. It accumulates lost throughput rather than crashing the job.
- The "slow GPU" pattern accounts for huge production tax — config drift, bad cables, NUMA mismatches.
- NCCL timeouts are rarely the network's fault; usually a downstream symptom of something else.
- Throughput-half is almost always GPUDirect not active. Check nvidia_peermem first.
- Intermittent packet loss — usually a flaky optic. Swap it.
- Cascading PFC — find the first port to pause; that's the root.
Next: Incident Response Playbooks → — what to do when one of these fires at 3 AM.