
Common Failure Modes

Watch enough AI fabrics in production and the failures cluster into a small number of patterns. This page is the catalog. Each entry: what it looks like, what's underneath, where to look first.


1. PFC storm

Symptoms:

  • PFC pause counters climbing fast across many ports
  • Training throughput drops to a fraction of what's expected
  • May see PFC watchdog firing (dropping paused traffic)
  • Application sees RDMA timeouts on long-running operations

What's underneath:

A burst of congestion somewhere — a slow link, a misconfigured ECMP group, a server with a stuck queue — causes a switch to send PFC pauses to its upstream neighbor. That neighbor's buffers fill, so it pauses its own upstream. Pauses propagate backward through the fabric until a large fraction of it is paused, even though the original hot spot affects only one port.

First look:

  1. Find the first port that started pausing in the storm window (search by timestamp).
  2. Look at that port's downstream: is one server stuck, or one VF / pod misbehaving?
  3. Check whether one GPU job is hot-spotting one rail.
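
A NIC-side spot check to go with steps 1 and 2, assuming Linux servers with mlx5 NICs (counter names vary by driver; fabric-wide timestamps have to come from your switch telemetry):

  # Snapshot per-priority PFC pause counters, wait, and diff to see which
  # direction is pausing right now (tx_prioN_pause: this port is sending
  # pauses; rx_prioN_pause: the neighbor is pausing this port).
  IFACE=eth2   # hypothetical interface name
  ethtool -S "$IFACE" | grep -i pause > /tmp/pause.before
  sleep 5
  ethtool -S "$IFACE" | grep -i pause > /tmp/pause.after
  diff /tmp/pause.before /tmp/pause.after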

Mitigation:

  • PFC watchdog (already enabled? timeout reasonable?) limits the duration.
  • Aggressive ECN tuning prevents PFC from firing in the first place.
  • If a single misbehaving pod is the root cause, kill it. The fabric recovers in seconds.

2. Hash polarization

Symptoms:

  • One ECMP member port at 95% utilization, others at 30%
  • Persistent — same ports overloaded step after step
  • AllReduce time variance is high (some flows fast, some slow)
  • PFC pauses concentrate on the overloaded ports

What's underneath:

The ECMP hash maps too many elephant flows to the same egress port. Could be:

  • All flows have nearly identical 5-tuples (single QP, low entropy)
  • Hash function genuinely collides for this specific traffic pattern
  • A switch firmware bug in hash randomization

First look:

  1. Per-port utilization on the overloaded ECMP group.
  2. Sample a few flows: are they all the same source/dest pair, or scattered?
  3. Compare against another ECMP group: only this one bad, or systemic?
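
A sketch of step 1, assuming a Linux-based NOS where front-panel ports appear as netdevs (port names are hypothetical; on other switches, use the vendor's per-port counters):

  # Sample tx_bytes twice, five seconds apart, and print per-port throughput.
  PORTS="Ethernet0 Ethernet4 Ethernet8 Ethernet12"
  for p in $PORTS; do
    b1=$(cat /sys/class/net/$p/statistics/tx_bytes)
    sleep 5
    b2=$(cat /sys/class/net/$p/statistics/tx_bytes)
    echo "$p: $(( (b2 - b1) * 8 / 5 / 1000000 )) Mbit/s"
  done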

Mitigation:

  • Increase NCCL_IB_QPS_PER_CONNECTION to get more 5-tuple variation (see the sketch after this list)
  • Tune the ECMP hash to include the UDP source port (the NIC varies it per QP)
  • Use a higher-radix switch to spread across more paths
  • Longer-term: adaptive routing (Spectrum-X) or packet-sprayed transport (UEC, MRC)
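
A sketch of the first mitigation as job-launch environment variables (values are illustrative and the launch command is hypothetical; measure before standardizing):

  # More QPs per connection means more distinct 5-tuples for ECMP to hash on.
  export NCCL_IB_QPS_PER_CONNECTION=4
  export NCCL_IB_SPLIT_DATA_ON_QPS=1   # spread each message across those QPs
  torchrun --nproc_per_node=8 train.py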

3. The "slow GPU" job killer

Symptoms:

  • One rank in the job consistently slower than others
  • AllReduce time degraded by 10-30%, persistently
  • Other jobs sharing the cluster unaffected
  • Per-rank telemetry shows it specifically

What's underneath:

Could be many things:

  • The slow GPU is on a different NUMA node from its NIC, so the GPUDirect path crosses the CPU interconnect
  • One NIC has a bad cable, optic, or port (slow but not dead)
  • One server's PFC config is wrong (so it still drops packets occasionally)
  • A pod scheduled poorly (CPU contention with another container)
  • Thermal throttling on the GPU itself

First look:

  1. Which rank is slow? Which server is it on?
  2. On that server: GPU temps, NIC counters, PCIe error logs, dmesg
  3. Compare to a healthy server in the same job: what's different?
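
A sketch of step 2 on the suspect server (interface name is hypothetical; counter names vary by NIC driver):

  # GPU temperatures and active throttle reasons
  nvidia-smi --query-gpu=index,temperature.gpu,clocks_throttle_reasons.active --format=csv
  # NIC error, discard, and pause counters (hide the zeroes)
  ethtool -S eth2 | grep -Ei 'err|discard|pause' | grep -v ': 0$'
  # PCIe AER complaints and GPU Xid errors in the kernel log
  dmesg -T | grep -Ei 'aer|pcie bus error|xid'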

Mitigation:

  • If hardware (cable / optic / NIC): replace.
  • If config: fix the inconsistency (use Ansible / Salt to enforce config across the fleet).
  • If scheduling: add topology hints to k8s; use Topology Manager.
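
A minimal kubelet configuration sketch for the scheduling case, assuming you want GPU, NIC, and CPUs pinned to one NUMA node (fragment only; the CPU manager must be static for the topology policy to take effect):

  # /var/lib/kubelet/config.yaml (fragment)
  apiVersion: kubelet.config.k8s.io/v1beta1
  kind: KubeletConfiguration
  cpuManagerPolicy: static
  topologyManagerPolicy: single-numa-node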

4. NCCL timeout

Symptoms:

  • Job dies with a NCCL timeout or an "invalid usage" style NCCL error
  • Happens minutes-to-hours into a training run, not at startup
  • One pod logs a timeout; others log "communication peer disconnected"

What's underneath:

NCCL waited out its InfiniBand transport timeout without getting a completion (controlled by NCCL_IB_TIMEOUT, which is an exponent applied to a 4.096 µs base rather than a count of seconds). Could be:

  • Actual network outage (link down, BGP flap, switch reboot)
  • One rank crashed mid-job; others wait, time out
  • A long PFC pause storm exceeded the timeout
  • A specific QP / VF got into a bad state and isn't responding

First look:

  1. Logs of the rank that timed out first
  2. What happened on the network at that timestamp: BGP, link state, PFC?
  3. Did another rank crash / get OOM-killed?
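
A sketch of steps 2 and 3, assuming systemd hosts with journald (the time window and grep patterns are illustrative; message strings vary by driver, NOS, and distro):

  SINCE="2024-05-01 03:10"
  # Link flaps and BGP churn around the incident window
  journalctl -k --since "$SINCE" | grep -Ei 'link (is )?(down|up)'
  journalctl --since "$SINCE" | grep -Ei 'bgp|adjacency'
  # Did the OOM killer take out a rank on this host?
  journalctl -k --since "$SINCE" | grep -Ei 'out of memory|oom-kill|killed process'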

Mitigation:

  • If network: fix the underlying issue.
  • If rank crash: increase memory or fix the application bug.
  • Increase NCCL_IB_TIMEOUT for jobs known to have long pauses (but: this masks real issues; tune carefully).
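
A sketch of the last mitigation (the value is illustrative; NCCL_IB_TIMEOUT is the IB QP timeout exponent on a 4.096 µs base, so each +1 roughly doubles the tolerated stall):

  export NCCL_IB_TIMEOUT=22    # tolerate longer pauses before erroring out
  export NCCL_IB_RETRY_CNT=7   # transport retries before NCCL reports an error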

5. The mysterious half-throughput problem

Symptoms:

  • Job throughput is exactly half what's expected
  • All ranks affected uniformly
  • No errors, no timeouts, no PFC, no ECN

What's underneath:

99% of the time: GPUDirect RDMA isn't active. Data is being staged through a bounce buffer in CPU DRAM, roughly halving effective bandwidth over the PCIe path.

First look:

  1. lsmod | grep nvidia_peermem — is the module loaded?
  2. NCCL_DEBUG=INFO: do the transport lines mention GDRDMA (GPUDirect RDMA in use), or only plain NET/IB (staging through host memory)?
  3. nvidia-smi topo -m — is the GPU-to-NIC path PIX/PXB (stays below the CPU), or PHB/SYS (crosses it)?
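
The same checks as a copy-pasteable sketch, assuming mlx5 NICs and a local build of nccl-tests (the all_reduce_perf path and sizes are illustrative):

  # 1. Is the GPUDirect RDMA kernel module present?
  lsmod | grep nvidia_peermem
  # 2. Do GPU and NIC share a PCIe switch / root complex?
  nvidia-smi topo -m
  # 3. Does NCCL actually take the GPUDirect path?
  NCCL_DEBUG=INFO ./all_reduce_perf -b 1G -e 1G -g 8 2>&1 | grep -i gdrdma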

Mitigation:

  • Load nvidia_peermem module
  • Verify NIC and GPU are on the same PCIe root complex
  • Update NVIDIA driver / CUDA / NCCL versions to match
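
A sketch of the first mitigation (recent driver stacks ship the module as nvidia_peermem; older ones use a separately packaged nv_peer_mem):

  sudo modprobe nvidia_peermem
  # persist across reboots
  echo nvidia_peermem | sudo tee /etc/modules-load.d/nvidia_peermem.conf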

6. The intermittent packet loss

Symptoms:

  • out_of_buffer_discards increasing slowly over hours
  • No PFC storms, no obvious congestion
  • Training survives but with occasional NCCL retransmits

What's underneath:

A single port somewhere is dropping at a low rate. Could be:

  • A flaky optic (sometimes degrades over weeks)
  • A switch buffer profile undersized for occasional microbursts
  • A bad cable causing rare CRC errors

First look:

  1. Which port is incrementing the counter? Switch or NIC side?
  2. CRC error counters, optical levels (light below threshold?)
  3. Buffer occupancy peaks on that port — any spikes?
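
A NIC-side sketch of steps 1 and 2, assuming mlx5 devices and a DOM-capable optic (interface name is hypothetical; counter paths differ on other NICs):

  # Which device is running out of receive buffers?
  grep -H . /sys/class/infiniband/mlx5_*/ports/1/hw_counters/out_of_buffer
  # CRC and symbol errors on the suspect interface (hide the zeroes)
  ethtool -S eth2 | grep -Ei 'crc|symbol' | grep -v ': 0$'
  # Optical power readings from the module's digital diagnostics
  ethtool -m eth2 | grep -i power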

Mitigation:

  • Swap the optic/cable. Most common fix.
  • If buffer profile is the issue: increase headroom on that priority.

7. The cascading PFC

Symptoms:

  • PFC pauses observed across many ports simultaneously
  • All on the same priority class
  • Spreading outward from a single starting point

What's underneath:

A real hot spot somewhere upstream — could be a job's elephant flow polarizing onto one port, or a single GPU's queue spilling.

First look:

  1. Trace the PFC propagation backward: who paused whom, in time order?
  2. The first port to pause is the origin. Look at what's downstream of it.
  3. Is one job's traffic responsible? (Check pod / VF mapping for that port.)
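
A sketch of step 3, assuming LLDP is running and an SR-IOV setup where pods get VFs (the PF name is hypothetical; matching a VF MAC to a pod depends on your CNI):

  lldpctl                       # on the suspect server: confirm which switch port it hangs off
  ip link show dev eth2         # lists each SR-IOV VF and its MAC address
  kubectl get pods -A -o wide   # narrow to pods on that node, then match the VF by annotation/MAC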

Mitigation:

  • Identify the offending job/pod
  • Migrate it to a different rail if persistent
  • Tune ECN so DCQCN reacts faster (and PFC fires less often)

What you should remember

  • PFC storms are the most common fabric-level failure. They propagate fast and the root is often one bad port.
  • Hash polarization is chronic, not acute. It accumulates lost throughput rather than crashing the job.
  • The "slow GPU" pattern accounts for huge production tax — config drift, bad cables, NUMA mismatches.
  • NCCL timeouts are rarely the network's fault; usually a downstream symptom of something else.
  • Half throughput is almost always GPUDirect not being active. Check nvidia_peermem first.
  • Intermittent packet loss — usually a flaky optic. Swap it.
  • Cascading PFC — find the first port to pause; that's the root.

Next: Incident Response Playbooks → what to do when one of these fires at 3 AM.