
Common Failure Modes

Watch enough AI fabrics in production and the failures cluster into a small number of patterns. This page is the catalog. Each entry: what it looks like, what's underneath, where to look first.


1. PFC storm

Symptoms:

  • PFC pause counters climbing fast across many ports
  • Training throughput drops to a fraction of what's expected
  • May see PFC watchdog firing (dropping paused traffic)
  • Application sees RDMA timeouts on long-running operations

What's underneath:

A burst of congestion somewhere — a slow link, a misconfigured ECMP group, a server with a stuck queue — causes a switch to send PFC pauses to its upstream neighbor. That neighbor's buffers fill, so it pauses its own upstream. Pauses propagate backward through the fabric until a large fraction of it is paused, even though the original hot spot affects only one port.

First look:

  1. Find the first port that started pausing in the storm window (search by timestamp).
  2. Look at that port's downstream: is one server stuck, or one VF / pod misbehaving?
  3. Check whether one GPU job is hot-spotting one rail.
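
A NIC-side spot check to go with steps 1 and 2, assuming Linux servers with mlx5 NICs (counter names vary by driver; fabric-wide timestamps have to come from your switch telemetry):

  # Snapshot per-priority PFC pause counters, wait, and diff to see which
  # direction is pausing right now (tx_prioN_pause: this port is sending
  # pauses; rx_prioN_pause: the neighbor is pausing this port).
  IFACE=eth2   # hypothetical interface name
  ethtool -S "$IFACE" | grep -i pause > /tmp/pause.before
  sleep 5
  ethtool -S "$IFACE" | grep -i pause > /tmp/pause.after
  diff /tmp/pause.before /tmp/pause.after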

Mitigation:

  • PFC watchdog (already enabled? timeout reasonable?) limits the duration.
  • Aggressive ECN tuning prevents PFC from firing in the first place.
  • If a single misbehaving pod is the root cause, kill it. The fabric recovers in seconds.

2. Hash polarization

Symptoms:

  • One ECMP member port at 95% utilization, others at 30%
  • Persistent — same ports overloaded step after step
  • AllReduce time variance is high (some flows fast, some slow)
  • PFC pauses concentrate on the overloaded ports

What's underneath:

The ECMP hash maps too many elephant flows to the same egress port. Could be:

  • All flows have nearly identical 5-tuples (single QP, low entropy)
  • Hash function genuinely collides for this specific traffic pattern
  • A switch firmware bug in hash randomization

First look:

  1. Per-port utilization on the overloaded ECMP group.
  2. Sample a few flows: are they all the same source/dest pair, or scattered?
  3. Compare against another ECMP group: only this one bad, or systemic?
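
A sketch of step 1, assuming a Linux-based NOS where front-panel ports appear as netdevs (port names are hypothetical; on other switches, use the vendor's per-port counters):

  # Sample tx_bytes twice, five seconds apart, and print per-port throughput.
  PORTS="Ethernet0 Ethernet4 Ethernet8 Ethernet12"
  for p in $PORTS; do
    b1=$(cat /sys/class/net/$p/statistics/tx_bytes)
    sleep 5
    b2=$(cat /sys/class/net/$p/statistics/tx_bytes)
    echo "$p: $(( (b2 - b1) * 8 / 5 / 1000000 )) Mbit/s"
  done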

Mitigation:

  • Increase NCCL_IB_QPS_PER_CONNECTION to get more 5-tuple variation (see the sketch after this list)
  • Tune the ECMP hash to include the UDP source port (the NIC varies it per QP)
  • Use a higher-radix switch to spread across more paths
  • Longer-term: adaptive routing (Spectrum-X) or packet-sprayed transport (UEC, MRC)
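
A sketch of the first mitigation as job-launch environment variables (values are illustrative and the launch command is hypothetical; measure before standardizing):

  # More QPs per connection means more distinct 5-tuples for ECMP to hash on.
  export NCCL_IB_QPS_PER_CONNECTION=4
  export NCCL_IB_SPLIT_DATA_ON_QPS=1   # spread each message across those QPs
  torchrun --nproc_per_node=8 train.py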

3. The "slow GPU" job killer

Symptoms:

  • One rank in the job consistently slower than others
  • AllReduce time degraded by 10-30%, persistently
  • Other jobs sharing the cluster unaffected
  • Per-rank telemetry shows it specifically

What's underneath:

Could be many things:

  • The slow GPU is on a different NUMA node from its NIC, so the GPUDirect path crosses the CPU interconnect
  • One NIC has a bad cable, optic, or port (slow but not dead)
  • One server's PFC config is wrong (so it still drops packets occasionally)
  • A pod scheduled poorly (CPU contention with another container)
  • Thermal throttling on the GPU itself

First look:

  1. Which rank is slow? Which server is it on?
  2. On that server: GPU temps, NIC counters, PCIe error logs, dmesg
  3. Compare to a healthy server in the same job: what's different?
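
A sketch of step 2 on the suspect server (interface name is hypothetical; counter names vary by NIC driver):

  # GPU temperatures and active throttle reasons
  nvidia-smi --query-gpu=index,temperature.gpu,clocks_throttle_reasons.active --format=csv
  # NIC error, discard, and pause counters (hide the zeroes)
  ethtool -S eth2 | grep -Ei 'err|discard|pause' | grep -v ': 0$'
  # PCIe AER complaints and GPU Xid errors in the kernel log
  dmesg -T | grep -Ei 'aer|pcie bus error|xid'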

Mitigation:

  • If hardware (cable / optic / NIC): replace.
  • If config: fix the inconsistency (use Ansible / Salt to enforce config across the fleet).
  • If scheduling: add topology hints to k8s; use Topology Manager.
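
A minimal kubelet configuration sketch for the scheduling case, assuming you want GPU, NIC, and CPUs pinned to one NUMA node (fragment only; the CPU manager must be static for the topology policy to take effect):

  # /var/lib/kubelet/config.yaml (fragment)
  apiVersion: kubelet.config.k8s.io/v1beta1
  kind: KubeletConfiguration
  cpuManagerPolicy: static
  topologyManagerPolicy: single-numa-node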

4. NCCL timeout

Symptoms:

  • Job dies with a NCCL timeout or an "invalid usage" style NCCL error
  • Happens minutes-to-hours into a training run, not at startup
  • One pod logs a timeout; others log "communication peer disconnected"

What's underneath:

NCCL waited out its InfiniBand transport timeout without getting a completion (controlled by NCCL_IB_TIMEOUT, which is an exponent applied to a 4.096 µs base rather than a count of seconds). Could be:

  • Actual network outage (link down, BGP flap, switch reboot)
  • One rank crashed mid-job; others wait, time out
  • A long PFC pause storm exceeded the timeout
  • A specific QP / VF got into a bad state and isn't responding

First look:

  1. Logs of the rank that timed out first
  2. What happened on the network at that timestamp: BGP, link state, PFC?
  3. Did another rank crash / get OOM-killed?
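
A sketch of steps 2 and 3, assuming systemd hosts with journald (the time window and grep patterns are illustrative; message strings vary by driver, NOS, and distro):

  SINCE="2024-05-01 03:10"
  # Link flaps and BGP churn around the incident window
  journalctl -k --since "$SINCE" | grep -Ei 'link (is )?(down|up)'
  journalctl --since "$SINCE" | grep -Ei 'bgp|adjacency'
  # Did the OOM killer take out a rank on this host?
  journalctl -k --since "$SINCE" | grep -Ei 'out of memory|oom-kill|killed process'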

Mitigation:

  • If network: fix the underlying issue.
  • If rank crash: increase memory or fix the application bug.
  • Increase NCCL_IB_TIMEOUT for jobs known to have long pauses (but: this masks real issues; tune carefully).
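
A sketch of the last mitigation (the value is illustrative; NCCL_IB_TIMEOUT is the IB QP timeout exponent on a 4.096 µs base, so each +1 roughly doubles the tolerated stall):

  export NCCL_IB_TIMEOUT=22    # tolerate longer pauses before erroring out
  export NCCL_IB_RETRY_CNT=7   # transport retries before NCCL reports an error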

5. The mysterious half-throughput problem

Symptoms:

  • Job throughput is exactly half what's expected
  • All ranks affected uniformly
  • No errors, no timeouts, no PFC, no ECN

What's underneath:

99% of the time: GPUDirect RDMA isn't active. Data is being staged through a bounce buffer in CPU DRAM, roughly halving effective bandwidth over the PCIe path.

First look:

  1. lsmod | grep nvidia_peermem — is the module loaded?
  2. NCCL_DEBUG=INFO: do the transport lines mention GDRDMA (GPUDirect RDMA in use), or only plain NET/IB (staging through host memory)?
  3. nvidia-smi topo -m — is the GPU-to-NIC path PIX/PXB (stays below the CPU), or PHB/SYS (crosses it)?
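
The same checks as a copy-pasteable sketch, assuming mlx5 NICs and a local build of nccl-tests (the all_reduce_perf path and sizes are illustrative):

  # 1. Is the GPUDirect RDMA kernel module present?
  lsmod | grep nvidia_peermem
  # 2. Do GPU and NIC share a PCIe switch / root complex?
  nvidia-smi topo -m
  # 3. Does NCCL actually take the GPUDirect path?
  NCCL_DEBUG=INFO ./all_reduce_perf -b 1G -e 1G -g 8 2>&1 | grep -i gdrdma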

Mitigation:

  • Load nvidia_peermem module
  • Verify NIC and GPU are on the same PCIe root complex
  • Update NVIDIA driver / CUDA / NCCL versions to match
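
A sketch of the first mitigation (recent driver stacks ship the module as nvidia_peermem; older ones use a separately packaged nv_peer_mem):

  sudo modprobe nvidia_peermem
  # persist across reboots
  echo nvidia_peermem | sudo tee /etc/modules-load.d/nvidia_peermem.conf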

6. The intermittent packet loss

Symptoms:

  • out_of_buffer_discards increasing slowly over hours
  • No PFC storms, no obvious congestion
  • Training survives but with occasional NCCL retransmits

What's underneath:

A single port somewhere is dropping at a low rate. Could be:

  • A flaky optic (sometimes degrades over weeks)
  • A switch buffer profile undersized for occasional microbursts
  • A bad cable causing rare CRC errors

First look:

  1. Which port is incrementing the counter? Switch or NIC side?
  2. CRC error counters, optical levels (light below threshold?)
  3. Buffer occupancy peaks on that port — any spikes?
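
A NIC-side sketch of steps 1 and 2, assuming mlx5 devices and a DOM-capable optic (interface name is hypothetical; counter paths differ on other NICs):

  # Which device is running out of receive buffers?
  grep -H . /sys/class/infiniband/mlx5_*/ports/1/hw_counters/out_of_buffer
  # CRC and symbol errors on the suspect interface (hide the zeroes)
  ethtool -S eth2 | grep -Ei 'crc|symbol' | grep -v ': 0$'
  # Optical power readings from the module's digital diagnostics
  ethtool -m eth2 | grep -i power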

Mitigation:

  • Swap the optic/cable. Most common fix.
  • If buffer profile is the issue: increase headroom on that priority.

7. The cascading PFC

Symptoms:

  • PFC pauses observed across many ports simultaneously
  • All on the same priority class
  • Spreading outward from a single starting point

What's underneath:

A real hot spot somewhere upstream — could be a job's elephant flow polarizing onto one port, or a single GPU's queue spilling.

First look:

  1. Trace the PFC propagation backward: who paused whom, in time order?
  2. The first port to pause is the origin. Look at what's downstream of it.
  3. Is one job's traffic responsible? (Check pod / VF mapping for that port.)
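
A sketch of step 3, assuming LLDP is running and an SR-IOV setup where pods get VFs (the PF name is hypothetical; matching a VF MAC to a pod depends on your CNI):

  lldpctl                       # on the suspect server: confirm which switch port it hangs off
  ip link show dev eth2         # lists each SR-IOV VF and its MAC address
  kubectl get pods -A -o wide   # narrow to pods on that node, then match the VF by annotation/MAC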

Mitigation:

  • Identify the offending job/pod
  • Migrate it to a different rail if persistent
  • Tune ECN so DCQCN reacts faster (and PFC fires less often)

What you should remember

  • PFC storms are the most common fabric-level failure. They propagate fast and the root is often one bad port.
  • Hash polarization is chronic, not acute. It accumulates lost throughput rather than crashing the job.
  • The "slow GPU" pattern accounts for huge production tax — config drift, bad cables, NUMA mismatches.
  • NCCL timeouts are rarely the network's fault; usually a downstream symptom of something else.
  • Half throughput is almost always GPUDirect not being active. Check nvidia_peermem first.
  • Intermittent packet loss — usually a flaky optic. Swap it.
  • Cascading PFC — find the first port to pause; that's the root.

Next: Incident Response Playbooks → what to do when one of these fires at 3 AM.