Incident Response Playbooks

You're paged at 3:14 AM. The customer is yelling on Slack about training jobs running 10× slower. You have three minutes before someone above you in the org chart starts asking questions.

This page is that playbook. For each of five common scenarios, it covers what to check first, what to do in the first five minutes, and when to escalate.


The general flow

Every incident follows the same four steps:

  1. Assess (60 seconds) — what's the blast radius? One job, one rail, the whole pod?
  2. Contain (5 minutes) — stop the bleeding. Drain a node, kill a misbehaving pod, fail over.
  3. Restore (15-30 minutes) — get the customer back to baseline throughput.
  4. Root-cause (post-incident) — find the underlying cause so it doesn't recur.

The goal of the on-call playbook is to make steps 1-3 fast and mechanical. Step 4 happens during business hours.


Playbook 1: PFC storm

Page reason: alert "pfc_pauses_per_sec > 10000 on N ports for 60 seconds"

Assess (60 sec):

# Which ports are pausing? Sort by recent rate.
gnmic -a switch01 get --path "/interfaces/interface[name=*]/state/counters/pfc-pause-frames-rx"

Identify the first port to start pausing in the window. That's usually the root.
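
If the single snapshot above isn't conclusive, sample the counter twice and diff to see which ports are pausing right now. A minimal sketch, assuming gnmic's flat output format; adjust the path and switch name to your environment.

# Sample the per-port pause counters twice, 10 seconds apart, and diff.
# Ports whose counters moved in that window are still actively pausing.
PFC_PATH="/interfaces/interface[name=*]/state/counters/pfc-pause-frames-rx"
gnmic -a switch01 get --path "$PFC_PATH" --format flat > /tmp/pfc_t0
sleep 10
gnmic -a switch01 get --path "$PFC_PATH" --format flat > /tmp/pfc_t1
diff /tmp/pfc_t0 /tmp/pfc_t1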

Contain (5 min):

  1. If a specific pod is the culprit (single port pausing first): kill the pod

    kubectl delete pod -n ai-training <pod-name>

    Fabric recovers in seconds.

  2. If a whole server (multiple ports from that server): drain the node

    kubectl drain <node> --ignore-daemonsets

    Reschedule pods elsewhere.

  3. If you can't identify a source, enable aggressive PFC watchdog:

    ! Arista example
    priority-flow-control watchdog action drop timer 50

    This trades drops for stability — better than a fabric-wide stall.

Restore (30 min):

  • Verify PFC counters are returning to baseline
  • Check the customer's training job restarted cleanly
  • Watch for recurrence in the next 15 minutes
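
To watch for recurrence, re-poll the same counter you used in the assess step. A sketch; the 30-second interval and 15-minute window are judgment calls, not magic numbers.

# Poll the pause counters every 30 seconds for roughly 15 minutes.
# If they keep climbing after containment, escalate instead of waiting.
for i in $(seq 1 30); do
  date
  gnmic -a switch01 get --path "/interfaces/interface[name=*]/state/counters/pfc-pause-frames-rx" --format flat
  sleep 30
done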

Escalate to L3 if:

  • The fabric is still paused 10 minutes after containment
  • You can't identify a root cause
  • Multiple pods/servers seem to be triggering it simultaneously (might be a fabric-wide issue, not a single bad actor)

Playbook 2: BGP flap

Page reason: alert "bgp_session_state changed N times in 5 minutes for spine-leaf session"

Assess (60 sec):

gnmic -a switch01 get --path "/network-instances/network-instance[name=default]/protocols/protocol[identifier=BGP]/bgp/neighbors/neighbor"

Which neighbor is flapping? One side, or both?
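
To answer that quickly, narrow the query to session state and flap count per neighbor. A sketch using standard OpenConfig leaves; exact paths may differ slightly on your platform.

# session-state shows where each neighbor is right now;
# established-transitions is effectively a flap counter.
gnmic -a switch01 get \
  --path "/network-instances/network-instance[name=default]/protocols/protocol[identifier=BGP]/bgp/neighbors/neighbor[neighbor-address=*]/state/session-state" \
  --path "/network-instances/network-instance[name=default]/protocols/protocol[identifier=BGP]/bgp/neighbors/neighbor[neighbor-address=*]/state/established-transitions"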

Contain (5 min):

  1. If hardware (one side's link going up/down): shut the link, hard-redirect traffic

    interface Ethernet 1/5
    shutdown

    ECMP will rebalance immediately if the rest of the fabric is healthy.

  2. If config (timer mismatch, AS misconfiguration after a change): roll back the last config change on that switch.

  3. If unclear: suppress the alarm temporarily (with team-lead approval) while you investigate.

Restore:

  • Verify ECMP fan-out is healthy with the bad link out
  • Customer's job should be unaffected if it had ≥1 healthy path

Escalate to L3 if:

  • Multiple BGP sessions flapping simultaneously (might be a fabric-wide control-plane issue)
  • ECMP rebalancing isn't happening fast enough — customer impact persists

Playbook 3: Single slow GPU / rail

Page reason: customer complaint or training-team alert "one rank consistently 2× slower than peers"

Assess (60 sec):

The training team has identified the rank. From there:

# Which pod/server is that rank?
kubectl get pod <pod-name> -n ai-training -o yaml | grep -E "nodeName|hostIP"

# Check that node's NIC counters
ssh <node>
ethtool -S enp01s0 | grep -E "(error|discard|timeout)"
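
If ethtool looks clean, the RDMA-side counters sometimes show what the Ethernet counters miss. A sketch assuming a RoCE NIC that exposes InfiniBand-style counters in sysfs; the device name mlx5_0 and port 1 are assumptions (list devices with ibv_devinfo).

# Print every RDMA port counter with its name; non-zero error, discard,
# or retry counters point at the link or the far end of it.
grep -H . /sys/class/infiniband/mlx5_0/ports/1/counters/*
grep -H . /sys/class/infiniband/mlx5_0/ports/1/hw_counters/* 2>/dev/null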

Contain (5 min):

  1. If clear NIC errors / drops: isolate the server by cordoning the node

    kubectl cordon <node>

    Customer reschedules the pod elsewhere; bad server gets drained for diagnosis.

  2. If no clear errors but persistently slow: suspect the cable/optic on a specific port

    ethtool <interface>       # link speed and link state
    ethtool -m <interface>    # optic DOM readout: RX/TX power, temperature (if the module reports it)
    # FEC and symbol-error counter names vary by driver; look for them in the ethtool -S output above

  3. If the GPU itself is slow (thermal throttling, ECC errors): kick it to the GPU team; it's not the network's problem yet.
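
Before handing item 3 to the GPU team, it takes a few seconds to confirm the GPU story. A sketch assuming NVIDIA GPUs with nvidia-smi on the node.

# Thermal throttling or uncorrected ECC errors make this a GPU ticket,
# not a network one.
nvidia-smi -q -d TEMPERATURE,ECC,PERFORMANCE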

Restore (15-30 min):

  • Customer reschedules pod on a healthy node
  • Bad node goes into diagnostic queue

Escalate to L3 if:

  • The "slow rank" pattern persists across nodes — suggests a fabric-level issue (specific rail's QoS misconfig, hash polarization, etc.)
  • Multiple customers complaining of similar patterns simultaneously

Playbook 4: Hash polarization (chronic)

Page reason: dashboard shows ECMP imbalance >40% on a member port for >30 minutes

Assess (60 sec):

  • Which ECMP group? Which member is overloaded?
  • Are all RoCE flows polarizing to the same path, or just some?
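
To see the imbalance directly, compare transmit counters across the ECMP member ports. A sketch; the interface names are placeholders for this leaf's uplinks, and since out-octets is cumulative you'll want to sample twice and diff for a rate.

# Pull the transmit byte counter for each uplink in the ECMP group.
for intf in Ethernet1/49 Ethernet1/50 Ethernet1/51 Ethernet1/52; do
  gnmic -a leaf01 get \
    --path "/interfaces/interface[name=$intf]/state/counters/out-octets" \
    --format flat
done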

Contain (5 min):

This isn't acute — you're not going to fix it in 5 minutes. The job is to stop the bleed:

  1. Verify ECMP hash includes UDP src port. If not, push a fix.
  2. Increase NCCL_IB_QPS_PER_CONNECTION for the customer's job (requires a job restart; see the sketch after this list).
  3. If specific 5-tuples are polarizing: identify which flows, work with the customer to adjust their connection strategy.
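
For step 2, a minimal sketch of the knobs involved; the values are illustrative, workload-dependent, and only take effect when the job restarts.

# More queue pairs per connection means more distinct UDP source ports,
# which gives the ECMP hash more entropy to spread flows across.
export NCCL_IB_QPS_PER_CONNECTION=4
export NCCL_IB_SPLIT_DATA_ON_QPS=1   # actually stripe data across those QPs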

Restore (overnight):

  • Roll out a real fix: ECMP hash tuning, more entropy in NIC source-port selection, or (the slowest option) a migration to higher-radix switches.

Escalate to L3 if:

  • Polarization is happening on multiple ECMP groups simultaneously (might indicate a vendor firmware bug)
  • Customer impact is significant (>20% throughput loss persistently)

Playbook 5: NCCL timeouts everywhere

Page reason: multiple training jobs report NCCL_TIMEOUT in the same 10-minute window

Assess (60 sec):

This is fabric-wide, urgent. Top priority.

  1. Check fabric overview: BGP up? PFC sane? Any link down?
  2. Check switch logs for any common event in the affected window
  3. Check NIC firmware/driver version: did anything roll out today?
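
For item 3, a quick way to spot driver or firmware skew across the affected nodes. A sketch; the node list, the interface name, and the loop-over-ssh approach stand in for whatever fleet tooling you actually have.

# A node whose driver or firmware version differs from its peers is a
# strong candidate for "what changed today".
for node in node01 node02 node03; do
  echo "== $node"
  ssh "$node" "ethtool -i enp01s0"
done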

Contain (5 min):

  1. If a switch is the common factor: fail away from it — drain hosts behind it, or reroute around it
  2. If a recent fabric config push: revert immediately
  3. If a recent NIC driver update: roll back the deployment

Restore (15-30 min):

  • Customer jobs should auto-restart from checkpoints
  • Verify they progress at expected speed
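
A quick sanity check that the jobs actually came back, assuming the same ai-training namespace used in the earlier playbooks.

# Anything not Running or Completed after the restart window deserves a look.
kubectl get pods -n ai-training -o wide | grep -vE "Running|Completed"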

Escalate to L3 if:

  • You can't identify a common factor in 10 minutes
  • The fabric stays unstable after the obvious fixes

Three rules to remember at 3 AM

  1. Identify before fixing. A wrong "fix" can make things worse. Spend the 60 seconds to assess.
  2. Contain before restoring. Stop the bleed first. Restore to baseline second. Root-cause later.
  3. It's almost always config or hardware, rarely a software bug. Look at what changed first.

What you should remember

  • Every incident has four steps: assess, contain, restore, root-cause. Most of on-call mastery is making steps 1-3 fast and mechanical.
  • PFC storms, BGP flaps, slow GPUs, hash polarization, and NCCL timeouts are the five patterns you'll see most.
  • Each has a 5-minute containment action. Memorize them or have them in a runbook tab open.
  • Escalate when the playbook doesn't fit. Better to wake your L3 than make it worse.
  • Document the incident within 24 hours — even if you "just rebooted the switch." Patterns matter.

That's the end of the curriculum's training-fabric story. From Transport & Congestion Control through Production Operations, you have the full mental model — protocols, hardware, RDMA, fabric design, deployment, configuration, host integration, inference, and ops. Head back to the curriculum index for the full map, or jump into whichever section you want to deepen.