Incident Response Playbooks
You're paged at 3:14 AM. The customer is yelling on Slack about training jobs running 10× slower. You have three minutes before someone above you in the org chart starts asking questions.
This page is the playbook. For each of five common scenarios: what to check first, what to do in the first five minutes, and when to escalate.
The general flow
Every incident follows the same four steps:
- Assess (60 seconds) — what's the blast radius? One job, one rail, the whole pod?
- Contain (5 minutes) — stop the bleeding. Drain a node, kill a misbehaving pod, fail over.
- Restore (15-30 minutes) — get the customer back to baseline throughput.
- Root-cause (post-incident) — find the underlying cause so it doesn't recur.
The goal of the on-call playbook is to make steps 1-3 fast and mechanical. Step 4 happens during business hours.
Playbook 1: PFC storm
Page reason: alert "pfc_pauses_per_sec > 10000 on N ports for 60 seconds"
Assess (60 sec):
# Which ports are pausing? Sort by recent rate.
gnmic -a switch01 get --path "/interfaces/interface[name=*]/state/counters/pfc-pause-frames-rx"
Identify the first port to start pausing in the window. That's usually the root.
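If a one-shot get isn't enough to tell who started it, you can stream the same counter and watch whose count climbs fastest. A minimal sketch, assuming the gNMI path above is supported on your platform (switch name and interval are illustrative):
# Sample the PFC pause counter every 10 seconds; the first port whose count
# starts climbing in the window is usually the source.
gnmic -a switch01 subscribe \
  --path "/interfaces/interface[name=*]/state/counters/pfc-pause-frames-rx" \
  --mode stream --stream-mode sample --sample-interval 10s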
Contain (5 min):
- If a specific pod is the culprit (single port pausing first): kill the pod (port-to-pod mapping is sketched after this list).
  kubectl delete pod -n ai-training <pod-name>
  The fabric recovers in seconds.
- If a whole server is involved (multiple ports from that server pausing): drain the node.
  kubectl drain <node> --ignore-daemonsets
  Its pods reschedule elsewhere.
- If you can't identify a source: enable an aggressive PFC watchdog.
  ! Arista example
  priority-flow-control watchdog action drop timer 50
  This trades drops for stability, which beats a fabric-wide stall.
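Mapping a pausing switch port back to a pod is the step that eats the most time at 3 AM. A hedged sketch, assuming LLDP is enabled on the fabric and server names in LLDP match Kubernetes node names (the port name is illustrative; your environment may need a lookup table instead):
# On the switch: which server hangs off the pausing port?
ssh switch01 "show lldp neighbors" | grep Ethernet27
# On the cluster: which training pods run on that server?
kubectl get pods -n ai-training -o wide --field-selector spec.nodeName=<node>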
Restore (30 min):
- Verify PFC counters are returning to baseline
- Check the customer's training job restarted cleanly
- Watch for recurrence in the next 15 minutes
Escalate to L3 if:
- The fabric is still paused 10 minutes after containment
- You can't identify a root cause
- Multiple pods/servers seem to be triggering it simultaneously (might be a fabric-wide issue, not a single bad actor)
Playbook 2: BGP flap
Page reason: alert "bgp_session_state changed N times in 5 minutes for spine-leaf session"
Assess (60 sec):
gnmic -a switch01 get --path "/network-instances/network-instance[name=default]/protocols/protocol[identifier=BGP]/bgp/neighbors/neighbor"
Which neighbor is flapping? One side, or both?
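To pin down the flapper without eyeballing the full neighbor tree, you can query just the session state and the flap count. A sketch assuming the standard OpenConfig BGP leaves session-state and established-transitions are populated on your platform:
# Per-neighbor session state plus how many times each session has re-established
gnmic -a switch01 get \
  --path "/network-instances/network-instance[name=default]/protocols/protocol[identifier=BGP]/bgp/neighbors/neighbor[neighbor-address=*]/state/session-state" \
  --path "/network-instances/network-instance[name=default]/protocols/protocol[identifier=BGP]/bgp/neighbors/neighbor[neighbor-address=*]/state/established-transitions"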
Contain (5 min):
- If it's hardware (one side's link going up and down): shut the link and hard-redirect traffic.
  interface Ethernet1/5
    shutdown
  ECMP rebalances immediately if the rest of the fabric is healthy.
- If it's config (timer mismatch, AS misconfiguration after a change): roll back the last config change on that switch.
- If unclear: suppress the alarm temporarily (with team-lead approval) while you investigate.
Restore:
- Verify ECMP fan-out is healthy with the bad link out
- Customer's job should be unaffected if it had ≥1 healthy path
Escalate to L3 if:
- Multiple BGP sessions flapping simultaneously (might be a fabric-wide control-plane issue)
- ECMP rebalancing isn't happening fast enough — customer impact persists
Playbook 3: Single slow GPU / rail
Page reason: customer complaint or training-team alert "one rank consistently 2× slower than peers"
Assess (60 sec):
The training team has identified the rank. From there:
# Which pod/server is that rank?
kubectl get pod <pod-name> -n ai-training -o yaml | grep -E "nodeName|hostIP"
# Check that node's NIC counters
ssh <node>
ethtool -S enp01s0 | grep -E "(error|discard|timeout)"
Contain (5 min):
- If there are clear NIC errors or drops: isolate the server by cordoning the node.
  kubectl cordon <node>
  The customer reschedules the pod elsewhere; the bad server gets drained for diagnosis.
- If there are no clear errors but the rank stays slow: suspect the cable or optic on a specific port.
  ethtool <interface>          # link speed and state
  ethtool -m <interface>       # optical RX/TX power levels (module DOM data)
  ethtool -S <interface> | grep -i fec    # FEC error counters (names vary by driver)
- If the GPU itself is slow (thermals, ECC errors): kick it to the GPU team; it's not the network's problem yet (a quick check is sketched after this list).
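Before handing off, a 30-second check saves a bounce-back if it is the network after all. A sketch assuming nvidia-smi is available on the node; temperatures near the slowdown threshold or nonzero ECC counts point at the GPU, clean output points back at the fabric:
ssh <node>
# Field names vary by driver version; look for temps near the slowdown limit
# and for any nonzero aggregate ECC error counters.
nvidia-smi -q -d TEMPERATURE,ECC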
Restore (15-30 min):
- Customer reschedules pod on a healthy node
- Bad node goes into diagnostic queue
Escalate to L3 if:
- The "slow rank" pattern persists across nodes — suggests a fabric-level issue (specific rail's QoS misconfig, hash polarization, etc.)
- Multiple customers complaining of similar patterns simultaneously
Playbook 4: Hash polarization (chronic)
Page reason: dashboard shows ECMP imbalance >40% on a member port for >30 minutes
Assess (60 sec):
- Which ECMP group? Which member is overloaded?
- Are all RoCE flows polarizing to the same path, or just some? (A quick way to compare member utilization is sketched below.)
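One rough way to answer both questions: stream out-octets for each member of the suspect group and compare growth rates over a short window. The port names below are illustrative; substitute the actual ECMP members:
gnmic -a switch01 subscribe \
  --path "/interfaces/interface[name=Ethernet1]/state/counters/out-octets" \
  --path "/interfaces/interface[name=Ethernet2]/state/counters/out-octets" \
  --path "/interfaces/interface[name=Ethernet3]/state/counters/out-octets" \
  --mode stream --stream-mode sample --sample-interval 10s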
Contain (5 min):
This isn't acute — you're not going to fix it in 5 minutes. The job is to stop the bleed:
- Verify ECMP hash includes UDP src port. If not, push a fix.
- Increase NCCL_IB_QPS_PER_CONNECTION for the customer's job (requires a job restart); more queue pairs per connection means more source ports and better ECMP spreading (one way to set it is sketched after this list).
- If specific 5-tuples are polarizing: identify which flows and work with the customer to adjust their connection strategy.
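A hedged sketch for applying the NCCL change, assuming the job runs as a Deployment named trainer in the ai-training namespace (StatefulSets and Jobs need their manifests edited instead):
# More queue pairs per connection means more distinct source ports, hence more ECMP entropy.
kubectl set env deployment/trainer -n ai-training NCCL_IB_QPS_PER_CONNECTION=4
# The rollout restarts the pods, which is the "requires a job restart" caveat above.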
Restore (overnight):
- Roll out a real fix: ECMP hash tuning, more entropy in NIC source-port selection, or migration to a higher-radix switch (the slowest, most disruptive option).
Escalate to L3 if:
- Polarization is happening on multiple ECMP groups simultaneously (might indicate a vendor firmware bug)
- Customer impact is significant (>20% throughput loss persistently)
Playbook 5: NCCL timeouts everywhere
Page reason: multiple training jobs report NCCL_TIMEOUT in the same 10-minute window
Assess (60 sec):
This is fabric-wide, urgent. Top priority.
- Check fabric overview: BGP up? PFC sane? Any link down?
- Check switch logs for any common event in the affected window
- Check NIC firmware/driver versions: did anything roll out today? (A quick check is sketched below.)
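A quick way to compare NIC driver and firmware versions across a handful of affected nodes (node and interface names are illustrative):
for n in node01 node02 node03; do
  echo "== $n"
  ssh "$n" "ethtool -i enp01s0 | grep -E 'driver|version'"
done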
Contain (5 min):
- If a switch is the common factor: fail away from it — drain hosts behind it, or reroute around it
- If a recent fabric config push: revert immediately
- If a recent NIC driver update: roll back the deployment (one way is sketched below)
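If the NIC driver ships as a DaemonSet (common with operator-managed driver stacks; the name below is hypothetical), the rollback is one command:
kubectl rollout undo daemonset/nic-driver -n kube-system
# Watch the rollout and confirm NCCL timeouts stop on nodes that have already reverted.
kubectl rollout status daemonset/nic-driver -n kube-system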
Restore (15-30 min):
- Customer jobs should auto-restart from checkpoints
- Verify they progress at expected speed
Escalate to L3 if:
- You can't identify a common factor in 10 minutes
- The fabric stays unstable after the obvious fixes
Three rules to remember at 3 AM
- Identify before fixing. A wrong "fix" can make things worse. Spend the 60 seconds to assess.
- Contain before restoring. Stop the bleed first. Restore to baseline second. Root-cause later.
- It's almost always config or hardware, rarely a software bug. Look at what changed first.
What you should remember
- Every incident has 4 steps: assess, contain, restore, root-cause. Most of on-call mastery is making 1-3 mechanical.
- PFC storms, BGP flaps, slow GPUs, hash polarization, and NCCL timeouts are the five patterns you'll see most.
- Each has a 5-minute containment action. Memorize them or have them in a runbook tab open.
- Escalate when the playbook doesn't fit. Better to wake your L3 than make it worse.
- Document the incident within 24 hours — even if you "just rebooted the switch." Patterns matter.
That's the end of the curriculum's training-fabric story. From Transport & Congestion Control through Production Operations, you have the full mental model — protocols, hardware, RDMA, fabric design, deployment, configuration, host integration, inference, and ops. Head back to the curriculum index for the full map, or jump into whichever section you want to deepen.