Incident Response Playbooks

You're paged at 3:14 AM. The customer is yelling on Slack about training jobs running 10× slower. You have three minutes before someone above you in the org chart starts asking questions.

This page is that playbook. For each of five common scenarios, it covers what to check first, what to do in the first five minutes, and when to escalate.


The general flow

Every incident follows the same four steps:

  1. Assess (60 seconds) — what's the blast radius? One job, one rail, the whole pod?
  2. Contain (5 minutes) — stop the bleeding. Drain a node, kill a misbehaving pod, fail over.
  3. Restore (15-30 minutes) — get the customer back to baseline throughput.
  4. Root-cause (post-incident) — find the underlying cause so it doesn't recur.

The goal of the on-call playbook is to make steps 1-3 fast and mechanical. Step 4 happens during business hours.


Playbook 1: PFC storm

Page reason: alert "pfc_pauses_per_sec > 10000 on N ports for 60 seconds"

Assess (60 sec):

# Which ports are pausing? Sort by recent rate.
gnmic -a switch01 get --path "/interfaces/interface[name=*]/state/counters/pfc-pause-frames-rx"

Identify the first port to start pausing in the window. That's usually the root.
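
If the single snapshot above isn't conclusive, sample the counter twice and diff to see which ports are pausing right now. A minimal sketch, assuming gnmic's flat output format; adjust the path and switch name to your environment.

# Sample the per-port pause counters twice, 10 seconds apart, and diff.
# Ports whose counters moved in that window are still actively pausing.
PFC_PATH="/interfaces/interface[name=*]/state/counters/pfc-pause-frames-rx"
gnmic -a switch01 get --path "$PFC_PATH" --format flat > /tmp/pfc_t0
sleep 10
gnmic -a switch01 get --path "$PFC_PATH" --format flat > /tmp/pfc_t1
diff /tmp/pfc_t0 /tmp/pfc_t1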

Contain (5 min):

  1. If a specific pod is the culprit (single port pausing first): kill the pod

    kubectl delete pod -n ai-training <pod-name>

    Fabric recovers in seconds.

  2. If a whole server (multiple ports from that server): drain the node

    kubectl drain <node> --ignore-daemonsets

    Reschedule pods elsewhere.

  3. If you can't identify a source, enable aggressive PFC watchdog:

    ! Arista example
    priority-flow-control watchdog action drop timer 50

    This trades drops for stability — better than a fabric-wide stall.

Restore (30 min):

  • Verify PFC counters are returning to baseline
  • Check the customer's training job restarted cleanly
  • Watch for recurrence in the next 15 minutes
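
To watch for recurrence, re-poll the same counter you used in the assess step. A sketch; the 30-second interval and 15-minute window are judgment calls, not magic numbers.

# Poll the pause counters every 30 seconds for roughly 15 minutes.
# If they keep climbing after containment, escalate instead of waiting.
for i in $(seq 1 30); do
  date
  gnmic -a switch01 get --path "/interfaces/interface[name=*]/state/counters/pfc-pause-frames-rx" --format flat
  sleep 30
done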

Escalate to L3 if:

  • The fabric is still paused 10 minutes after containment
  • You can't identify a root cause
  • Multiple pods/servers seem to be triggering it simultaneously (might be a fabric-wide issue, not a single bad actor)

Playbook 2: BGP flap

Page reason: alert "bgp_session_state changed N times in 5 minutes for spine-leaf session"

Assess (60 sec):

gnmic -a switch01 get --path "/network-instances/network-instance[name=default]/protocols/protocol[identifier=BGP]/bgp/neighbors/neighbor"

Which neighbor is flapping? One side, or both?
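
To answer that quickly, narrow the query to session state and flap count per neighbor. A sketch using standard OpenConfig leaves; exact paths may differ slightly on your platform.

# session-state shows where each neighbor is right now;
# established-transitions is effectively a flap counter.
gnmic -a switch01 get \
  --path "/network-instances/network-instance[name=default]/protocols/protocol[identifier=BGP]/bgp/neighbors/neighbor[neighbor-address=*]/state/session-state" \
  --path "/network-instances/network-instance[name=default]/protocols/protocol[identifier=BGP]/bgp/neighbors/neighbor[neighbor-address=*]/state/established-transitions"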

Contain (5 min):

  1. If hardware (one side's link going up/down): shut the link, hard-redirect traffic

    interface Ethernet 1/5
    shutdown

    ECMP will rebalance immediately if the rest of the fabric is healthy.

  2. If config (timer mismatch, AS misconfiguration after a change): roll back the last config change on that switch.

  3. If unclear: suppress the alarm temporarily (with team-lead approval) while you investigate.

Restore:

  • Verify ECMP fan-out is healthy with the bad link out
  • Customer's job should be unaffected if it had ≥1 healthy path

Escalate to L3 if:

  • Multiple BGP sessions flapping simultaneously (might be a fabric-wide control-plane issue)
  • ECMP rebalancing isn't happening fast enough — customer impact persists

Playbook 3: Single slow GPU / rail

Page reason: customer complaint or training-team alert "one rank consistently 2× slower than peers"

Assess (60 sec):

The training team has identified the rank. From there:

# Which pod/server is that rank?
kubectl get pod <pod-name> -n ai-training -o yaml | grep -E "nodeName|hostIP"

# Check that node's NIC counters
ssh <node>
ethtool -S enp01s0 | grep -E "(error|discard|timeout)"
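
If ethtool looks clean, the RDMA-side counters sometimes show what the Ethernet counters miss. A sketch assuming a RoCE NIC that exposes InfiniBand-style counters in sysfs; the device name mlx5_0 and port 1 are assumptions (list devices with ibv_devinfo).

# Print every RDMA port counter with its name; non-zero error, discard,
# or retry counters point at the link or the far end of it.
grep -H . /sys/class/infiniband/mlx5_0/ports/1/counters/*
grep -H . /sys/class/infiniband/mlx5_0/ports/1/hw_counters/* 2>/dev/null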

Contain (5 min):

  1. If clear NIC errors / drops: isolate the server by cordoning the node

    kubectl cordon <node>

    Customer reschedules the pod elsewhere; bad server gets drained for diagnosis.

  2. If no clear errors but persistently slow: suspect the cable/optic on a specific port

    ethtool <interface>       # link speed and link state
    ethtool -m <interface>    # optic DOM readout: RX/TX power, temperature (if the module reports it)
    # FEC and symbol-error counter names vary by driver; look for them in the ethtool -S output above

  3. If the GPU itself is slow (thermal throttling, ECC errors): kick it to the GPU team; it's not the network's problem yet.
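
Before handing item 3 to the GPU team, it takes a few seconds to confirm the GPU story. A sketch assuming NVIDIA GPUs with nvidia-smi on the node.

# Thermal throttling or uncorrected ECC errors make this a GPU ticket,
# not a network one.
nvidia-smi -q -d TEMPERATURE,ECC,PERFORMANCE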

Restore (15-30 min):

  • Customer reschedules pod on a healthy node
  • Bad node goes into diagnostic queue

Escalate to L3 if:

  • The "slow rank" pattern persists across nodes — suggests a fabric-level issue (specific rail's QoS misconfig, hash polarization, etc.)
  • Multiple customers complaining of similar patterns simultaneously

Playbook 4: Hash polarization (chronic)

Page reason: dashboard shows ECMP imbalance >40% on a member port for >30 minutes

Assess (60 sec):

  • Which ECMP group? Which member is overloaded?
  • Are all RoCE flows polarizing to the same path, or just some?
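
To see the imbalance directly, compare transmit counters across the ECMP member ports. A sketch; the interface names are placeholders for this leaf's uplinks, and since out-octets is cumulative you'll want to sample twice and diff for a rate.

# Pull the transmit byte counter for each uplink in the ECMP group.
for intf in Ethernet1/49 Ethernet1/50 Ethernet1/51 Ethernet1/52; do
  gnmic -a leaf01 get \
    --path "/interfaces/interface[name=$intf]/state/counters/out-octets" \
    --format flat
done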

Contain (5 min):

This isn't acute — you're not going to fix it in 5 minutes. The job is to stop the bleed:

  1. Verify ECMP hash includes UDP src port. If not, push a fix.
  2. Increase NCCL_IB_QPS_PER_CONNECTION for the customer's job (requires a job restart; see the sketch after this list).
  3. If specific 5-tuples are polarizing: identify which flows, work with the customer to adjust their connection strategy.
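
For step 2, a minimal sketch of the knobs involved; the values are illustrative, workload-dependent, and only take effect when the job restarts.

# More queue pairs per connection means more distinct UDP source ports,
# which gives the ECMP hash more entropy to spread flows across.
export NCCL_IB_QPS_PER_CONNECTION=4
export NCCL_IB_SPLIT_DATA_ON_QPS=1   # actually stripe data across those QPs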

Restore (overnight):

  • Roll out a real fix: ECMP hash tuning, more entropy in NIC source-port selection, or (the slowest option) a migration to higher-radix switches.

Escalate to L3 if:

  • Polarization is happening on multiple ECMP groups simultaneously (might indicate a vendor firmware bug)
  • Customer impact is significant (>20% throughput loss persistently)

Playbook 5: NCCL timeouts everywhere

Page reason: multiple training jobs report NCCL_TIMEOUT in the same 10-minute window

Assess (60 sec):

This is fabric-wide, urgent. Top priority.

  1. Check fabric overview: BGP up? PFC sane? Any link down?
  2. Check switch logs for any common event in the affected window
  3. Check NIC firmware/driver version: did anything roll out today?
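
For item 3, a quick way to spot driver or firmware skew across the affected nodes. A sketch; the node list, the interface name, and the loop-over-ssh approach stand in for whatever fleet tooling you actually have.

# A node whose driver or firmware version differs from its peers is a
# strong candidate for "what changed today".
for node in node01 node02 node03; do
  echo "== $node"
  ssh "$node" "ethtool -i enp01s0"
done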

Contain (5 min):

  1. If a switch is the common factor: fail away from it — drain hosts behind it, or reroute around it
  2. If a recent fabric config push: revert immediately
  3. If a recent NIC driver update: roll back the deployment

Restore (15-30 min):

  • Customer jobs should auto-restart from checkpoints
  • Verify they progress at expected speed
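
A quick sanity check that the jobs actually came back, assuming the same ai-training namespace used in the earlier playbooks.

# Anything not Running or Completed after the restart window deserves a look.
kubectl get pods -n ai-training -o wide | grep -vE "Running|Completed"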

Escalate to L3 if:

  • You can't identify a common factor in 10 minutes
  • The fabric stays unstable after the obvious fixes

Three rules to remember at 3 AM

  1. Identify before fixing. A wrong "fix" can make things worse. Spend the 60 seconds to assess.
  2. Contain before restoring. Stop the bleed first. Restore to baseline second. Root-cause later.
  3. It's almost always config or hardware, rarely a software bug. Look at what changed first.

What you should remember

  • Every incident has four steps: assess, contain, restore, root-cause. Most of on-call mastery is making steps 1-3 fast and mechanical.
  • PFC storms, BGP flaps, slow GPUs, hash polarization, and NCCL timeouts are the five patterns you'll see most.
  • Each has a 5-minute containment action. Memorize them or have them in a runbook tab open.
  • Escalate when the playbook doesn't fit. Better to wake your L3 than make it worse.
  • Document the incident within 24 hours — even if you "just rebooted the switch." Patterns matter.

That's the end of the curriculum's training-fabric story. From Transport & Congestion Control through Production Operations, you have the full mental model — protocols, hardware, RDMA, fabric design, deployment, configuration, host integration, inference, and ops. Head back to the curriculum index for the full map, or jump into whichever section you want to deepen.