What to Monitor
The signals that actually matter for an AI fabric: golden signals, per-priority counters, NIC-side RDMA telemetry, and the dashboards a network engineer needs to keep open.
Common Failure Modes
The failures you'll actually see in production AI fabrics — PFC storms, hash polarization, slow links, NIC errors, NCCL timeouts. Symptoms, root causes, and what to look at first.
Incident Response Playbooks
When the page fires at 3 AM, you don't have time to think from scratch. These are the playbooks — assess, contain, restore, root-cause. Five concrete scenarios, written for the on-call engineer with three minutes.