18.1 What to Monitor
The signals that actually matter for an AI fabric: golden signals, per-priority counters, NIC-side RDMA telemetry, and the dashboards a network engineer needs to keep open.
18.2 Common Failure Modes
The failures you'll actually see in production AI fabrics — PFC storms, hash polarization, slow links, NIC errors, NCCL timeouts. Symptoms, root causes, and what to look at first.
18.3 Incident Response Playbooks
When the page fires at 3 AM, you don't have time to think from scratch. These are the playbooks: assess, contain, restore, root-cause. Five concrete scenarios, written for the on-call engineer who has three minutes to act.
18.4 When Training Slows
MFU as your readout, the four-step diagnosis ladder, three patterns you'll actually find, and one RCA that ends with one env var.
18.5 RoCE v2 Operator Cheatsheet
The commands you'll actually type. Box identity, NIC inventory, ibv_devinfo decoded, driver/firmware stack, SR-IOV lifecycle, GID tables, lossless config, RDMA counters, multi-rail routing, perftest benchmarking, NCCL env vars, pre-flight checks, troubleshooting recipes. Single-page reference.