19.1 Sizing & Bill of Materials
What to actually buy — pod sizing tables, GPU server choices, switch choices, cabling, and the supporting networks (storage, management, OOB).
19.2 Configure the Fabric
Step-by-step switch configuration: BGP underlay, QoS classification, PFC, ECN marking via WRED, and buffer profiles, shown side by side for Arista EOS, Cisco NX-OS, Juniper Junos, and NVIDIA Spectrum.
19.3 Configure the Hosts + Kubernetes
From bare metal to running pods. BIOS, kernel command line, hugepages, driver install, GPU Operator + Network Operator, Multus, SR-IOV CNI, NetworkAttachmentDefinitions, and the pod spec template.
19.4 Validate & Run the First Training Job
The validation pyramid — prove the cluster works at every layer (links → BGP → PFC → ib_write_bw → nccl-tests → training step time) before trusting it with a real job.