Sizing & Bill of Materials
What to actually buy โ pod sizing tables, GPU server choices, switch choices, cabling, and the supporting networks (storage, management, OOB).
Configure the Fabric
Step-by-step switch configuration โ BGP underlay, QoS classification, PFC enable, ECN WRED, buffer profiles. Arista and NVIDIA Spectrum side by side.
Configure the Hosts + Kubernetes
From bare metal to running pods. BIOS, kernel command line, hugepages, driver install, GPU Operator + Network Operator, Multus, SR-IOV CNI, NetworkAttachmentDefinitions, and the pod spec template.
Validate & Run the First Training Job
The validation pyramid โ prove the cluster works at every layer (links โ BGP โ PFC โ ib_write_bw โ nccl-tests โ training step time) before trusting it with a real job.