12.1 SR-IOV Mechanics
How SR-IOV actually works — Physical Functions, Virtual Functions, IOMMU, what the BIOS / kernel / driver need to agree on, and the most common misconfigurations.
12.2 Multus and Multi-NIC Pods
How a Kubernetes pod gets multiple network interfaces: one for the Kubernetes control plane, eight for RDMA rails. NetworkAttachmentDefinitions, pod annotations, and the YAML you'll actually write.
12.3 NCCL and GPUDirect Configuration
How NCCL picks NICs, how GPUDirect RDMA makes NIC ↔ GPU memory transfers zero-copy, and the environment variables that decide whether training runs at full speed or half.
12.4 Host-Side Lossless — mlnx_qos, sysfs, and the Counter Reference Card
The host half of lossless RoCE — mlnx_qos for PFC/ETS, trust-mode dscp on the NIC, ring buffers, the NCCL_IB_TC=106 math, and the canonical RDMA counter reference (hw_counters vs counters, what each path means, the pre/post-test diff pattern). Pair with the switch-side config in Switch QoS.
12.5 Multi-Rail Source Routing on the Host
Why default Linux routing breaks multi-rail RoCE (all traffic exits one NIC), the 256-routing-table architecture you'll use to fix it, the per-NIC source-rule pattern, the ARP flux and rp_filter gotchas, and the end-to-end config that makes 4-rail hosts actually pump 1.5+ Tbps.
12.6 Provisioning the GPU Host
A worked example of the host-side automation that turns a bare GPU server into a fabric-ready RoCE endpoint — driver stack, OFED, GPUDirect, PCIe ACS, NVSwitch fabric manager, and DCGM telemetry — mapped to a real Puppet module layout.