15.1 Design 1 — Kubernetes + SR-IOV + RoCE
The 15-layer build of a multi-tenant Kubernetes AI training cluster. SR-IOV virtual functions give each pod a dedicated hardware NIC path, Multus attaches the multiple NICs to each pod, and RoCE handles RDMA over lossless Ethernet. The most modern and most flexible of the HPC designs.
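As a rough illustration of how a pod asks for its SR-IOV-backed interfaces, the sketch below builds a Pod manifest as a Python dict. The annotation key `k8s.v1.cni.cncf.io/networks` is the one Multus reads. Everything else is an assumption made for illustration: the network names `roce-net-a` and `roce-net-b`, the resource name `example.com/sriov_roce_vf` (the real name depends on how the SR-IOV device plugin is configured), and the image path.

```python
import json


def build_training_pod(name, image, networks, vf_resource, vf_count):
    """Return a Pod manifest (as a dict) that requests secondary networks
    from Multus and virtual functions from an SR-IOV device plugin."""
    # Kubernetes extended resources are requested as string quantities.
    limits = {vf_resource: str(vf_count)}
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": name,
            "annotations": {
                # Multus attaches one extra interface per listed network.
                "k8s.v1.cni.cncf.io/networks": ",".join(networks),
            },
        },
        "spec": {
            "containers": [
                {
                    "name": "trainer",
                    "image": image,
                    "resources": {"limits": limits, "requests": dict(limits)},
                }
            ],
        },
    }


# All names below are hypothetical placeholders, not values from this guide.
pod = build_training_pod(
    name="trainer-0",
    image="registry.example.com/trainer:latest",
    networks=["roce-net-a", "roce-net-b"],
    vf_resource="example.com/sriov_roce_vf",
    vf_count=2,
)
print(json.dumps(pod, indent=2))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`, since Kubernetes accepts JSON manifests as well as YAML.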
15.2 Design 2 — Bare Metal + Slurm + RoCE
Classic HPC pattern. No containers, no Kubernetes — Slurm schedules jobs directly on bare-metal nodes. Lowest software overhead, highest performance ceiling. The model national labs run.
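To make "Slurm schedules jobs directly on bare-metal nodes" concrete, here is a minimal sketch that renders a batch script for `sbatch`. The `#SBATCH` options used (`--job-name`, `--nodes`, `--ntasks-per-node`, `--exclusive`) are standard Slurm options. The job name, node counts, and the `python train.py` command are hypothetical placeholders.

```python
def render_sbatch(job_name, nodes, tasks_per_node, command):
    """Render a minimal Slurm batch script as a string."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks-per-node={tasks_per_node}",
        # Ask for whole nodes so no other job shares the NICs or GPUs.
        "#SBATCH --exclusive",
        "",
        # srun starts one task per allocated slot on the bare-metal nodes.
        f"srun {command}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical job: 4 nodes, 8 tasks per node.
script = render_sbatch("train", nodes=4, tasks_per_node=8, command="python train.py")
print(script)
```

Writing the result to a file such as `train.sbatch` and submitting it with `sbatch train.sbatch` would be the usual next step.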
15.3 Design 3 — Kubernetes + Physical NIC + RoCE
Kubernetes without SR-IOV. Each node has one physical NIC shared by all pods through the host networking stack. Simpler to set up, but with weaker isolation between pods on the fabric.
15.4 Design 4 — Bare Metal + MPI + RoCE
Pure-MPI bare-metal design. No job scheduler. Direct mpirun across hosts over RoCE. The simplest possible HPC setup — and what most NCCL tutorials assume underneath.
15.5 Design 5 — Hybrid: K8s + Slurm + RoCE
Production-realistic hybrid. Kubernetes runs inference, dashboards, and CI. Slurm runs the bare-metal HPC training jobs. Same RoCE fabric, both worlds, one cluster footprint.