Design 1 — Kubernetes + SR-IOV + RoCE
The 15-layer build of a multi-tenant Kubernetes AI training cluster. SR-IOV virtual functions give each pod a dedicated hardware NIC path, Multus attaches multiple NICs to each pod, and RoCE carries RDMA traffic over lossless Ethernet. The most modern and flexible of the five designs.
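As a rough illustration of how a pod asks for SR-IOV interfaces through Multus, the sketch below builds a pod manifest as a Python dict. The annotation key `k8s.v1.cni.cncf.io/networks` is the one Multus reads. Everything else is an assumption made for the example: the function name, the network attachment names, the image, and the resource name `example.com/sriov_roce`, which in a real cluster is whatever the SR-IOV device plugin is configured to advertise.

```python
import json


def sriov_pod_manifest(name, image, networks, resource_name, vf_count):
    """Build a pod manifest that requests extra SR-IOV interfaces via Multus.

    networks      -- names of NetworkAttachmentDefinition objects (hypothetical here)
    resource_name -- extended resource advertised by the SR-IOV device plugin
                     (cluster-specific; the value used below is made up)
    vf_count      -- number of virtual functions to allocate to the pod
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": name,
            "annotations": {
                # Multus attaches one secondary interface per listed network.
                "k8s.v1.cni.cncf.io/networks": ",".join(networks),
            },
        },
        "spec": {
            "containers": [
                {
                    "name": "trainer",
                    "image": image,
                    "resources": {
                        # Extended resources need matching requests and limits.
                        "requests": {resource_name: str(vf_count)},
                        "limits": {resource_name: str(vf_count)},
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    pod = sriov_pod_manifest(
        name="train-worker-0",
        image="registry.example.com/trainer:latest",  # hypothetical image
        networks=["roce-net-a", "roce-net-b"],        # hypothetical attachments
        resource_name="example.com/sriov_roce",       # hypothetical resource
        vf_count=2,
    )
    print(json.dumps(pod, indent=2))
```

The printed JSON can be fed to `kubectl apply -f -` once the names are replaced with ones that exist in the target cluster.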
Design 2 — Bare Metal + Slurm + RoCE
Classic HPC pattern. No containers and no Kubernetes: Slurm schedules jobs directly on bare-metal nodes. Lowest software overhead, highest performance ceiling. This is the model national labs run.
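To make "Slurm schedules jobs directly on bare-metal nodes" concrete, here is a minimal sketch that generates a batch script. The `#SBATCH` options and `srun` are standard Slurm; the function name, job name, node counts, and training command are placeholders chosen for illustration, and the one-task-per-GPU layout is an assumption rather than a requirement.

```python
def sbatch_script(job_name, nodes, gpus_per_node, command):
    """Return the text of a Slurm batch script for a multi-node training job.

    Assumes one task per GPU and exclusive use of each node.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks-per-node={gpus_per_node}",
        f"#SBATCH --gres=gpu:{gpus_per_node}",
        "#SBATCH --exclusive",
        "",
        # srun launches one copy of the command per allocated task.
        f"srun {command}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder values; adjust to the actual cluster and workload.
    print(sbatch_script("llm-train", nodes=4, gpus_per_node=8,
                        command="python train.py"))
```

Saving the output to a file and passing it to `sbatch` submits the job.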
Design 3 — Kubernetes + Physical NIC + RoCE
Kubernetes without SR-IOV. Each node has one physical NIC shared by all pods through the host networking stack. Simpler to set up, but with weaker isolation between pods on the fabric.
Design 4 — Bare Metal + MPI + RoCE
Pure-MPI bare-metal design. No job scheduler. Direct mpirun across hosts over RoCE. The simplest possible HPC setup, and what most NCCL tutorials assume underneath.
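A direct launch of this kind might look like the sketch below, which assembles an `mpirun` command line. It assumes Open MPI (`--host host:slots` and `-x VAR=value` are its conventions) and uses the NCCL environment variables `NCCL_IB_HCA` and `NCCL_IB_GID_INDEX` to point NCCL at the RoCE device. The host names, the device name `mlx5_0`, the GID index, and the benchmark binary are hypothetical values that must be checked against the real hosts.

```python
import shlex


def build_mpirun(hosts, slots_per_host, program, rdma_device, gid_index):
    """Assemble an Open MPI mpirun command as an argv list.

    hosts          -- host names to launch on
    slots_per_host -- ranks per host (typically one per GPU)
    program        -- the program and its arguments, as a list
    rdma_device    -- RDMA device NCCL should use (site-specific)
    gid_index      -- GID table index selecting the RoCE address (site-specific)
    """
    host_arg = ",".join(f"{h}:{slots_per_host}" for h in hosts)
    total_ranks = len(hosts) * slots_per_host
    env = {
        "NCCL_IB_HCA": rdma_device,
        "NCCL_IB_GID_INDEX": str(gid_index),
    }
    cmd = ["mpirun", "-np", str(total_ranks), "--host", host_arg]
    for key, value in env.items():
        # -x exports the variable to every remote rank.
        cmd += ["-x", f"{key}={value}"]
    return cmd + list(program)


if __name__ == "__main__":
    cmd = build_mpirun(
        hosts=["node01", "node02"],           # hypothetical host names
        slots_per_host=8,
        program=["./all_reduce_perf", "-b", "8", "-e", "1G"],  # placeholder
        rdma_device="mlx5_0",                 # hypothetical device name
        gid_index=3,                          # assumption; verify per host
    )
    print(shlex.join(cmd))
```

Returning a list rather than a string keeps the command safe to pass to `subprocess.run` without shell quoting issues.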
Design 5 — Hybrid: Kubernetes + Slurm + RoCE
Production-realistic hybrid. Kubernetes runs inference, dashboards, and CI. Slurm runs the bare-metal HPC training jobs. Same RoCE fabric, both worlds, one cluster footprint.