Design 1 — Kubernetes + SR-IOV + RoCE

The most modern and flexible HPC design. Kubernetes handles orchestration. SR-IOV exposes up to 16 virtual functions (VFs) per physical NIC, so each pod gets dedicated hardware. Multus attaches those VFs to pods as additional interfaces. RoCE carries RDMA traffic between GPUs over an Ethernet fabric tuned to be lossless.

Best for: Large production AI training clusters, multi-tenant environments.

Trade-offs: Higher setup complexity. Requires SR-IOV-capable NICs and RoCE-tuned switches.
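To make the pod wiring concrete, here is a minimal Python sketch that builds a pod manifest requesting one SR-IOV VF and attaching it through Multus. The annotation key `k8s.v1.cni.cncf.io/networks` is the one Multus reads. Everything else is an assumption made for illustration: the network attachment name `roce-vf-net`, the extended resource name `example.com/sriov_roce_vf` (the real name depends on how the SR-IOV device plugin is configured), the container image, and the helper function names. The 16-VF limit is the figure quoted above, not a value read from hardware.

```python
import json

# Annotation key that Multus reads to attach additional pod interfaces.
MULTUS_ANNOTATION = "k8s.v1.cni.cncf.io/networks"

# Per-NIC VF limit quoted in this design; real limits depend on NIC and firmware.
MAX_VFS_PER_NIC = 16


def vf_capacity(physical_nics: int, vfs_per_nic: int = MAX_VFS_PER_NIC) -> int:
    """Upper bound on VFs (and so VF-backed pod interfaces) one node can offer."""
    if physical_nics < 0 or vfs_per_nic < 0:
        raise ValueError("counts must be non-negative")
    return physical_nics * vfs_per_nic


def build_pod_manifest(name: str, network_attachment: str, vf_resource: str,
                       image: str, vf_count: int = 1) -> dict:
    """Build a pod manifest that requests vf_count VFs and attaches each via Multus.

    network_attachment and vf_resource are hypothetical names; they must match
    whatever NetworkAttachmentDefinition and device-plugin resource exist in
    the target cluster.
    """
    if vf_count < 1:
        raise ValueError("vf_count must be at least 1")
    # Multus accepts a comma-separated list; repeat the name once per interface.
    networks = ",".join([network_attachment] * vf_count)
    # Extended resources must have requests equal to limits.
    vf_quantity = {vf_resource: str(vf_count)}
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": name,
            "annotations": {MULTUS_ANNOTATION: networks},
        },
        "spec": {
            "containers": [{
                "name": "trainer",
                "image": image,
                "resources": {"requests": vf_quantity, "limits": vf_quantity},
            }],
        },
    }


if __name__ == "__main__":
    pod = build_pod_manifest(
        name="nccl-worker-0",
        network_attachment="roce-vf-net",          # hypothetical name
        vf_resource="example.com/sriov_roce_vf",   # hypothetical name
        image="registry.example.com/trainer:latest",  # hypothetical image
    )
    print(json.dumps(pod, indent=2))
    print("VF capacity with 2 NICs:", vf_capacity(2))
```

The printed JSON can be piped to `kubectl apply -f -` once the names match the cluster. Building the manifest in code rather than by hand keeps the annotation and the resource request in step, which is a common source of pods that schedule but come up without their RDMA interface.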

Architecture

Build steps — the 15 layers

From rack power up to NCCL all-reduce. Each layer is a place where things break.


What's next