Overview
This is the most modern and flexible HPC design. It uses Kubernetes for container orchestration, SR-IOV to expose up to 16 virtual functions (VFs) per physical NIC, and RoCE (RDMA over Converged Ethernet) for GPU-to-GPU RDMA over a lossless Ethernet fabric. By default a Kubernetes pod gets a single network interface, so Multus is needed to attach an additional SR-IOV interface to a pod. It is the best fit for large shared clusters that serve multiple teams and workloads.
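How Multus, the SR-IOV CNI, and a pod fit together can be sketched as two Kubernetes objects. The Python below is a minimal sketch, not this design's actual configuration: the network name, resource name, subnet, and container image are hypothetical placeholders, and the resource name in particular depends on how the SR-IOV device plugin is configured on the cluster. It builds a NetworkAttachmentDefinition that delegates to the SR-IOV CNI, and a pod that requests one VF and asks Multus for that network through the `k8s.v1.cni.cncf.io/networks` annotation.

```python
import json

# Hypothetical placeholders; real values depend on the cluster's
# SR-IOV device plugin configuration and address plan.
SRIOV_RESOURCE = "example.com/sriov_rdma"
NETWORK_NAME = "sriov-roce-net"


def network_attachment(name: str, resource: str) -> dict:
    """Build a Multus NetworkAttachmentDefinition that uses the SR-IOV CNI."""
    cni_config = {
        "cniVersion": "0.3.1",
        "name": name,
        "type": "sriov",  # delegate to the SR-IOV CNI plugin
        "ipam": {"type": "host-local", "subnet": "192.168.100.0/24"},
    }
    return {
        "apiVersion": "k8s.cni.cncf.io/v1",
        "kind": "NetworkAttachmentDefinition",
        "metadata": {
            "name": name,
            # Ties this network to the VF pool advertised by the device plugin.
            "annotations": {"k8s.v1.cni.cncf.io/resourceName": resource},
        },
        # Multus expects the CNI configuration as a JSON string.
        "spec": {"config": json.dumps(cni_config)},
    }


def training_pod(name: str, network: str, resource: str) -> dict:
    """Build a pod that gets its default interface plus one SR-IOV VF."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": name,
            # Multus reads this annotation to attach the secondary interface.
            "annotations": {"k8s.v1.cni.cncf.io/networks": network},
        },
        "spec": {
            "containers": [{
                "name": "trainer",
                "image": "example.com/trainer:latest",  # hypothetical image
                "resources": {
                    "requests": {resource: "1"},  # one VF for this pod
                    "limits": {resource: "1"},
                },
            }]
        },
    }


if __name__ == "__main__":
    for obj in (network_attachment(NETWORK_NAME, SRIOV_RESOURCE),
                training_pod("trainer-0", NETWORK_NAME, SRIOV_RESOURCE)):
        print(json.dumps(obj, indent=2))
```

The printed JSON can be saved to a file and applied with kubectl, which accepts JSON manifests as well as YAML. A real training pod would also request GPUs and RDMA-related settings, which are left out of this sketch.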
Best For
✦ Large production AI training clusters
✦ Multi-tenant environments
✦ Cloud-native teams
✦ Multiple users sharing the same GPU cluster
✦ High pod density: up to 16 pods per server
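The 16-pod figure lines up with 16 virtual functions per NIC if each server has one SR-IOV NIC and each pod takes exactly one VF. Both of those are assumptions, not statements from this design. A small sketch of the arithmetic, using a hypothetical helper name:

```python
def max_rdma_pods(nics_per_server: int, vfs_per_nic: int = 16,
                  vfs_per_pod: int = 1) -> int:
    """Upper bound on pods per server that can each hold their own VF(s)."""
    if min(nics_per_server, vfs_per_nic, vfs_per_pod) < 1:
        raise ValueError("all arguments must be positive integers")
    # Total VFs on the server, divided among pods that each need vfs_per_pod.
    return (nics_per_server * vfs_per_nic) // vfs_per_pod


if __name__ == "__main__":
    print(max_rdma_pods(nics_per_server=1))
    print(max_rdma_pods(nics_per_server=2, vfs_per_pod=4))
```

This only bounds density by VF count. GPUs, CPU, and memory per server may cap the number of useful pods well below it.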
Trade-Offs
⚠ Higher setup complexity
⚠ Requires SR-IOV capable NICs
⚠ Switches must be configured for lossless RoCE traffic (typically PFC and ECN)
⚠ Multus and the SR-IOV plugins are extra components to deploy and maintain
⚠ More moving parts to troubleshoot
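One way to check the SR-IOV requirement on a Linux host is to read the `sriov_totalvfs` and `sriov_numvfs` files that the kernel exposes under the NIC's sysfs device directory. The following is a minimal sketch under that assumption; the function name and the interface name `ens1f0` are hypothetical, and a NIC that reports support may still need SR-IOV enabled in firmware or BIOS.

```python
from pathlib import Path


def sriov_vf_counts(iface: str, sysfs_root: str = "/sys/class/net"):
    """Return (total_vfs, configured_vfs), or None if iface lacks SR-IOV files."""
    device = Path(sysfs_root) / iface / "device"
    total_file = device / "sriov_totalvfs"
    if not total_file.is_file():
        # No sriov_totalvfs entry: the device does not expose SR-IOV.
        return None
    total = int(total_file.read_text().strip())
    num_file = device / "sriov_numvfs"
    configured = int(num_file.read_text().strip()) if num_file.is_file() else 0
    return total, configured


if __name__ == "__main__":
    counts = sriov_vf_counts("ens1f0")  # hypothetical interface name
    if counts is None:
        print("ens1f0: no SR-IOV support reported")
    else:
        print(f"ens1f0: {counts[1]} of {counts[0]} VFs configured")
```

The `sysfs_root` parameter exists only so the function can be pointed at a test directory instead of the live `/sys` tree.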