
Design 1 — Kubernetes + SR-IOV + RoCE

The most modern and flexible of the HPC designs. Kubernetes handles orchestration. SR-IOV splits each physical NIC into up to 16 virtual functions (VFs), so each pod gets a dedicated hardware path. Multus attaches those VFs to pods as additional interfaces, and RoCE (RDMA over Converged Ethernet) carries RDMA traffic between GPUs over a lossless Ethernet fabric.
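
To make the wiring concrete, here is a minimal sketch of what a pod that asks for one VF might look like, built as a plain Python dict and printed as JSON. The annotation key `k8s.v1.cni.cncf.io/networks` is the one Multus reads and `nvidia.com/gpu` is the usual GPU resource name. Everything else is an assumption for illustration: the network attachment name `roce-net`, the VF resource name `example.com/roce_vf` (the real name depends on how the SR-IOV Network Device Plugin is configured), the image, and the `IPC_LOCK` capability, which RDMA workloads commonly need in order to pin memory.

```python
import json


def build_rdma_pod(name, image, network_attachment, vf_resource, gpus=1):
    """Return a Pod manifest (dict) requesting one SR-IOV VF and `gpus` GPUs."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": name,
            "annotations": {
                # Multus attaches the named NetworkAttachmentDefinition
                # as a secondary interface next to the default CNI one.
                "k8s.v1.cni.cncf.io/networks": network_attachment,
            },
        },
        "spec": {
            "containers": [
                {
                    "name": "trainer",
                    "image": image,
                    "resources": {
                        "limits": {
                            # Hypothetical resource name advertised by the
                            # SR-IOV Network Device Plugin on this cluster.
                            vf_resource: "1",
                            "nvidia.com/gpu": str(gpus),
                        }
                    },
                    # Assumption: allow memory pinning for RDMA buffers.
                    "securityContext": {"capabilities": {"add": ["IPC_LOCK"]}},
                }
            ]
        },
    }


pod = build_rdma_pod(
    name="train-0",
    image="registry.example.com/trainer:latest",  # hypothetical image
    network_attachment="roce-net",                # hypothetical attachment name
    vf_resource="example.com/roce_vf",            # hypothetical resource name
    gpus=8,
)
print(json.dumps(pod, indent=2))
```

The design point is that the VF is requested like any other schedulable resource, so the scheduler only places the pod on a node that still has a free VF.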

Best for: Large production AI training clusters, multi-tenant environments.

Trade-offs: Higher setup complexity. Requires SR-IOV-capable NICs and RoCE-tuned switches.

After this page, you'll be able to
  1. Walk the 15-layer stack for this design and name where it breaks — from SR-IOV VF creation and the SR-IOV Network Device Plugin through Multus secondary interfaces up to NCCL all-reduce, the layers a generic K8s cluster doesn't have.
  2. Quantify the multi-tenant payoff — why slicing one NIC into up to 16 virtual functions lets many pods share hardware with isolated queues, and what that buys you over a shared physical NIC.
  3. Decide when this design earns its setup cost — pick it over the other four when multi-tenancy and per-pod NIC isolation are hard requirements, not when a single team would never touch the partitioning.
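
The VF-creation layer and the multi-tenant arithmetic can be sketched in a few lines. On Linux, a NIC that supports SR-IOV exposes `sriov_totalvfs` and `sriov_numvfs` under `/sys/class/net/<iface>/device/`. The sketch below clamps a requested VF count to what the NIC reports and to this page's ceiling of 16, then writes it. The interface name `ens1f0`, the figure of 8 NICs per node, and the helper names are assumptions for illustration. The demo runs against a fake sysfs tree in a temporary directory so it works without SR-IOV hardware; on a real host the write needs root.

```python
import tempfile
from pathlib import Path

MAX_VFS_PER_NIC = 16  # per-NIC ceiling quoted on this page


def plan_vf_count(iface, requested, sysfs_root="/sys/class/net"):
    """Clamp a requested VF count to the NIC's reported maximum and to 16."""
    total_file = Path(sysfs_root) / iface / "device" / "sriov_totalvfs"
    total = int(total_file.read_text().strip())
    return max(0, min(requested, total, MAX_VFS_PER_NIC))


def set_vf_count(iface, count, sysfs_root="/sys/class/net"):
    """Write the VF count to sysfs (requires root on a real host)."""
    numvfs = Path(sysfs_root) / iface / "device" / "sriov_numvfs"
    # Reset to 0 first: the kernel refuses to change one nonzero
    # VF count directly into another.
    numvfs.write_text("0")
    numvfs.write_text(str(count))


def dedicated_pod_slots(nics_per_node, vfs_per_nic):
    """Number of pods per node that can each hold their own VF."""
    return nics_per_node * vfs_per_nic


# Demo against a fake sysfs tree (hypothetical NIC reporting 8 VFs).
with tempfile.TemporaryDirectory() as root:
    dev = Path(root) / "ens1f0" / "device"
    dev.mkdir(parents=True)
    (dev / "sriov_totalvfs").write_text("8\n")
    (dev / "sriov_numvfs").write_text("0\n")
    planned = plan_vf_count("ens1f0", requested=16, sysfs_root=root)
    set_vf_count("ens1f0", planned, sysfs_root=root)
    applied = (dev / "sriov_numvfs").read_text()

print(planned, applied, dedicated_pod_slots(nics_per_node=8, vfs_per_nic=16))
```

With the assumed 8 NICs per node and the full 16 VFs each, a node offers 128 dedicated VF slots, compared with 8 if every pod needed a whole physical NIC.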

Architecture

Build steps — the 15 layers

From rack power up to NCCL all-reduce. Each layer is a place where things can break.


When to pick this design

Pick this when:

  • You run a large multi-tenant production cluster where many teams or jobs share the same GPU nodes and you need each pod to get its own isolated NIC path.
  • Your NICs are SR-IOV-capable (up to 16 VFs each) and your switches are already RoCE-tuned with PFC and ECN.
  • You want cloud-native orchestration — declarative scheduling, self-service, rolling upgrades — without giving up RDMA performance.

Avoid it when:

  • A single small team owns the whole cluster and never needs to slice a NIC — the SR-IOV + Multus machinery is pure overhead. Reach for Design 3 instead.
  • Your hardware doesn't expose SR-IOV, or you can't budget the operational complexity of Device Plugin, Multus, and per-VF tuning.
  • You need the absolute highest performance ceiling for one big synchronous job — bare metal (Design 2) shaves the container and CNI layers.
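
The two lists above can be read as a small decision procedure. The sketch below is one possible encoding, not an official rule: the field names, the order of the checks, and the wording of the fallback answers are assumptions. Only the pointers to Design 2 (bare metal) and Design 3 come from this page.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    multi_tenant: bool              # many teams or jobs share GPU nodes
    nics_support_sriov: bool        # hardware exposes SR-IOV
    switches_roce_tuned: bool       # PFC and ECN already configured
    can_run_sriov_stack: bool       # budget for Device Plugin, Multus, per-VF tuning
    one_big_sync_job: bool = False  # chasing the absolute performance ceiling


def recommend(r):
    """Map the pick/avoid criteria to a suggestion (check order is an assumption)."""
    if r.one_big_sync_job:
        return "Design 2"  # bare metal shaves the container and CNI layers
    if not r.multi_tenant:
        return "Design 3"  # SR-IOV + Multus would be pure overhead
    if not (r.nics_support_sriov and r.can_run_sriov_stack):
        return "not Design 1"
    if not r.switches_roce_tuned:
        return "Design 1, after RoCE tuning"
    return "Design 1"


shared_cluster = Requirements(
    multi_tenant=True,
    nics_support_sriov=True,
    switches_roce_tuned=True,
    can_run_sriov_stack=True,
)
print(recommend(shared_cluster))
```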

💡 What you should remember

# | Concept | Why it matters
1 | 🔪 SR-IOV slices one physical NIC into up to 16 virtual functions | Each pod gets a dedicated hardware NIC path with its own queues, so tenants don't fight over one shared interface.
2 | 🧵 Multus attaches the VF as a secondary pod interface | The default CNI handles control traffic; the RoCE VF rides a second interface so RDMA bypasses the host stack entirely.
3 | ⚙️ Flexibility is paid for in setup layers | The SR-IOV Network Device Plugin, Multus, and per-VF RoCE tuning are layers a plain K8s cluster never has, and every one is a place the stack can break.
4 | 🏢 This is the multi-tenant production default | When isolation between teams is a hard requirement, this design is the one that scales; if it isn't, you're overbuilding.
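
The split between control traffic and RDMA traffic also shows up at the top of the stack, where NCCL has to be told which interface does what. A hedged sketch of that mapping follows. `NCCL_SOCKET_IFNAME`, `NCCL_IB_HCA`, and `NCCL_IB_GID_INDEX` are real NCCL environment variables, but the values here are assumptions: `eth0` as the default-CNI interface, `mlx5_2` as the RDMA device backing the VF, and GID index 3 as the RoCE v2 entry. All three depend on the NIC, the driver, and the cluster configuration.

```python
def nccl_env(control_iface="eth0", rdma_device="mlx5_2", gid_index=3):
    """Build NCCL environment variables for a pod with one RoCE VF.

    All default values are illustrative assumptions, not recommendations.
    """
    return {
        # Bootstrap and control traffic stays on the default CNI interface.
        "NCCL_SOCKET_IFNAME": control_iface,
        # RDMA traffic uses the device that backs the SR-IOV VF.
        "NCCL_IB_HCA": rdma_device,
        # Which GID table entry to use; the RoCE v2 entry varies per host.
        "NCCL_IB_GID_INDEX": str(gid_index),
    }


def as_export_lines(env):
    """Render the variables as shell export lines, sorted by name."""
    return [f"export {key}={value}" for key, value in sorted(env.items())]


env = nccl_env()
for line in as_export_lines(env):
    print(line)
```

In practice these values would more likely be injected through the pod spec than typed by hand; the sketch only shows which knob corresponds to which interface.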

What's next