Multus and Multi-NIC Pods
A standard Kubernetes pod has one network interface (eth0). For web workloads that's enough. For AI training, where one pod needs 8 RDMA rails plus a control plane interface, it's nowhere near enough.
Multus is the meta-CNI that solves this. It is a CNI plugin that attaches no interfaces of its own: it calls other CNI plugins so a pod can have multiple interfaces, each managed by a different plugin.
Watch a 5-interface pod (eth0 + 4 RDMA rails) get built end to end on the rockynet lab simulator: confirm the Multus DaemonSet, apply 4 NetworkAttachmentDefinitions, launch a pod that requests all 4 networks plus 8 GPUs, then run ip -br addr through kubectl exec to see net1 through net4, each on its own rail subnet.
- Explain why Multus exists: it's a meta-CNI that chains plugins so a pod gets eth0 (Calico/Cilium control plane) plus net1–net8 RDMA rails instead of a single interface.
- Write a NetworkAttachmentDefinition: type: sriov, the k8s.v1.cni.cncf.io/resourceName annotation, and whereabouts IPAM, one NAD per rail (sriov-rail-0 ... sriov-rail-7).
- Assemble the pod spec: the k8s.v1.cni.cncf.io/networks annotation that names each NAD, resources.limits claiming the VFs, and the IPC_LOCK + SYS_NICE capabilities RDMA needs.
- Debug the common failures: ContainerCreating from unregistered VFs, missing net1–net8 from NAD name typos, and ibv_devinfo empty without IPC_LOCK.
The architecture
```
               ┌──── Pod ────┐
               │             │
┌── eth0 ──────┤ Calico CNI  │ ← k8s control plane (Pod CIDR)
│              │             │
├── net1 ──────┤ SR-IOV CNI  │ ← VF on rail 0
│              │             │
├── net2 ──────┤ SR-IOV CNI  │ ← VF on rail 1
│              │             │
├── net3 ──────┤ SR-IOV CNI  │ ← VF on rail 2
│     ...      │             │
└── net8 ──────┤ SR-IOV CNI  │ ← VF on rail 7
               └─────────────┘
```
- Multus is configured as the primary CNI (it's the plugin the container runtime invokes when kubelet sets up a pod).
- Multus then delegates to Calico (or Flannel, Cilium, etc.) for the default eth0.
- For each additional interface, Multus delegates to SR-IOV CNI, which moves a VF into the pod's network namespace.
The pod ends up with 9 interfaces. Each is independently routable, has its own MAC/IP, and looks like a "real" NIC from inside the pod.
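The delegation order above can be pictured with a toy model. This is only a sketch of the bookkeeping, not Multus code: `Interface` and `plan_interfaces` are hypothetical names, and the model assumes the default behavior where additional interfaces are named net1, net2, ... in the order the networks are listed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interface:
    name: str     # interface name as seen inside the pod
    plugin: str   # CNI plugin the meta-CNI delegates to
    network: str  # network the interface attaches to


def plan_interfaces(default_plugin: str, extra_networks: list[str],
                    extra_plugin: str = "sriov") -> list[Interface]:
    """Toy model: eth0 from the default network, then net1..netN in list order."""
    plan = [Interface("eth0", default_plugin, "cluster-default")]
    for index, network in enumerate(extra_networks, start=1):
        plan.append(Interface(f"net{index}", extra_plugin, network))
    return plan


# Eight rails, named as in this lesson.
rails = [f"sriov-rail-{i}" for i in range(8)]
plan = plan_interfaces("calico", rails)
for iface in plan:
    print(f"{iface.name:<5} {iface.plugin:<7} {iface.network}")
```

The printed plan has nine rows: eth0 on the default network, then net1 through net8, one per rail.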
NetworkAttachmentDefinitions
You don't tell pods "give me a VF" directly. You tell them "give me network sriov-rail-0," and the cluster has a NetworkAttachmentDefinition (NAD) that says what that means.
A NAD is a k8s custom resource:
```yaml
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: sriov-rail-0
  namespace: ai-training
  annotations:
    k8s.v1.cni.cncf.io/resourceName: nvidia.com/rail0_vf
spec:
  config: '{
      "cniVersion": "0.3.1",
      "type": "sriov",
      "name": "sriov-rail-0",
      "ipam": {
        "type": "whereabouts",
        "range": "10.50.0.0/16",
        "exclude": ["10.50.0.0/24"]
      }
    }'
```
Breaking this down:
- resourceName is the k8s resource type to allocate from. The SR-IOV operator pre-creates this resource for the rail-0 VFs available on each node.
- type: sriov is the CNI plugin Multus will delegate to for this network.
- ipam is IP address management. whereabouts is the standard for "give me an IP from this range."
You'll have one NAD per rail — sriov-rail-0 through sriov-rail-7 for an 8-rail cluster.
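Because the eight NADs differ only in the rail number, they can be generated rather than hand-written. The sketch below is one way to do that with the Python standard library. `build_nad` is a hypothetical helper. It assumes the 10.50 through 10.57 subnets used in this lesson and assumes every rail excludes its first /24 the way rail 0 does. It emits JSON, which `kubectl apply -f` accepts alongside YAML.

```python
import json


def build_nad(rail: int, namespace: str = "ai-training") -> dict:
    """Build one NetworkAttachmentDefinition manifest for the given rail."""
    name = f"sriov-rail-{rail}"
    cni_config = {
        "cniVersion": "0.3.1",
        "type": "sriov",
        "name": name,
        "ipam": {
            "type": "whereabouts",
            "range": f"10.{50 + rail}.0.0/16",
            # Assumption: each rail reserves its first /24, mirroring rail 0.
            "exclude": [f"10.{50 + rail}.0.0/24"],
        },
    }
    return {
        "apiVersion": "k8s.cni.cncf.io/v1",
        "kind": "NetworkAttachmentDefinition",
        "metadata": {
            "name": name,
            "namespace": namespace,
            "annotations": {
                "k8s.v1.cni.cncf.io/resourceName": f"nvidia.com/rail{rail}_vf",
            },
        },
        # spec.config is a JSON string, not a nested object.
        "spec": {"config": json.dumps(cni_config)},
    }


nads = [build_nad(rail) for rail in range(8)]
manifest = {"apiVersion": "v1", "kind": "List", "items": nads}
print(json.dumps(manifest, indent=2))
```

Redirecting the output to a file (for example `nads.json`, a name chosen here for illustration) and applying it creates all eight definitions in one step.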
The pod spec
The pod requests the networks via annotation, and the VFs via container resources:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: training-worker-0
  namespace: ai-training
  annotations:
    k8s.v1.cni.cncf.io/networks: |
      [
        {"name": "sriov-rail-0", "interface": "net1"},
        {"name": "sriov-rail-1", "interface": "net2"},
        {"name": "sriov-rail-2", "interface": "net3"},
        {"name": "sriov-rail-3", "interface": "net4"},
        {"name": "sriov-rail-4", "interface": "net5"},
        {"name": "sriov-rail-5", "interface": "net6"},
        {"name": "sriov-rail-6", "interface": "net7"},
        {"name": "sriov-rail-7", "interface": "net8"}
      ]
spec:
  containers:
    - name: trainer
      image: nvcr.io/nvidia/pytorch:24.04-py3
      resources:
        limits:
          nvidia.com/gpu: 8
          nvidia.com/rail0_vf: 1
          nvidia.com/rail1_vf: 1
          nvidia.com/rail2_vf: 1
          nvidia.com/rail3_vf: 1
          nvidia.com/rail4_vf: 1
          nvidia.com/rail5_vf: 1
          nvidia.com/rail6_vf: 1
          nvidia.com/rail7_vf: 1
      securityContext:
        capabilities:
          add: ["IPC_LOCK", "SYS_NICE"]
```
Key parts:
- networks annotation: names which NADs to attach and what to call them inside the pod (net1–net8).
- resources.limits: counts the requested VFs from each rail's pool. The scheduler ensures the node has them available.
- IPC_LOCK capability: required for the RDMA verbs to pin memory (registered memory regions).
- SYS_NICE: for NCCL to set thread priorities.
Once the pod starts, inside the container you'll see all 9 interfaces with ip link, and ibv_devinfo will show the RDMA devices.
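The annotation and the resource limits must stay in step: one entry and one VF claim per rail. A small generator keeps them consistent when the rail count changes. This is a minimal sketch; `build_pod_fragments` is a hypothetical helper, and it assumes the sriov-rail-N and nvidia.com/railN_vf naming used in this lesson.

```python
import json


def build_pod_fragments(rails: int, gpus: int = 8) -> tuple[dict, dict]:
    """Return (metadata.annotations, resources.limits) for a pod using `rails` rails."""
    # Rail N attaches as interface net(N+1), because eth0 is the default network.
    networks = [
        {"name": f"sriov-rail-{rail}", "interface": f"net{rail + 1}"}
        for rail in range(rails)
    ]
    limits = {"nvidia.com/gpu": gpus}
    for rail in range(rails):
        limits[f"nvidia.com/rail{rail}_vf"] = 1  # one VF from each rail's pool
    annotations = {"k8s.v1.cni.cncf.io/networks": json.dumps(networks)}
    return annotations, limits


annotations, limits = build_pod_fragments(rails=8)
print(annotations["k8s.v1.cni.cncf.io/networks"])
print(json.dumps(limits, indent=2))
```

The two returned dictionaries drop into `metadata.annotations` and the container's `resources.limits` respectively.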
The IP plan
A typical IP layout for an 8-rail cluster:
| Rail | Subnet (per rail) | Scope |
|---|---|---|
| Rail 0 | 10.50.0.0/16 | Cluster-wide L3 |
| Rail 1 | 10.51.0.0/16 | Cluster-wide L3 |
| Rail 2 | 10.52.0.0/16 | Cluster-wide L3 |
| ... | ... | ... |
| Rail 7 | 10.57.0.0/16 | Cluster-wide L3 |
| K8s Pod CIDR | 10.100.0.0/16 | Standard k8s control plane |
Each rail is a separate L3 subnet, routed through the BGP underlay across the spine. The training application uses the IPs on net1 through net8 to connect to peers' VFs on the same rail.
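One way to confirm a running pod matches this plan is to parse the output of `ip -br addr` and check that each netN address falls inside its rail's subnet. The sketch below uses Python's `ipaddress` module. The helper names and the sample output are hypothetical (the addresses are made up for illustration), and the parser assumes the usual brief layout of name, state, then addresses.

```python
import ipaddress


def rail_subnet(rail: int) -> ipaddress.IPv4Network:
    """Subnet for a rail under this lesson's plan: rail 0 -> 10.50.0.0/16, and so on."""
    return ipaddress.ip_network(f"10.{50 + rail}.0.0/16")


def parse_ip_brief(output: str) -> dict[str, list[ipaddress.IPv4Interface]]:
    """Map interface name -> IPv4 addresses from `ip -br addr` style text."""
    result = {}
    for line in output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        name = fields[0].split("@")[0]  # veth names can look like eth0@if42
        addresses = []
        for token in fields[2:]:
            try:
                candidate = ipaddress.ip_interface(token)
            except ValueError:
                continue  # not an address token
            if candidate.version == 4:
                addresses.append(candidate)
        result[name] = addresses
    return result


def misplaced_interfaces(output: str, rails: int) -> list[str]:
    """Names of netN interfaces that are missing or not on their rail's subnet."""
    parsed = parse_ip_brief(output)
    bad = []
    for rail in range(rails):
        name = f"net{rail + 1}"
        if not any(addr.ip in rail_subnet(rail) for addr in parsed.get(name, [])):
            bad.append(name)
    return bad


# Hypothetical output from a 4-rail pod.
SAMPLE = """\
lo               UNKNOWN        127.0.0.1/8 ::1/128
eth0@if42        UP             10.100.3.17/32
net1             UP             10.50.1.5/16
net2             UP             10.51.1.5/16
net3             UP             10.52.1.5/16
net4             UP             10.53.1.5/16
"""
print(misplaced_interfaces(SAMPLE, rails=4))  # prints []
```

An empty list means every rail interface is present and on the expected subnet. Asking for 5 rails against the same sample would report net5 as missing.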
What can go wrong
Common Multus + SR-IOV bugs:
| Symptom | Cause | Fix |
|---|---|---|
| Pod stays in ContainerCreating, event: "no resources available" | SR-IOV operator hasn't registered VFs as a resource | Check operator pod logs; verify NIC has VFs |
| Pod starts but net1–net8 don't appear | NAD name typo in annotation | Match annotation name to NAD metadata.name exactly |
| ibv_devinfo shows nothing inside the pod | IPC_LOCK capability missing | Add to securityContext.capabilities |
| RDMA works but throughput is half | Pod scheduled across NUMA from the GPU | Add NUMA topology hints; use Topology Manager |
| Pods can ping each other but RDMA hangs | DSCP / priority not set on the VF | Configure mlnx_qos --trust dscp on PF (inherited by VFs) |
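Two rows in the table (the NAD name typo and the missing IPC_LOCK capability) can be caught before the pod is ever submitted. The following is a minimal offline check, not part of Multus or the SR-IOV operator. `preflight` is a hypothetical helper that takes manifests already loaded as dictionaries. It only handles the JSON-list form of the networks annotation and ignores namespaces.

```python
import json

NETWORKS_KEY = "k8s.v1.cni.cncf.io/networks"
RESOURCE_KEY = "k8s.v1.cni.cncf.io/resourceName"


def preflight(pod: dict, nads: list[dict]) -> list[str]:
    """Return human-readable problems found in a pod manifest."""
    problems = []
    nads_by_name = {nad["metadata"]["name"]: nad for nad in nads}
    requested = json.loads(pod["metadata"]["annotations"][NETWORKS_KEY])

    # Collect limits and added capabilities across all containers.
    limits, capabilities = {}, set()
    for container in pod["spec"]["containers"]:
        limits.update(container.get("resources", {}).get("limits", {}))
        security = container.get("securityContext", {})
        capabilities.update(security.get("capabilities", {}).get("add", []))

    for entry in requested:
        nad = nads_by_name.get(entry["name"])
        if nad is None:
            problems.append(f"no NAD named {entry['name']!r}")
            continue
        resource = nad["metadata"].get("annotations", {}).get(RESOURCE_KEY)
        if resource and resource not in limits:
            problems.append(f"{entry['name']} needs {resource} in resources.limits")

    if "IPC_LOCK" not in capabilities:
        problems.append("IPC_LOCK is not in securityContext.capabilities.add")
    return problems


# Hypothetical two-rail example with a deliberate typo in the second NAD name
# and no IPC_LOCK capability.
nads = [
    {"metadata": {"name": f"sriov-rail-{rail}",
                  "annotations": {RESOURCE_KEY: f"nvidia.com/rail{rail}_vf"}}}
    for rail in range(2)
]
pod = {
    "metadata": {"annotations": {NETWORKS_KEY: json.dumps(
        [{"name": "sriov-rail-0"}, {"name": "sriov-rial-1"}])}},
    "spec": {"containers": [{
        "resources": {"limits": {"nvidia.com/rail0_vf": 1, "nvidia.com/rail1_vf": 1}},
        "securityContext": {"capabilities": {"add": ["SYS_NICE"]}},
    }]},
}
for problem in preflight(pod, nads):
    print(problem)
```

For this sample the check reports the misspelled `sriov-rial-1` and the missing IPC_LOCK. The remaining rows in the table (unregistered VFs, NUMA placement, DSCP trust) depend on node state and need checks on the cluster itself.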
💡 What you should remember
| # | | Concept | Why it matters |
|---|---|---|---|
| 1 | 🧩 | Multus = meta-CNI | that chains other CNI plugins so a pod can have multiple interfaces. |
| 2 | 🏷️ | NetworkAttachmentDefinition (NAD) | describes each additional network — type, IPAM, resource pool. |
| 3 | 📝 | The pod spec's k8s.v1.cni.cncf.io/networks annotation | names which NADs to attach. |
| 4 | 📦 | resources.limits claims the VFs | the scheduler ensures availability. |
| 5 | 🔑 | IPC_LOCK capability is required | for RDMA memory pinning. Almost everyone forgets this on first try. |
| 6 | 🌐 | One subnet per rail | keeps L3 routing simple and matches the physical rail topology. |
Next: NCCL and GPUDirect Configuration →, making sure NCCL picks the right NICs and uses GPUDirect for zero-copy NIC ↔ GPU memory transfers.