# Multus and Multi-NIC Pods
A standard Kubernetes pod has one network interface (eth0). For web workloads that's enough. For AI training, where one pod needs 8 RDMA rails plus a control plane interface, it's nowhere near enough.
Multus is the meta-CNI that solves this. It's not a CNI plugin itself — it chains other CNI plugins together so a pod can have multiple interfaces, each managed by a different plugin.
## The architecture
```
              ┌─── Pod ────┐
 eth0 ────────┤ Calico CNI │ ← k8s control plane (Pod CIDR)
 net1 ────────┤ SR-IOV CNI │ ← VF on rail 0
 net2 ────────┤ SR-IOV CNI │ ← VF on rail 1
 net3 ────────┤ SR-IOV CNI │ ← VF on rail 2
  ...               ...
 net8 ────────┤ SR-IOV CNI │ ← VF on rail 7
              └────────────┘
```
- Multus is configured as the primary CNI (it's what kubelet calls).
- Multus then delegates to Calico (or Flannel, Cilium, etc.) for the default eth0.
- For each additional interface, Multus delegates to SR-IOV CNI, which moves a VF into the pod's network namespace.
The pod ends up with 9 interfaces. Each is independently routable, has its own MAC/IP, and looks like a "real" NIC from inside the pod.
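Under the hood, kubelet calls Multus via a node-level CNI config that names the delegate for the default network. A minimal sketch — the file name, kubeconfig path, and `clusterNetwork` value all vary by install, and older installs inline the delegate config under a `delegates` array instead:

```json
{
  "cniVersion": "0.3.1",
  "name": "multus-cni-network",
  "type": "multus",
  "kubeconfig": "/etc/cni/net.d/multus.d/multus.kubeconfig",
  "clusterNetwork": "calico"
}
```

`clusterNetwork` only covers eth0; everything else comes from the NetworkAttachmentDefinitions described next.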
## NetworkAttachmentDefinitions
You don't tell pods "give me a VF" directly. You tell them "give me network sriov-rail-0," and the cluster has a NetworkAttachmentDefinition (NAD) that says what that means.
A NAD is a k8s custom resource:
```yaml
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: sriov-rail-0
  namespace: ai-training
  annotations:
    k8s.v1.cni.cncf.io/resourceName: nvidia.com/rail0_vf
spec:
  config: '{
    "cniVersion": "0.3.1",
    "type": "sriov",
    "name": "sriov-rail-0",
    "ipam": {
      "type": "whereabouts",
      "range": "10.50.0.0/16",
      "exclude": ["10.50.0.0/24"]
    }
  }'
```
Breaking this down:

- `resourceName` — the k8s extended resource to allocate a VF from. The SR-IOV operator pre-creates this resource for the rail-0 VFs available on each node.
- `type: sriov` — the CNI plugin Multus will delegate to for this network.
- `ipam` — IP address management. `whereabouts` is the standard for "give me an IP from this range."
You'll have one NAD per rail — `sriov-rail-0` through `sriov-rail-7` for an 8-rail cluster.
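The per-rail NADs differ in only three places: the name, the resource pool, and the IPAM range. Rail 1's, for example, following the same naming scheme and the per-rail /16s from the IP plan below:

```yaml
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: sriov-rail-1
  namespace: ai-training
  annotations:
    k8s.v1.cni.cncf.io/resourceName: nvidia.com/rail1_vf
spec:
  config: '{
    "cniVersion": "0.3.1",
    "type": "sriov",
    "name": "sriov-rail-1",
    "ipam": {
      "type": "whereabouts",
      "range": "10.51.0.0/16",
      "exclude": ["10.51.0.0/24"]
    }
  }'
```

In practice these are templated (Helm, Kustomize) rather than written by hand eight times.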
## The pod spec
The pod requests the networks via annotation, and the VFs via container resources:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: training-worker-0
  namespace: ai-training
  annotations:
    k8s.v1.cni.cncf.io/networks: |
      [
        {"name": "sriov-rail-0", "interface": "net1"},
        {"name": "sriov-rail-1", "interface": "net2"},
        {"name": "sriov-rail-2", "interface": "net3"},
        {"name": "sriov-rail-3", "interface": "net4"},
        {"name": "sriov-rail-4", "interface": "net5"},
        {"name": "sriov-rail-5", "interface": "net6"},
        {"name": "sriov-rail-6", "interface": "net7"},
        {"name": "sriov-rail-7", "interface": "net8"}
      ]
spec:
  containers:
  - name: trainer
    image: nvcr.io/nvidia/pytorch:24.04-py3
    resources:
      limits:
        nvidia.com/gpu: 8
        nvidia.com/rail0_vf: 1
        nvidia.com/rail1_vf: 1
        nvidia.com/rail2_vf: 1
        nvidia.com/rail3_vf: 1
        nvidia.com/rail4_vf: 1
        nvidia.com/rail5_vf: 1
        nvidia.com/rail6_vf: 1
        nvidia.com/rail7_vf: 1
    securityContext:
      capabilities:
        add: ["IPC_LOCK", "SYS_NICE"]
```
Key parts:

- `networks` annotation — names which NADs to attach and what to call them inside the pod (net1–net8).
- `resources.limits` — requests one VF from each rail's pool. The scheduler ensures the node has them available.
- `IPC_LOCK` capability — required for the RDMA verbs to pin memory (registered memory regions).
- `SYS_NICE` — for NCCL to set thread priorities.
Once the pod starts, `ip link` inside the container shows all 9 interfaces, and `ibv_devinfo` lists the RDMA devices.
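Multus also records what it actually attached in the pod's `k8s.v1.cni.cncf.io/network-status` annotation, which is useful when interfaces go missing. A trimmed, illustrative example — the names and addresses here are hypothetical, and the exact fields vary with CNI and Multus versions:

```json
[
  {
    "name": "k8s-pod-network",
    "interface": "eth0",
    "ips": ["10.100.3.17"],
    "default": true
  },
  {
    "name": "ai-training/sriov-rail-0",
    "interface": "net1",
    "ips": ["10.50.1.4"],
    "device-info": {
      "type": "pci",
      "pci": { "pci-address": "0000:17:02.1" }
    }
  }
]
```

Comparing this annotation against the `networks` request is the fastest way to spot a silently failed attachment.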
## The IP plan
A typical IP layout for an 8-rail cluster:
| Rail | Subnet (per rail) | Scope |
|---|---|---|
| Rail 0 | 10.50.0.0/16 | Cluster-wide L3 |
| Rail 1 | 10.51.0.0/16 | Cluster-wide L3 |
| Rail 2 | 10.52.0.0/16 | Cluster-wide L3 |
| ... | ... | ... |
| Rail 7 | 10.57.0.0/16 | Cluster-wide L3 |
| K8s Pod CIDR | 10.100.0.0/16 | Standard k8s control plane |
Each rail is a separate L3 subnet, routed through BGP underlay across the spine. The training application uses the IPs on net1...net8 to connect to peers' VFs on the same rail.
## What can go wrong
Common Multus + SR-IOV bugs:
| Symptom | Cause | Fix |
|---|---|---|
| Pod stays in `ContainerCreating`, event: "no resources available" | SR-IOV operator hasn't registered VFs as a resource | Check operator pod logs; verify the NIC has VFs |
| Pod starts but net1–net8 don't appear | NAD name typo in annotation | Match annotation name to NAD `metadata.name` exactly |
| `ibv_devinfo` shows nothing inside the pod | `IPC_LOCK` capability missing | Add it to `securityContext.capabilities` |
| RDMA works but throughput is half | Pod scheduled across NUMA from the GPU | Add NUMA topology hints; use Topology Manager |
| Pods can ping each other but RDMA hangs | DSCP / priority not set on the VF | Configure mlnx_qos --trust dscp on PF (inherited by VFs) |
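The `nvidia.com/rail0_vf`-style pools in the first row come from the SR-IOV Network Operator's node policies. A sketch of one for rail 0 — the PF name, VF count, and namespace are assumptions for illustration, and the resource prefix (`nvidia.com/` here vs. the operator's default) depends on how the operator is configured:

```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: rail0-vfs
  namespace: network-operator      # hypothetical; use your operator's namespace
spec:
  resourceName: rail0_vf           # advertised with the operator's resource prefix
  nodeSelector:
    node-role.kubernetes.io/worker: ""
  numVfs: 8                        # hypothetical; match what the PF supports
  nicSelector:
    pfNames: ["ens1np0"]           # hypothetical PF name for rail 0
  deviceType: netdevice            # keep the kernel driver; VFs stay netdevs
  isRdma: true                     # expose RDMA resources on the VFs
```

If the resource never shows up in `kubectl describe node`, this policy (or the node state it produced) is the first place to look.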
## What you should remember
- Multus = meta-CNI that chains other CNI plugins so a pod can have multiple interfaces.
- NetworkAttachmentDefinition (NAD) describes each additional network — type, IPAM, resource pool.
- The pod spec's `k8s.v1.cni.cncf.io/networks` annotation names which NADs to attach.
- `resources.limits` claims the VFs — the scheduler ensures availability.
- `IPC_LOCK` capability is required for RDMA memory pinning. Almost everyone forgets this on first try.
- One subnet per rail — keeps L3 routing simple and matches the physical rail topology.
Next: NCCL and GPUDirect Configuration → making sure NCCL picks the right NICs and uses GPUDirect for zero-copy NIC ↔ GPU memory transfers.