
Multus and Multi-NIC Pods

A standard Kubernetes pod has one network interface (eth0). For web workloads that's enough. For AI training, where one pod needs 8 RDMA rails plus a control plane interface, it's nowhere near enough.

Multus is the meta-CNI that solves this. It doesn't wire up any interface itself; it calls other CNI plugins in turn so a pod can have multiple interfaces, each managed by a different plugin.

[Figure: Multus and SR-IOV CNI architecture]
  • Top, training-job pod with three interfaces: eth0 (Calico/Cilium, k8s control / TCP, pod CIDR), net1 (SR-IOV CNI, rail-0 IP, RoCE v2), net2 (SR-IOV CNI, rail-1 IP, RoCE v2).
  • Middle, host CNI layer: Calico CNI for eth0 (veth + iptables), SR-IOV CNI for net1 and net2 (moves VF into pod netns).
  • Bottom, physical NICs: management NIC for eth0 (25 Gbps TCP), VF on NIC-0 (rail-0, 400 Gbps RoCE), VF on NIC-1 (rail-1, 400 Gbps RoCE).
Caption: eth0 stays for the k8s control plane. net1/net2 are RDMA rails, directly tied to NIC VFs via SR-IOV CNI.

Watch a 5-interface pod (eth0 + 4 RDMA rails) get built end-to-end on the rockynet lab simulator — confirm Multus daemonset, apply 4 NetworkAttachmentDefinitions, launch a pod that requests all 4 + 8 GPUs, then kubectl exec ip -br addr showing net1..net4 each on its own rail subnet:

MODULE host-networking · LAB 4
Watch the recording: every command, every counter, every output.
After this page, you'll be able to
  1. Explain why Multus exists: it's a meta-CNI that chains plugins so a pod gets eth0 (Calico/Cilium control plane) plus net1 through net8 RDMA rails instead of a single interface.
  2. Write a NetworkAttachmentDefinition: type: sriov, the k8s.v1.cni.cncf.io/resourceName annotation, and whereabouts IPAM, one NAD per rail (sriov-rail-0...sriov-rail-7).
  3. Assemble the pod spec: the k8s.v1.cni.cncf.io/networks annotation that names each NAD, resources.limits claiming the VFs, and the IPC_LOCK + SYS_NICE capabilities RDMA needs.
  4. Debug the common failures: ContainerCreating from unregistered VFs, missing net1 through net8 from NAD name typos, and an empty ibv_devinfo without IPC_LOCK.

The architecture

               ┌─── Pod ────┐
               │            │
┌── eth0 ──────┤ Calico CNI │ ← k8s control plane (Pod CIDR)
│              │            │
├── net1 ──────┤ SR-IOV CNI │ ← VF on rail 0
│              │            │
├── net2 ──────┤ SR-IOV CNI │ ← VF on rail 1
│              │            │
├── net3 ──────┤ SR-IOV CNI │ ← VF on rail 2
│   ...        │            │
└── net8 ──────┤ SR-IOV CNI │ ← VF on rail 7
               └────────────┘

  • Multus is configured as the primary CNI (it's the plugin the container runtime invokes when kubelet asks for a pod sandbox).
  • Multus then delegates to Calico (or Flannel, Cilium, etc.) for the default eth0.
  • For each additional interface, Multus delegates to SR-IOV CNI, which moves a VF into the pod's network namespace.

The pod ends up with 9 interfaces. Each is independently routable, has its own MAC/IP, and looks like a "real" NIC from inside the pod.
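
The delegation order described above can be modeled in a few lines. This is a toy sketch in Python, not Multus code: plan_interfaces and the attachment dictionaries are hypothetical names made up for illustration. It only mirrors the naming rule that the default network becomes eth0 and additional attachments are named net1, net2, ... in annotation order unless an explicit interface name is given.

```python
def plan_interfaces(default_cni, attachments):
    """Return (interface name, delegate plugin) pairs in the order they are set up.

    Hypothetical helper: models the delegation order only, no real CNI calls.
    """
    # The default cluster network always provides eth0.
    plan = [("eth0", default_cni)]
    # Each additional attachment gets netN unless it names its own interface.
    for index, attachment in enumerate(attachments, start=1):
        ifname = attachment.get("interface", f"net{index}")
        plan.append((ifname, attachment["type"]))
    return plan


# Eight SR-IOV rails, matching the diagram above.
rail_attachments = [{"name": f"sriov-rail-{i}", "type": "sriov"} for i in range(8)]
plan = plan_interfaces("calico", rail_attachments)

if __name__ == "__main__":
    for ifname, plugin in plan:
        print(f"{ifname:5s} <- {plugin}")
```

The resulting plan has nine entries, one eth0 plus eight rails, which is the count the pod sees (not including loopback).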


NetworkAttachmentDefinitions

You don't tell pods "give me a VF" directly. You tell them "give me network sriov-rail-0," and the cluster has a NetworkAttachmentDefinition (NAD) that says what that means.

A NAD is a k8s custom resource:

apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: sriov-rail-0
  namespace: ai-training
  annotations:
    k8s.v1.cni.cncf.io/resourceName: nvidia.com/rail0_vf
spec:
  config: '{
    "cniVersion": "0.3.1",
    "type": "sriov",
    "name": "sriov-rail-0",
    "ipam": {
      "type": "whereabouts",
      "range": "10.50.0.0/16",
      "exclude": ["10.50.0.0/24"]
    }
  }'

Breaking this down:

  • resourceName: the k8s resource type to allocate from. The SR-IOV operator pre-creates this resource for the rail-0 VFs available on each node.
  • type: sriov: the CNI plugin Multus will delegate to for this network.
  • ipam: IP address management. whereabouts is a cluster-wide IPAM plugin commonly paired with Multus for "give me an IP from this range" without handing the same address to pods on two different nodes.

You'll have one NAD per rail — sriov-rail-0 through sriov-rail-7 for an 8-rail cluster.
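
Since the eight NADs differ only in the rail number, they can be generated rather than hand-written. The following is a minimal sketch, assuming each rail N uses the subnet 10.(50+N).0.0/16 from the IP plan and reserves its first /24 the way the rail-0 example does; build_nad is a hypothetical helper name, and the reservation for rails 1 to 7 is an assumption extrapolated from rail 0.

```python
import json


def build_nad(rail, namespace="ai-training"):
    """Return a NetworkAttachmentDefinition manifest (as a dict) for one rail."""
    name = f"sriov-rail-{rail}"
    cni_config = {
        "cniVersion": "0.3.1",
        "type": "sriov",
        "name": name,
        "ipam": {
            "type": "whereabouts",
            # Assumed plan: rail N -> 10.(50+N).0.0/16, first /24 excluded.
            "range": f"10.{50 + rail}.0.0/16",
            "exclude": [f"10.{50 + rail}.0.0/24"],
        },
    }
    return {
        "apiVersion": "k8s.cni.cncf.io/v1",
        "kind": "NetworkAttachmentDefinition",
        "metadata": {
            "name": name,
            "namespace": namespace,
            "annotations": {
                "k8s.v1.cni.cncf.io/resourceName": f"nvidia.com/rail{rail}_vf",
            },
        },
        # spec.config is a JSON string, not a nested object.
        "spec": {"config": json.dumps(cni_config)},
    }


nads = [build_nad(rail) for rail in range(8)]

if __name__ == "__main__":
    # Emit a v1 List in JSON, which kubectl accepts on stdin.
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": nads}, indent=2))
```

Saved under a hypothetical name such as gen_nads.py, the output could be applied with `python gen_nads.py | kubectl apply -f -`.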


The pod spec

The pod requests the networks via annotation, and the VFs via container resources:

apiVersion: v1
kind: Pod
metadata:
  name: training-worker-0
  namespace: ai-training
  annotations:
    k8s.v1.cni.cncf.io/networks: |
      [
        {"name": "sriov-rail-0", "interface": "net1"},
        {"name": "sriov-rail-1", "interface": "net2"},
        {"name": "sriov-rail-2", "interface": "net3"},
        {"name": "sriov-rail-3", "interface": "net4"},
        {"name": "sriov-rail-4", "interface": "net5"},
        {"name": "sriov-rail-5", "interface": "net6"},
        {"name": "sriov-rail-6", "interface": "net7"},
        {"name": "sriov-rail-7", "interface": "net8"}
      ]
spec:
  containers:
  - name: trainer
    image: nvcr.io/nvidia/pytorch:24.04-py3
    resources:
      limits:
        nvidia.com/gpu: 8
        nvidia.com/rail0_vf: 1
        nvidia.com/rail1_vf: 1
        nvidia.com/rail2_vf: 1
        nvidia.com/rail3_vf: 1
        nvidia.com/rail4_vf: 1
        nvidia.com/rail5_vf: 1
        nvidia.com/rail6_vf: 1
        nvidia.com/rail7_vf: 1
    securityContext:
      capabilities:
        add: ["IPC_LOCK", "SYS_NICE"]

Key parts:

  • networks annotation: names which NADs to attach and what to call them inside the pod (net1 through net8).
  • resources.limits: counts the requested VFs from each rail's pool. The scheduler ensures the node has them available.
  • IPC_LOCK capability: required for the RDMA verbs to pin memory (registered memory regions).
  • SYS_NICE: for NCCL to set thread priorities.
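
The annotation and the limits follow the same per-rail pattern, so both can be derived from the rail count. This is a sketch under the naming used in this page (sriov-rail-N, nvidia.com/railN_vf, netN+1); the helper names rail_attachments_json and rail_limits are hypothetical.

```python
import json


def rail_attachments_json(num_rails):
    """JSON string for the k8s.v1.cni.cncf.io/networks annotation."""
    return json.dumps([
        {"name": f"sriov-rail-{i}", "interface": f"net{i + 1}"}
        for i in range(num_rails)
    ])


def rail_limits(num_rails, gpus=8):
    """resources.limits claiming one VF per rail plus the GPUs."""
    limits = {"nvidia.com/gpu": gpus}
    for i in range(num_rails):
        limits[f"nvidia.com/rail{i}_vf"] = 1
    return limits


def build_pod(name, num_rails=8, namespace="ai-training"):
    """Assemble the pod manifest shown above as a dict."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": name,
            "namespace": namespace,
            "annotations": {
                "k8s.v1.cni.cncf.io/networks": rail_attachments_json(num_rails),
            },
        },
        "spec": {
            "containers": [{
                "name": "trainer",
                "image": "nvcr.io/nvidia/pytorch:24.04-py3",
                "resources": {"limits": rail_limits(num_rails)},
                "securityContext": {
                    "capabilities": {"add": ["IPC_LOCK", "SYS_NICE"]},
                },
            }],
        },
    }


pod = build_pod("training-worker-0")

if __name__ == "__main__":
    print(json.dumps(pod, indent=2))
```

Deriving both lists from one number keeps the annotation and the limits from drifting apart when the rail count changes.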

Once the pod starts, inside the container you'll see all 9 interfaces with ip link, and ibv_devinfo will show the RDMA devices.
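
The same check can be scripted for use inside the container. The sketch below assumes the standard Linux sysfs locations /sys/class/net (network interfaces) and /sys/class/infiniband (RDMA devices, including RoCE); the helper names list_dir and missing_interfaces are hypothetical.

```python
import os

# eth0 plus the eight rails requested in the pod spec.
EXPECTED = ["eth0"] + [f"net{i}" for i in range(1, 9)]


def list_dir(path):
    """Return the sorted entries of a sysfs directory, or [] if it is absent."""
    return sorted(os.listdir(path)) if os.path.isdir(path) else []


def missing_interfaces(expected, present):
    """Return the expected interface names that are not present, in order."""
    present = set(present)
    return [name for name in expected if name not in present]


if __name__ == "__main__":
    netdevs = list_dir("/sys/class/net")
    rdma_devices = list_dir("/sys/class/infiniband")
    print("missing interfaces:", missing_interfaces(EXPECTED, netdevs) or "none")
    print("RDMA devices:", rdma_devices or "none visible")
```

It only reports what the kernel exposes to the container; ip link and ibv_devinfo remain the authoritative tools.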


The IP plan

A typical IP layout for an 8-rail cluster:

| Rail | Subnet (per rail) | Scope |
|---|---|---|
| Rail 0 | 10.50.0.0/16 | Cluster-wide L3 |
| Rail 1 | 10.51.0.0/16 | Cluster-wide L3 |
| Rail 2 | 10.52.0.0/16 | Cluster-wide L3 |
| ... | ... | ... |
| Rail 7 | 10.57.0.0/16 | Cluster-wide L3 |
| K8s Pod CIDR | 10.100.0.0/16 | Standard k8s control plane |

Each rail is a separate L3 subnet, routed over the BGP underlay across the spine. The training application uses the IPs on net1 through net8 to connect to peers' VFs on the same rail.
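
The plan above can be sanity-checked with Python's standard ipaddress module. This is a sketch that assumes the elided rails 3 to 6 continue the 10.(50+N).0.0/16 pattern; rail_subnet, overlapping_pairs, and rail_of are hypothetical helper names.

```python
import ipaddress

POD_CIDR = ipaddress.ip_network("10.100.0.0/16")


def rail_subnet(rail):
    """Subnet for a rail, assuming the 10.(50+N).0.0/16 pattern."""
    return ipaddress.ip_network(f"10.{50 + rail}.0.0/16")


rails = {rail: rail_subnet(rail) for rail in range(8)}


def overlapping_pairs(networks):
    """Return every pair of networks that overlap (empty list means a clean plan)."""
    networks = list(networks)
    return [
        (a, b)
        for i, a in enumerate(networks)
        for b in networks[i + 1:]
        if a.overlaps(b)
    ]


def rail_of(address):
    """Return the rail number an IP belongs to, or None if it is not a rail IP."""
    ip = ipaddress.ip_address(address)
    for rail, subnet in rails.items():
        if ip in subnet:
            return rail
    return None


if __name__ == "__main__":
    conflicts = overlapping_pairs(list(rails.values()) + [POD_CIDR])
    print("overlaps:", conflicts or "none")
    print("10.53.4.7 is on rail", rail_of("10.53.4.7"))
```

A helper like rail_of is handy when reading logs: a peer address immediately tells you which rail, and therefore which NIC, a connection used.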


What can go wrong

Common Multus + SR-IOV bugs:

| Symptom | Cause | Fix |
|---|---|---|
| Pod stays in ContainerCreating, event: "no resources available" | SR-IOV operator hasn't registered VFs as a resource | Check operator pod logs; verify NIC has VFs |
| Pod starts but net1 through net8 don't appear | NAD name typo in annotation | Match annotation name to NAD metadata.name exactly |
| ibv_devinfo shows nothing inside the pod | IPC_LOCK capability missing | Add to securityContext.capabilities |
| RDMA works but throughput is half | Pod's VF sits on a different NUMA node from the GPU | Add NUMA topology hints; use Topology Manager |
| Pods can ping each other but RDMA hangs | DSCP / priority not set on the VF | Configure mlnx_qos --trust dscp on PF (inherited by VFs) |
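
Several of these failures are visible in the manifest before the pod is ever scheduled. The sketch below is a hypothetical pre-flight check, not part of Multus or the SR-IOV operator: lint_pod flags attachment names that match no known NAD, attachments whose resource has no matching limit, and a missing IPC_LOCK capability. It assumes the JSON-list form of the networks annotation used on this page and NADs in the pod's own namespace.

```python
import json

NETWORKS_KEY = "k8s.v1.cni.cncf.io/networks"


def lint_pod(pod, nad_resources):
    """Return a list of problems found in a pod manifest (dict).

    nad_resources maps each NAD name to the resourceName in its annotation.
    """
    problems = []
    annotations = pod.get("metadata", {}).get("annotations", {})
    attachments = json.loads(annotations.get(NETWORKS_KEY, "[]"))

    # Merge limits and added capabilities across all containers.
    limits, capabilities = {}, set()
    for container in pod.get("spec", {}).get("containers", []):
        limits.update(container.get("resources", {}).get("limits", {}))
        security = container.get("securityContext", {})
        capabilities.update(security.get("capabilities", {}).get("add", []))

    for attachment in attachments:
        name = attachment["name"]
        if name not in nad_resources:
            problems.append(f"unknown NAD {name!r}: check for a typo")
        elif nad_resources[name] not in limits:
            problems.append(f"{name}: no limit for {nad_resources[name]}")
    if attachments and "IPC_LOCK" not in capabilities:
        problems.append("IPC_LOCK capability missing")
    return problems


# Example: two known NADs, a pod with one typo and no IPC_LOCK.
nad_resources = {f"sriov-rail-{i}": f"nvidia.com/rail{i}_vf" for i in range(2)}
bad_pod = {
    "metadata": {"annotations": {NETWORKS_KEY: json.dumps([
        {"name": "sriov-rail-0", "interface": "net1"},
        {"name": "sriov-rial-1", "interface": "net2"},  # deliberate typo
    ])}},
    "spec": {"containers": [{
        "name": "trainer",
        "resources": {"limits": {"nvidia.com/gpu": 8, "nvidia.com/rail0_vf": 1}},
        "securityContext": {"capabilities": {"add": ["SYS_NICE"]}},
    }]},
}
problems = lint_pod(bad_pod, nad_resources)

if __name__ == "__main__":
    for problem in problems:
        print("-", problem)
```

The NUMA and DSCP rows depend on node and NIC state, so they cannot be caught from the manifest alone.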

💡 What you should remember

| # | Concept | Why it matters |
|---|---|---|
| 1 | 🧩 Multus = meta-CNI | that chains other CNI plugins so a pod can have multiple interfaces. |
| 2 | 🏷️ NetworkAttachmentDefinition (NAD) | describes each additional network: type, IPAM, resource pool. |
| 3 | 📝 The pod spec's k8s.v1.cni.cncf.io/networks annotation | names which NADs to attach. |
| 4 | 📦 resources.limits claims the VFs | the scheduler ensures availability. |
| 5 | 🔑 IPC_LOCK capability is required | for RDMA memory pinning. Almost everyone forgets this on first try. |
| 6 | 🌐 One subnet per rail | keeps L3 routing simple and matches the physical rail topology. |

Next: NCCL and GPUDirect Configuration → making sure NCCL picks the right NICs and uses GPUDirect for zero-copy NIC ↔ GPU memory transfers.