# Multus and Multi-NIC Pods
A standard Kubernetes pod has one network interface (eth0). For web workloads that's enough. For AI training, where one pod needs 8 RDMA rails plus a control plane interface, it's nowhere near enough.
Multus is the meta-CNI that solves this. It's not a CNI plugin itself — it chains other CNI plugins together so a pod can have multiple interfaces, each managed by a different plugin.
## The architecture
```
              ┌─── Pod ────┐
 eth0 ────────┤ Calico CNI │ ← k8s control plane (Pod CIDR)
 net1 ────────┤ SR-IOV CNI │ ← VF on rail 0
 net2 ────────┤ SR-IOV CNI │ ← VF on rail 1
 net3 ────────┤ SR-IOV CNI │ ← VF on rail 2
  ...               ...
 net8 ────────┤ SR-IOV CNI │ ← VF on rail 7
              └────────────┘
```
- Multus is configured as the primary CNI (it's what kubelet calls).
- Multus then delegates to Calico (or Flannel, Cilium, etc.) for the default eth0.
- For each additional interface, Multus delegates to SR-IOV CNI, which moves a VF into the pod's network namespace.
The pod ends up with 9 interfaces. Each is independently routable, has its own MAC/IP, and looks like a "real" NIC from inside the pod.
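Under the hood, kubelet calls Multus via a node-level CNI config that names the delegate for the default network. A minimal sketch — the file name, kubeconfig path, and `clusterNetwork` value all vary by install, and older installs inline the delegate config under a `delegates` array instead:

```json
{
  "cniVersion": "0.3.1",
  "name": "multus-cni-network",
  "type": "multus",
  "kubeconfig": "/etc/cni/net.d/multus.d/multus.kubeconfig",
  "clusterNetwork": "calico"
}
```

`clusterNetwork` only covers eth0; everything else comes from the NetworkAttachmentDefinitions described next.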
## NetworkAttachmentDefinitions
You don't tell pods "give me a VF" directly. You tell them "give me network sriov-rail-0," and the cluster has a NetworkAttachmentDefinition (NAD) that says what that means.
A NAD is a k8s custom resource:
```yaml
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: sriov-rail-0
  namespace: ai-training
  annotations:
    k8s.v1.cni.cncf.io/resourceName: nvidia.com/rail0_vf
spec:
  config: '{
    "cniVersion": "0.3.1",
    "type": "sriov",
    "name": "sriov-rail-0",
    "ipam": {
      "type": "whereabouts",
      "range": "10.50.0.0/16",
      "exclude": ["10.50.0.0/24"]
    }
  }'
```
Breaking this down:

- `resourceName` — the k8s extended resource to allocate a VF from. The SR-IOV operator pre-creates this resource for the rail-0 VFs available on each node.
- `type: sriov` — the CNI plugin Multus will delegate to for this network.
- `ipam` — IP address management. `whereabouts` is the standard for "give me an IP from this range."
You'll have one NAD per rail — `sriov-rail-0` through `sriov-rail-7` for an 8-rail cluster.
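The per-rail NADs differ in only three places: the name, the resource pool, and the IPAM range. Rail 1's, for example, following the same naming scheme and the per-rail /16s from the IP plan below:

```yaml
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: sriov-rail-1
  namespace: ai-training
  annotations:
    k8s.v1.cni.cncf.io/resourceName: nvidia.com/rail1_vf
spec:
  config: '{
    "cniVersion": "0.3.1",
    "type": "sriov",
    "name": "sriov-rail-1",
    "ipam": {
      "type": "whereabouts",
      "range": "10.51.0.0/16",
      "exclude": ["10.51.0.0/24"]
    }
  }'
```

In practice these are templated (Helm, Kustomize) rather than written by hand eight times.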
## The pod spec
The pod requests the networks via annotation, and the VFs via container resources:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: training-worker-0
  namespace: ai-training
  annotations:
    k8s.v1.cni.cncf.io/networks: |
      [
        {"name": "sriov-rail-0", "interface": "net1"},
        {"name": "sriov-rail-1", "interface": "net2"},
        {"name": "sriov-rail-2", "interface": "net3"},
        {"name": "sriov-rail-3", "interface": "net4"},
        {"name": "sriov-rail-4", "interface": "net5"},
        {"name": "sriov-rail-5", "interface": "net6"},
        {"name": "sriov-rail-6", "interface": "net7"},
        {"name": "sriov-rail-7", "interface": "net8"}
      ]
spec:
  containers:
  - name: trainer
    image: nvcr.io/nvidia/pytorch:24.04-py3
    resources:
      limits:
        nvidia.com/gpu: 8
        nvidia.com/rail0_vf: 1
        nvidia.com/rail1_vf: 1
        nvidia.com/rail2_vf: 1
        nvidia.com/rail3_vf: 1
        nvidia.com/rail4_vf: 1
        nvidia.com/rail5_vf: 1
        nvidia.com/rail6_vf: 1
        nvidia.com/rail7_vf: 1
    securityContext:
      capabilities:
        add: ["IPC_LOCK", "SYS_NICE"]
```
Key parts:

- `networks` annotation — names which NADs to attach and what to call them inside the pod (net1–net8).
- `resources.limits` — requests one VF from each rail's pool. The scheduler ensures the node has them available.
- `IPC_LOCK` capability — required for the RDMA verbs to pin memory (registered memory regions).
- `SYS_NICE` — for NCCL to set thread priorities.
Once the pod starts, `ip link` inside the container shows all 9 interfaces, and `ibv_devinfo` lists the RDMA devices.
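Multus also records what it actually attached in the pod's `k8s.v1.cni.cncf.io/network-status` annotation, which is useful when interfaces go missing. A trimmed, illustrative example — the names and addresses here are hypothetical, and the exact fields vary with CNI and Multus versions:

```json
[
  {
    "name": "k8s-pod-network",
    "interface": "eth0",
    "ips": ["10.100.3.17"],
    "default": true
  },
  {
    "name": "ai-training/sriov-rail-0",
    "interface": "net1",
    "ips": ["10.50.1.4"],
    "device-info": {
      "type": "pci",
      "pci": { "pci-address": "0000:17:02.1" }
    }
  }
]
```

Comparing this annotation against the `networks` request is the fastest way to spot a silently failed attachment.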
## The IP plan
A typical IP layout for an 8-rail cluster:
| Rail | Subnet (per rail) | Scope |
|---|---|---|
| Rail 0 | 10.50.0.0/16 | Cluster-wide L3 |
| Rail 1 | 10.51.0.0/16 | Cluster-wide L3 |
| Rail 2 | 10.52.0.0/16 | Cluster-wide L3 |
| ... | ... | ... |
| Rail 7 | 10.57.0.0/16 | Cluster-wide L3 |
| K8s Pod CIDR | 10.100.0.0/16 | Standard k8s control plane |
Each rail is a separate L3 subnet, routed through BGP underlay across the spine. The training application uses the IPs on net1...net8 to connect to peers' VFs on the same rail.
## What can go wrong
Common Multus + SR-IOV bugs:
| Symptom | Cause | Fix |
|---|---|---|
| Pod stays in `ContainerCreating`, event: "no resources available" | SR-IOV operator hasn't registered VFs as a resource | Check operator pod logs; verify the NIC has VFs |
| Pod starts but net1–net8 don't appear | NAD name typo in annotation | Match annotation name to NAD `metadata.name` exactly |
| `ibv_devinfo` shows nothing inside the pod | `IPC_LOCK` capability missing | Add it to `securityContext.capabilities` |
| RDMA works but throughput is half | Pod scheduled across NUMA from the GPU | Add NUMA topology hints; use Topology Manager |
| Pods can ping each other but RDMA hangs | DSCP / priority not set on the VF | Configure mlnx_qos --trust dscp on PF (inherited by VFs) |
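The `nvidia.com/rail0_vf`-style pools in the first row come from the SR-IOV Network Operator's node policies. A sketch of one for rail 0 — the PF name, VF count, and namespace are assumptions for illustration, and the resource prefix (`nvidia.com/` here vs. the operator's default) depends on how the operator is configured:

```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: rail0-vfs
  namespace: network-operator      # hypothetical; use your operator's namespace
spec:
  resourceName: rail0_vf           # advertised with the operator's resource prefix
  nodeSelector:
    node-role.kubernetes.io/worker: ""
  numVfs: 8                        # hypothetical; match what the PF supports
  nicSelector:
    pfNames: ["ens1np0"]           # hypothetical PF name for rail 0
  deviceType: netdevice            # keep the kernel driver; VFs stay netdevs
  isRdma: true                     # expose RDMA resources on the VFs
```

If the resource never shows up in `kubectl describe node`, this policy (or the node state it produced) is the first place to look.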
## What you should remember
- Multus = meta-CNI that chains other CNI plugins so a pod can have multiple interfaces.
- NetworkAttachmentDefinition (NAD) describes each additional network — type, IPAM, resource pool.
- The pod spec's `k8s.v1.cni.cncf.io/networks` annotation names which NADs to attach.
- `resources.limits` claims the VFs — the scheduler ensures availability.
- `IPC_LOCK` capability is required for RDMA memory pinning. Almost everyone forgets this on first try.
- One subnet per rail — keeps L3 routing simple and matches the physical rail topology.
Next: NCCL and GPUDirect Configuration → making sure NCCL picks the right NICs and uses GPUDirect for zero-copy NIC ↔ GPU memory transfers.