
Host Networking

The question this page answers: how does an application running inside a Kubernetes pod get RDMA access to the NIC?

Short answer: SR-IOV creates virtual NICs, Multus attaches them to pods, and the GPU Operator manages the supporting drivers. Long answer below.

After this page, you'll be able to
  1. Distinguish PF from VF — why the host owns the PF and loads the driver against it, while each pod gets a hardware-isolated VF (64–256 per NIC) with its own queue pairs and DMA.
  2. Walk the SR-IOV setup chain — BIOS VT-d / AMD-Vi, intel_iommu=on, num_vfs=N, SR-IOV Network Operator, and the SR-IOV CNI — and debug the classic "VFs don't appear in /sys/class/net/" cmdline miss.
  3. Wire a multi-NIC training pod — Multus chaining Calico on eth0 plus SR-IOV net1..net8 per rail, via the k8s.v1.cni.cncf.io/networks annotation and NetworkAttachmentDefinitions.
  4. Order the GPU Operator + Network Operator deployment — the seven-step dependency chain from BIOS to pod annotations, and why getting one link out of order costs you hours.

The whole picture on one page — one physical NIC sliced into VFs, Multus wiring them into the pod, and the setup order you have to get right. The sections below walk each piece.

One physical NIC (ConnectX/Thor/E810) exposes a PF the host owns plus VF0/VF1/VF2 up to 64–256 VFs. A training pod gets eth0 via the Calico CNI (control plane) and net1..net8 via the SR-IOV CNI (one RDMA rail per VF). Multus is the meta-CNI in the middle attaching both. Below, the 7-step setup order: BIOS VT-d, kernel cmdline iommu + num_vfs, driver creates VFs, SR-IOV Operator, SR-IOV CNI, Multus, pod annotation.
The PF stays on the host; each pod gets a hardware-isolated VF. Get the order wrong — usually the cmdline — and VFs never appear in /sys/class/net.

PF vs VF

Modern RDMA NICs — ConnectX-7/8 (NVIDIA/Mellanox), Thor/Thor2 (Broadcom), E810 (Intel) — expose themselves as multiple PCIe functions. All three do SR-IOV and RoCE v2; the fabric doesn't care which one is in the slot — it sees the same DSCP-marked, PFC-honoring RoCE v2 traffic regardless:

  • PF (Physical Function) — the "main" NIC. One PF per physical NIC port. The host OS sees the PF and loads the RDMA driver against it.
  • VF (Virtual Function) — a slice of the NIC, hardware-isolated from other VFs. Each VF has its own queue pairs, memory protection, and (often) its own IP / MAC. A modern NIC can expose 64–256 VFs.

When you "give the pod a NIC," you're really giving it a VF. The PF stays on the host.

┌────────── Physical NIC ──────────┐
│ PF (host owns this)              │ ← driver loads here
│ ├── VF 0 (pod A gets this)       │
│ ├── VF 1 (pod B gets this)       │
│ ├── VF 2 (pod C gets this)       │
│ └── ... up to 64–256 VFs         │
└──────────────────────────────────┘

VFs are how multiple pods share one physical NIC without contending for the same queues: each VF gets isolated hardware queues and isolated DMA. The VFs still share the physical port, so the throughput they can sustain together is bounded by the NIC.


SR-IOV

SR-IOV (Single Root I/O Virtualization) is the PCIe spec that lets a device present multiple Virtual Functions. It's been around since 2007 — what's new is using it for RDMA at scale.

The setup chain:

  1. BIOS — enable Intel VT-d / AMD-Vi (IOMMU). Required for any VF passthrough.
  2. Kernel — intel_iommu=on (or amd_iommu=on) in the boot cmdline.
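  3. Driver — load the NIC driver with num_vfs=N to create N VFs per port (many current drivers take the count through the PF's sriov_numvfs sysfs file rather than a module parameter).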
  4. k8s — install the SR-IOV Network Operator (typically from Red Hat or NVIDIA, or deployed through the Network Operator). It manages VF inventory.
  5. CNI — the SR-IOV CNI plugin attaches a VF to a pod when scheduled.

If any of these steps is wrong, you get cryptic errors. The most common debug pattern: SR-IOV looks configured but VFs don't appear in /sys/class/net/. That's usually the kernel cmdline.
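The "VFs don't appear" debug pattern can be sketched as a small triage function. This is an illustration under assumptions, not a complete diagnostic: it only checks for the cmdline flags named above, whether /sys/kernel/iommu_groups is populated, and the PF's sriov_numvfs value. The function names and the PF name ens1f0 are hypothetical.

```python
from pathlib import Path


def diagnose_sriov(cmdline, iommu_groups, sriov_numvfs):
    """Map host state to the setup-chain steps most likely at fault."""
    findings = []
    tokens = set(cmdline.split())
    if not tokens & {"intel_iommu=on", "amd_iommu=on"}:
        findings.append("kernel cmdline: no intel_iommu=on / amd_iommu=on")
    if iommu_groups == 0:
        findings.append("IOMMU: no groups present, check BIOS VT-d / AMD-Vi")
    if sriov_numvfs == 0:
        findings.append("driver: PF reports 0 VFs")
    return findings


def read_host_state(pf_name):
    """Collect the three inputs from a live Linux host (sysfs paths may be absent)."""
    cmdline = Path("/proc/cmdline").read_text()
    groups_dir = Path("/sys/kernel/iommu_groups")
    iommu_groups = len(list(groups_dir.iterdir())) if groups_dir.is_dir() else 0
    numvfs_file = Path("/sys/class/net") / pf_name / "device" / "sriov_numvfs"
    sriov_numvfs = int(numvfs_file.read_text()) if numvfs_file.is_file() else 0
    return cmdline, iommu_groups, sriov_numvfs


if __name__ == "__main__":
    # Canned inputs so the sketch runs anywhere; on a real node use
    # diagnose_sriov(*read_host_state("ens1f0")) with your PF's name.
    for line in diagnose_sriov("BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro quiet", 0, 0):
        print(line)
```

The findings come back in setup-chain order, so the first entry is the earliest step worth fixing before re-checking the rest.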


Multus

Standard k8s gives each pod one network interface (eth0). That's fine for web workloads. AI training needs:

  • A "control" interface (for k8s control plane, image pulls, logs)
  • One or more "data" interfaces (the RDMA NICs)

Multus is a CNI meta-plugin that lets a pod attach to multiple networks. It chains other CNI plugins (Calico for control, SR-IOV for data) and presents the pod with multiple interfaces.

A typical AI training pod:

Pod ─┬── eth0 (Calico CNI, k8s control plane)
     ├── net1 (SR-IOV CNI, VF on rail 0)
     ├── net2 (SR-IOV CNI, VF on rail 1)
     ├── ...
     └── net8 (SR-IOV CNI, VF on rail 7)

Each netN is a VF on a different rail. With rail-optimized topology, this maps GPU-N to Rail N naturally.

The pod spec includes a k8s.v1.cni.cncf.io/networks annotation that tells Multus which NetworkAttachmentDefinitions (NADs) to attach. NADs are k8s resources that describe each network.
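As a rough sketch of how the annotation and the NADs fit together, the Python below builds one NAD per rail and a pod manifest that references them, then prints JSON that kubectl can apply. Several details are assumptions for illustration: the NAD names (rail0..rail7), the resource names under example.com/, the whereabouts IPAM block and its address ranges, the image, and the helper function names. The k8s.cni.cncf.io/v1 API group, the sriov CNI type, and the two annotation keys are the parts Multus and the SR-IOV plugins actually define.

```python
import json


def make_nad(rail, resource_name, namespace="default"):
    """Build a NetworkAttachmentDefinition for one rail (names are hypothetical)."""
    cni_config = {
        "cniVersion": "0.3.1",
        "name": f"rail{rail}",
        "type": "sriov",
        # Placeholder IPAM: the plugin choice and range are assumptions.
        "ipam": {"type": "whereabouts", "range": f"192.168.{rail}.0/24"},
    }
    return {
        "apiVersion": "k8s.cni.cncf.io/v1",
        "kind": "NetworkAttachmentDefinition",
        "metadata": {
            "name": f"rail{rail}",
            "namespace": namespace,
            # Ties this network to the VF pool advertised by the device plugin.
            "annotations": {"k8s.v1.cni.cncf.io/resourceName": resource_name},
        },
        # Multus expects the CNI config as a JSON string, not a nested object.
        "spec": {"config": json.dumps(cni_config)},
    }


def make_training_pod(name, image, rails, resource_prefix="example.com"):
    """Build a pod that asks Multus for one VF-backed interface per rail."""
    networks = ",".join(f"rail{r}" for r in rails)
    vf_resources = {f"{resource_prefix}/rail{r}_vf": "1" for r in rails}
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": name,
            "annotations": {"k8s.v1.cni.cncf.io/networks": networks},
        },
        "spec": {
            "containers": [{
                "name": "trainer",
                "image": image,
                "resources": {"requests": dict(vf_resources), "limits": dict(vf_resources)},
            }]
        },
    }


if __name__ == "__main__":
    rails = range(8)
    manifests = [make_nad(r, f"example.com/rail{r}_vf") for r in rails]
    manifests.append(make_training_pod("train-0", "example.com/trainer:latest", rails))
    print(json.dumps(manifests, indent=2))
```

By default Multus names the extra interfaces net1, net2, and so on in annotation order, so listing the NADs as rail0 through rail7 gives the net1..net8 layout in the diagram above.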


GPU Operator (and Network Operator)

NVIDIA's GPU Operator is a Kubernetes operator that automates the entire stack required to run GPU workloads:

  • NVIDIA driver
  • Container runtime hook (so containers see the GPU)
  • DCGM exporter (telemetry)
  • Node Feature Discovery (labels nodes with GPU info)
  • Optional: MIG support, vGPU, time-slicing

The Network Operator is the sibling for the NIC side:

  • Mellanox OFED driver (this Operator is mlx5/Mellanox-specific — Broadcom and Intel NICs use inbox rdma-core or a vendor package instead)
  • RDMA shared device plugin (so pods can request RDMA resources)
  • IB-K8s integration (if InfiniBand)
  • SR-IOV Network Operator integration (for VF management)

One naming clarification worth burning in: OFED (OpenFabrics Enterprise Distribution) is the OpenFabrics rdma-core stack, and it's NIC-agnostic. "MLNX_OFED" / "NVIDIA OFED" is just NVIDIA's packaging of it for the mlx5 family. The vendor-specific part isn't the RDMA stack; it's the GPU driver and compute stack (CUDA for NVIDIA, ROCm for AMD).

You install both. Together they bootstrap a node from "bare hardware" to "ready to schedule RDMA + GPU pods" in minutes. Without them, you're managing drivers, CNI configs, and device plugins by hand — error-prone and slow.


The order that has to be right

Here's the dependency chain. Any link in the wrong order and you'll spend hours debugging:

  1. Hardware enabled — BIOS VT-d / AMD-Vi on, all firmware updated
  2. OS configured — IOMMU, hugepages, RDMA core packages installed
  3. GPU Operator deployed — installs NVIDIA driver
  4. Network Operator deployed — installs Mellanox OFED, sets up VFs
  5. Multus installed — meta-CNI plugin
  6. NetworkAttachmentDefinitions created — one NAD per rail
  7. Pod spec uses the right annotations — Multus reads them and attaches the VFs

For first-time setups: budget a week to get this right end-to-end. For repeat setups with automation: minutes.
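One way to turn the chain into a debugging habit is to check the links in dependency order and stop at the first failure, since everything downstream of a broken link will fail too. The sketch below shows that pattern only. The step names mirror the list above, but the boolean probes are hypothetical stand-ins for whatever real checks you would run.

```python
def first_broken_link(steps):
    """Run checks in dependency order; return (step_number, name) of the first failure."""
    for number, (name, check) in enumerate(steps, start=1):
        if not check():
            return number, name
    return None


def build_steps(state):
    """Pair each link in the chain with a probe over a state dict (hypothetical keys)."""
    return [
        ("Hardware enabled", lambda: state["bios_iommu"]),
        ("OS configured", lambda: state["os_iommu"]),
        ("GPU Operator deployed", lambda: state["gpu_operator"]),
        ("Network Operator deployed", lambda: state["network_operator"]),
        ("Multus installed", lambda: state["multus"]),
        ("NetworkAttachmentDefinitions created", lambda: state["nads"]),
        ("Pod spec uses the right annotations", lambda: state["pod_annotations"]),
    ]


if __name__ == "__main__":
    # Example: the kernel cmdline was missed, so every later step also looks broken.
    state = {
        "bios_iommu": True,
        "os_iommu": False,
        "gpu_operator": True,
        "network_operator": False,
        "multus": True,
        "nads": True,
        "pod_annotations": True,
    }
    print(first_broken_link(build_steps(state)))
```

Reporting only the earliest failure is deliberate: in the example, the Network Operator failure is a symptom of the OS step, not a separate problem to chase.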


💡 What you should remember

# | Concept | Why it matters
1 | 🔌 PF is the physical NIC (host owns it). | VF is a hardware-isolated slice (pod gets it).
2 | 🧩 SR-IOV is the PCIe mechanism. | Requires BIOS + kernel + driver + Operator + CNI all configured.
3 | 🌐 Multus is what lets a pod have multiple network interfaces. | Needed because RDMA traffic goes through a different NIC than the k8s control plane.
4 | 🎛️ GPU Operator + Network Operator automate the driver / VF / plugin stack. | Don't try to do this by hand at scale.
5 | ⚠️ The setup chain has many steps. | Most production debugging is "which step was misconfigured?"

Next: What This Curriculum Picks → — bare metal as the teaching baseline, k8s + Multus + SR-IOV as the production variant.