Host Networking
The question this page answers: how does an application running inside a Kubernetes pod get RDMA access to the NIC?
Short answer: SR-IOV creates virtual NICs, Multus attaches them to pods, and the GPU Operator manages the supporting drivers. Long answer below.
- Distinguish PF from VF: why the host owns the PF and loads the driver against it, while each pod gets a hardware-isolated VF (64–256 per NIC) with its own queue pairs and DMA.
- Walk the SR-IOV setup chain: BIOS VT-d / AMD-Vi, intel_iommu=on, num_vfs=N, SR-IOV Network Operator, and the SR-IOV CNI, and debug the classic "VFs don't appear in /sys/class/net/" cmdline miss.
- Wire a multi-NIC training pod: Multus chaining Calico on eth0 plus SR-IOV net1–net8 per rail, via the k8s.v1.cni.cncf.io/networks annotation and NetworkAttachmentDefinitions.
- Order the GPU Operator + Network Operator deployment: the seven-step dependency chain from BIOS to pod annotations, and why getting one link out of order costs you hours.
The whole picture on one page — one physical NIC sliced into VFs, Multus wiring them into the pod, and the setup order you have to get right. The sections below walk each piece.
PF vs VF
Modern RDMA NICs — ConnectX-7/8 (NVIDIA/Mellanox), Thor/Thor2 (Broadcom), E810 (Intel) — expose themselves as multiple PCIe functions. All three do SR-IOV and RoCE v2; the fabric doesn't care which one is in the slot — it sees the same DSCP-marked, PFC-honoring RoCE v2 traffic regardless:
- PF (Physical Function) — the "main" NIC. One PF per physical NIC port. The host OS sees the PF and loads the RDMA driver against it.
- VF (Virtual Function) — a slice of the NIC, hardware-isolated from other VFs. Each VF has its own queue pairs, memory protection, and (often) its own IP / MAC. A modern NIC can expose 64–256 VFs.
When you "give the pod a NIC," you're really giving it a VF. The PF stays on the host.
┌────────── Physical NIC ──────────┐
│ PF (host owns this)              │ ← driver loads here
│ ├── VF 0 (pod A gets this)       │
│ ├── VF 1 (pod B gets this)       │
│ ├── VF 2 (pod C gets this)       │
│ └── ... up to 64–256 VFs         │
└──────────────────────────────────┘
VFs are how multiple pods share one physical NIC without contending for queues or memory: each VF gets isolated hardware queues and isolated DMA. The VFs still share the physical port, so their combined throughput is bounded by the NIC.
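To make the PF/VF split concrete, here is a minimal sketch that reads the standard Linux sysfs SR-IOV counters for a PF. It assumes a Linux host; the interface name `ens1f0np0` is a hypothetical placeholder, and the `sysfs_net` parameter exists only so the function can be pointed at a test directory.

```python
from pathlib import Path


def vf_summary(pf_name: str, sysfs_net: str = "/sys/class/net") -> dict:
    """Report how many VFs a PF supports and how many are currently enabled."""
    device = Path(sysfs_net) / pf_name / "device"
    total_file = device / "sriov_totalvfs"
    if not total_file.exists():
        # No sriov_totalvfs file: the interface is missing, is itself a VF,
        # or the device has no SR-IOV capability.
        return {"pf": pf_name, "sriov_capable": False, "total_vfs": 0, "enabled_vfs": 0}
    total = int(total_file.read_text().strip())
    enabled = int((device / "sriov_numvfs").read_text().strip())
    return {"pf": pf_name, "sriov_capable": True, "total_vfs": total, "enabled_vfs": enabled}


if __name__ == "__main__":
    # "ens1f0np0" is a hypothetical PF interface name; substitute your own.
    print(vf_summary("ens1f0np0"))
```

A result with `total_vfs` above zero but `enabled_vfs` at zero suggests the NIC supports SR-IOV but no VFs have been created yet.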
SR-IOV
SR-IOV (Single Root I/O Virtualization) is the PCIe spec that lets a device present multiple Virtual Functions. It's been around since 2007 — what's new is using it for RDMA at scale.
The setup chain:
- BIOS: enable Intel VT-d / AMD-Vi (IOMMU). Required for any VF passthrough.
- Kernel: intel_iommu=on (or amd_iommu=on) in the boot cmdline.
- Driver: load the NIC driver with num_vfs=N to create N VFs per port (on current kernels this is usually done by writing N to the PF's sriov_numvfs file in sysfs).
- k8s: install the SR-IOV Network Operator (typically from Red Hat or NVIDIA, or deployed alongside NVIDIA's Network Operator). It manages VF inventory.
- CNI: the SR-IOV CNI plugin attaches a VF to a pod when scheduled.
If any of these steps is wrong, you get cryptic errors. The most common debug pattern: SR-IOV looks configured but VFs don't appear in /sys/class/net/. That's usually the kernel cmdline.
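Since the kernel cmdline is the usual culprit, a quick check can rule it in or out. The sketch below is one possible approach, not a complete diagnostic: it looks for the two IOMMU parameters named above in a cmdline string and checks whether the kernel has populated `/sys/kernel/iommu_groups`. The function names are illustrative.

```python
from pathlib import Path


def iommu_flags(cmdline: str) -> dict:
    """Report which IOMMU parameters appear in a kernel command line string."""
    params = cmdline.split()
    return {
        "intel_iommu_on": "intel_iommu=on" in params,
        "amd_iommu_on": "amd_iommu=on" in params,
    }


def iommu_active(sysfs_groups: str = "/sys/kernel/iommu_groups") -> bool:
    """True when the kernel has created at least one IOMMU group."""
    groups = Path(sysfs_groups)
    return groups.is_dir() and any(groups.iterdir())


if __name__ == "__main__":
    proc = Path("/proc/cmdline")
    # Fall back to an empty string on systems without /proc/cmdline.
    cmdline = proc.read_text() if proc.exists() else ""
    print(iommu_flags(cmdline))
    print("IOMMU groups present:", iommu_active())
```

If both flags are false and no IOMMU groups exist, the cmdline (or the BIOS setting before it) is the first place to look.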
Multus
Standard k8s gives each pod one network interface (eth0). That's fine for web workloads. AI training needs:
- A "control" interface (for k8s control plane, image pulls, logs)
- One or more "data" interfaces (the RDMA NICs)
Multus is a CNI meta-plugin that lets a pod attach to multiple networks. It chains other CNI plugins (Calico for control, SR-IOV for data) and presents the pod with multiple interfaces.
A typical AI training pod:
Pod ─┬── eth0 (Calico CNI, k8s control plane)
     ├── net1 (SR-IOV CNI, VF on rail 0)
     ├── net2 (SR-IOV CNI, VF on rail 1)
     ├── ...
     └── net8 (SR-IOV CNI, VF on rail 7)
Each netN is a VF on a different rail. With rail-optimized topology, this maps GPU-N to Rail N naturally.
The pod spec includes a k8s.v1.cni.cncf.io/networks annotation that tells Multus which NetworkAttachmentDefinitions (NADs) to attach. NADs are k8s resources that describe each network.
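As a rough illustration of how the annotation and the NADs fit together, the sketch below builds one NAD per rail and the matching pod annotation as plain Python dictionaries, then prints them as JSON (which `kubectl apply -f` also accepts). Several details are assumptions for illustration: the NAD names `rail0`..`rail7`, the resource name `example.com/sriov_railN`, the `host-local` IPAM choice, and the subnets are all placeholders, not values from this page.

```python
import json

RAILS = 8  # one SR-IOV attachment per rail, matching the net1..net8 layout


def rail_nad(rail: int) -> dict:
    """Build one NetworkAttachmentDefinition for the given rail index."""
    cni_config = {
        "cniVersion": "0.3.1",
        "name": f"rail{rail}",
        "type": "sriov",
        # Placeholder IPAM: host-local with an illustrative subnet per rail.
        "ipam": {"type": "host-local", "subnet": f"192.168.{rail}.0/24"},
    }
    return {
        "apiVersion": "k8s.cni.cncf.io/v1",
        "kind": "NetworkAttachmentDefinition",
        "metadata": {
            "name": f"rail{rail}",
            # Hypothetical resource name tying this NAD to a pool of VFs.
            "annotations": {
                "k8s.v1.cni.cncf.io/resourceName": f"example.com/sriov_rail{rail}"
            },
        },
        # The CNI config is embedded in the NAD as a JSON string.
        "spec": {"config": json.dumps(cni_config)},
    }


def pod_annotation(nads: list) -> dict:
    """Comma-separated NAD names, in the order Multus should attach them."""
    names = [nad["metadata"]["name"] for nad in nads]
    return {"k8s.v1.cni.cncf.io/networks": ",".join(names)}


if __name__ == "__main__":
    nads = [rail_nad(r) for r in range(RAILS)]
    print(json.dumps(nads[0], indent=2))
    print(json.dumps(pod_annotation(nads), indent=2))
```

With Multus's default naming, the attachments show up in the pod as net1, net2, and so on, in the order listed in the annotation. In setups that use the SR-IOV device plugin, the pod usually also requests the matching resource so that a VF is allocated to it.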
GPU Operator (and Network Operator)
NVIDIA's GPU Operator is a Kubernetes operator that automates the entire stack required to run GPU workloads:
- NVIDIA driver
- Container runtime hook (so containers see the GPU)
- DCGM exporter (telemetry)
- Node Feature Discovery (labels nodes with GPU info)
- Optional: MIG support, vGPU, time-slicing
The Network Operator is the sibling for the NIC side:
- Mellanox OFED driver (this Operator is mlx5/Mellanox-specific; Broadcom and Intel NICs use inbox rdma-core or a vendor package instead)
- RDMA shared device plugin (so pods can request RDMA resources)
- IB-K8s integration (if InfiniBand)
- SR-IOV Network Operator integration (for VF management)
One naming clarification worth burning in: OFED (OpenFabrics Enterprise Distribution) is the OpenFabrics RDMA stack, with rdma-core as its userspace side, and it's NIC-agnostic. "MLNX_OFED" / "NVIDIA OFED" is just NVIDIA's packaging of it for the mlx5 family. The vendor-specific part isn't the RDMA stack; it's the GPU driver (CUDA for NVIDIA, ROCm for AMD).
You install both. Together they bootstrap a node from "bare hardware" to "ready to schedule RDMA + GPU pods" in minutes. Without them, you're managing drivers, CNI configs, and device plugins by hand — error-prone and slow.
The order that has to be right
Here's the dependency chain. Any link in the wrong order and you'll spend hours debugging:
- Hardware enabled — BIOS VT-d / AMD-Vi on, all firmware updated
- OS configured — IOMMU, hugepages, RDMA core packages installed
- GPU Operator deployed — installs NVIDIA driver
- Network Operator deployed — installs Mellanox OFED, sets up VFs
- Multus installed — meta-CNI plugin
- NetworkAttachmentDefinitions created — one NAD per rail
- Pod spec uses the right annotations (Multus reads them and attaches the VFs)
For first-time setups: budget a week to get this right end-to-end. For repeat setups with automation: minutes.
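One way to automate the "which link is broken?" question is to run checks in dependency order and stop at the first failure, since everything after it is unreliable anyway. The sketch below shows only that control flow; the step names mirror the list above, and every check is a hypothetical stand-in (a lambda returning True) that you would replace with a real probe.

```python
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, Callable[[], bool]]


def first_broken_step(steps: List[Step]) -> Optional[str]:
    """Run checks in dependency order; return the first failing step's name."""
    for name, check in steps:
        if not check():
            return name
    return None


if __name__ == "__main__":
    # Placeholder checks: each lambda stands in for a real probe.
    chain: List[Step] = [
        ("Hardware enabled", lambda: True),
        ("OS configured", lambda: True),
        ("GPU Operator deployed", lambda: True),
        ("Network Operator deployed", lambda: True),
        ("Multus installed", lambda: True),
        ("NetworkAttachmentDefinitions created", lambda: True),
        ("Pod spec annotations", lambda: True),
    ]
    broken = first_broken_step(chain)
    print("All steps passed" if broken is None else f"First broken step: {broken}")
```

Stopping at the first failure keeps the report short: a missing IOMMU setting, for example, would otherwise surface as a cascade of misleading errors in every later step.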
💡 What you should remember
| # | | Concept | Why it matters |
|---|---|---|---|
| 1 | 🔌 | PF is the physical NIC (host owns it). | VF is a hardware-isolated slice (pod gets it). |
| 2 | 🧩 | SR-IOV is the PCIe mechanism. | Requires BIOS + kernel + driver + Operator + CNI all configured. |
| 3 | 🌐 | Multus is what lets a pod have multiple network interfaces | needed because RDMA traffic goes through a different NIC than k8s control plane. |
| 4 | 🎛️ | GPU Operator + Network Operator automate the driver / VF / plugin stack. | Don't try to do this by hand at scale. |
| 5 | ⚠️ | The setup chain has many steps. | Most production debugging is "which step was misconfigured?" |
Next: What This Curriculum Picks → — bare metal as the teaching baseline, k8s + Multus + SR-IOV as the production variant.