What This Curriculum Picks
Two layers. Bare metal as the teaching baseline — simpler, fewer moving parts, lets the reader see how RDMA + GPU + switch actually behave without orchestration noise. Then Kubernetes + Multus + SR-IOV + GPU Operator as the production variant — what you'll actually deploy.
Layer 1: Bare metal foundation
Why teach this first: RDMA and the AI fabric exist independent of Kubernetes. If you don't first understand the wire and the host, k8s networking becomes mystifying.
On a bare-metal node:
- The host kernel runs the RDMA driver (`mlx5_core`, `irdma`, etc.) against the PF directly.
- Applications open RDMA verbs via `libibverbs` against the PF.
- One application has the whole NIC. Simple, fast, debuggable. `rdma`, `ibv_devinfo`, `ibstat`, and `perftest` (`ib_write_bw`, `ib_read_bw`) all work directly.
This is the model for HPC and for learning. Every concept this curriculum teaches — verbs, QPs, MRs, PFC, DCQCN — you can observe and tune on a single bare-metal node with two NICs and a switch.
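A minimal bare-metal sanity check with the tools above might look like this. Device name (`mlx5_0`) and the peer IP are assumptions for illustration; substitute whatever `ibv_devinfo` reports on your nodes:

```shell
# Inspect the RDMA devices the kernel has registered
ibv_devinfo            # per-HCA capabilities, firmware, port state
rdma link show         # mapping between RDMA devices and netdevs

# On node A (server side): listen for a bandwidth test on HCA mlx5_0
ib_write_bw -d mlx5_0 --report_gbits

# On node B (client side): connect to node A and stream RDMA writes
ib_write_bw -d mlx5_0 --report_gbits 192.0.2.10
```

If the reported bandwidth is far below line rate, that is the moment to look at PFC counters and congestion-control settings before blaming anything higher in the stack.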
Layer 2: Kubernetes production variant
Why this is the production answer: every operator running AI training at scale wants multi-tenant sharing, scheduled workloads, declarative config, and rolling upgrades. Kubernetes provides all four. The cost is operational complexity.
The production stack:
| Layer | Component | What it does |
|---|---|---|
| Hardware | RDMA NICs with SR-IOV enabled (64+ VFs) | Hardware-isolates each pod's NIC traffic |
| Driver | NVIDIA OFED / GPU Operator | Loads RDMA + GPU drivers consistently |
| VF management | SR-IOV Network Operator | Provisions and tracks VFs across nodes |
| Multi-NIC | Multus CNI | Attaches multiple network interfaces to one pod |
| NIC attach | SR-IOV CNI plugin | Moves a VF into a pod's network namespace |
| GPU attach | NVIDIA container hook | Exposes GPUs to the container runtime |
| Scheduling | K8s + NFD + (optionally) Volcano / Kueue | Places GPU + NIC + memory affinity together |
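To make the VF-management row concrete, here is a sketch of an SR-IOV Network Operator node policy that carves 8 RDMA-capable VFs out of one PF. The policy name, namespace, resource name, and PF name (`ens1f0`) are illustrative assumptions; the field names are the operator's actual API:

```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: rail-0-policy              # hypothetical name
  namespace: sriov-network-operator
spec:
  resourceName: rail_0             # becomes an allocatable k8s resource
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  numVfs: 8                        # VFs created on this PF
  nicSelector:
    pfNames: ["ens1f0"]            # assumed PF name for rail 0
  deviceType: netdevice
  isRdma: true                     # expose the VFs as RDMA devices
```

One such policy per rail (per PF) is a common pattern; the `resourceName` is what pods later request in their resource limits.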
The pod spec ends up looking like:
```yaml
metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks: |
      [
        {"name": "sriov-rail-0", "interface": "net1"},
        {"name": "sriov-rail-1", "interface": "net2"},
        ...
        {"name": "sriov-rail-7", "interface": "net8"}
      ]
spec:
  containers:
  - resources:
      limits:
        nvidia.com/gpu: 8
        rdma/rdma_shared_device_a: 8
```
8 VFs (one per rail) attached as net1–net8, 8 GPUs requested. The container has the same NIC visibility it would on bare metal — but inside a pod, with k8s scheduling and isolation around it.
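Each `sriov-rail-N` name in that annotation resolves to a NetworkAttachmentDefinition, which is what tells Multus which CNI plugin and which device-plugin resource pool to use. A sketch for rail 0, with the resource name and CNI config values as assumptions:

```yaml
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: sriov-rail-0               # matches the name in the pod annotation
  annotations:
    # ties this attachment to a VF pool advertised by the device plugin
    k8s.v1.cni.cncf.io/resourceName: nvidia.com/rail_0   # assumed resource
spec:
  config: '{
    "cniVersion": "0.3.1",
    "type": "sriov",
    "name": "sriov-rail-0"
  }'
```

At pod start, the scheduler allocates a VF from the pool, Multus invokes the SR-IOV CNI plugin, and the plugin moves that VF into the pod's network namespace as `net1`.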
If your stack differs
The same curriculum content still applies, with different wiring:
- Bare-metal only → Skip Multus / SR-IOV CNI. RDMA + GPU concepts unchanged.
- VM-based (KVM with SR-IOV passthrough) → Same VFs, attached at the hypervisor instead of by k8s.
- Cloud-managed (AWS EFA, Azure HPC SKU, GCP A3) → The cloud built the cluster. You don't see PFC tuning or rail topology, but the concepts (verbs, QPs, AllReduce, tail latency) still rule.
- OpenShift → Adds OpenShift Network Operator and SR-IOV resources, but underlying mechanics unchanged.
The principle: once you know the bare-metal model, every other deployment model is "the same plus an isolation layer."
Why this combination is the production pick
- k8s + Multus + SR-IOV is the most-deployed stack for on-prem AI training at scale (Azure, Oracle, NVIDIA reference designs).
- GPU Operator / Network Operator codify hard-won knowledge — they handle the driver / VF / device-plugin chain that used to take a week to get right.
- The skills transfer. A network engineer who understands this stack can debug Microsoft AKS, Oracle OKE, CoreWeave, and many on-prem setups with the same mental model.
For bare metal: you'll see it in HPC, in academic clusters, and in the first stages of a new on-prem deployment before k8s comes in.
What you should remember
- Learn on bare metal, ship on Kubernetes.
- SR-IOV creates the VFs, Multus attaches them, CNI plugin moves them into the pod.
- GPU Operator + Network Operator are not optional at scale — they automate the driver and inventory chain.
- The pod-spec annotation `k8s.v1.cni.cncf.io/networks` is the entry point for multi-NIC pods. Knowing it well is most of the operational mastery.
- If you're on cloud, the same concepts apply — the provider runs this same stack under the hood.
Next: more sections incoming — Switch QoS (PFC / ECN / DCQCN configuration), Host Networking deep dive, Inference Networking, Production Operations. For now, head back to the curriculum index.