What This Curriculum Picks
Two layers. Bare metal as the teaching baseline — simpler, fewer moving parts, lets the reader see how RDMA + GPU + switch actually behave without orchestration noise. Then Kubernetes + Multus + SR-IOV + GPU Operator as the production variant — what you'll actually deploy.
- Justify the bare-metal-first baseline: the host kernel running `mlx5_core` against the PF, applications on `libibverbs`, and observing verbs, QPs, MRs, PFC, and DCQCN with `ibv_devinfo`, `ibstat`, and `perftest` on one node.
- Read the production stack top to bottom: SR-IOV NICs, NVIDIA OFED / GPU Operator, SR-IOV Network Operator, Multus, the SR-IOV CNI, and NFD / Volcano / Kueue scheduling, plus the `k8s.v1.cni.cncf.io/networks` pod spec wiring 8 VFs to 8 GPUs.
- Remap the same concepts to a different stack, whether bare-metal-only, KVM SR-IOV passthrough, cloud-managed (`EFA`, Azure HPC, `A3`), or OpenShift, treating each as "the bare-metal model plus an isolation layer."
Layer 1: Bare metal foundation
Why teach this first: RDMA and the AI fabric exist independent of Kubernetes. If you don't first understand the wire and the host, k8s networking becomes mystifying.
On a bare-metal node:
- The host kernel runs the RDMA driver against the PF directly: `mlx5_core` with `mlx5_ib` (NVIDIA/Mellanox ConnectX), `bnxt_re` (Broadcom Thor), or `irdma` (Intel E810).
- Applications open RDMA verbs via `libibverbs` against the PF.
- One application has the whole NIC. Simple, fast, debuggable.
- `rdma`, `ibv_devinfo`, `ibstat`, `perftest` (`ib_write_bw`, `ib_read_bw`) all work directly.
This is the model for HPC and for learning. Every concept this curriculum teaches — verbs, QPs, MRs, PFC, DCQCN — you can observe and tune on a single bare-metal node with two NICs and a switch.
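As a minimal sketch of what "observe on a single node" can look like with nothing beyond the kernel, the snippet below walks the `/sys/class/infiniband` tree that the RDMA subsystem exposes and reports each port's state, link layer, and rate. The function name `list_rdma_ports` and its `root` parameter are illustrative, not part of any library. `ibv_devinfo` and `ibstat` report the same information in more detail.

```python
from pathlib import Path


def list_rdma_ports(root="/sys/class/infiniband"):
    """Return one dict per RDMA port found under the sysfs tree at `root`.

    `root` is a parameter only so the function can be pointed at a copy of
    the tree for testing; on a real node the default is the kernel's path.
    """
    ports = []
    base = Path(root)
    if not base.is_dir():
        return ports  # no RDMA devices registered, or the driver is not loaded

    for dev in sorted(base.iterdir()):
        ports_dir = dev / "ports"
        if not ports_dir.is_dir():
            continue
        for port in sorted(ports_dir.iterdir()):
            def read(name, port=port):
                # Each attribute is a small text file; a missing one yields None.
                f = port / name
                return f.read_text().strip() if f.is_file() else None

            ports.append({
                "device": dev.name,               # e.g. the name ibv_devinfo shows as hca_id
                "port": port.name,
                "state": read("state"),           # kernel text such as "4: ACTIVE"
                "link_layer": read("link_layer"), # "Ethernet" for RoCE, "InfiniBand" for IB
                "rate": read("rate"),
            })
    return ports


if __name__ == "__main__":
    for p in list_rdma_ports():
        print(p)
```

On a node with no RDMA driver loaded the function returns an empty list rather than raising, which makes it usable as a quick preflight check before running `perftest`.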
Layer 2: Kubernetes production variant
Why this is the production answer: every operator running AI training at scale wants multi-tenant sharing, scheduled workloads, declarative config, and rolling upgrades. Kubernetes provides all four. The cost is operational complexity.
The production stack:
| Layer | Component | What it does |
|---|---|---|
| Hardware | RDMA NICs with SR-IOV enabled (64+ VFs) | Hardware-isolates each pod's NIC traffic |
| Driver | NVIDIA OFED / GPU Operator | Loads RDMA + GPU drivers consistently |
| VF management | SR-IOV Network Operator | Provisions and tracks VFs across nodes |
| Multi-NIC | Multus CNI | Attaches multiple network interfaces to one pod |
| NIC attach | SR-IOV CNI plugin | Moves a VF into a pod's network namespace |
| GPU attach | NVIDIA Container Toolkit (runtime hook) | Exposes GPUs to the container runtime |
| Scheduling | K8s + NFD + (optionally) Volcano / Kueue | Places GPU + NIC + memory affinity together |
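To make the hardware row concrete, here is a small sketch that reads how many VFs a PF currently has enabled and how many it supports, using the `sriov_numvfs` and `sriov_totalvfs` files the kernel exposes for SR-IOV capable PCI devices. The function name `vf_counts`, the `net_root` parameter, and any interface name you pass in are hypothetical. In the stack above the SR-IOV Network Operator owns VF provisioning, so this sketch only reads and never writes.

```python
from pathlib import Path


def vf_counts(iface, net_root="/sys/class/net"):
    """Return (enabled_vfs, max_vfs) for the PF behind `iface`.

    Returns None when the interface does not exist or its device does not
    expose the SR-IOV sysfs files (for example a loopback or virtual NIC).
    """
    dev = Path(net_root) / iface / "device"
    try:
        enabled = int((dev / "sriov_numvfs").read_text())
        total = int((dev / "sriov_totalvfs").read_text())
    except (FileNotFoundError, NotADirectoryError):
        return None
    return enabled, total
```

A result where the first number is 0 means the NIC supports SR-IOV but no VFs have been created yet, which is a common reason a pod's VF request stays unschedulable.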
The pod spec ends up looking like:
metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks: |
      [
        {"name": "sriov-rail-0", "interface": "net1"},
        {"name": "sriov-rail-1", "interface": "net2"},
        ...
        {"name": "sriov-rail-7", "interface": "net8"}
      ]
spec:
  containers:
  - resources:
      limits:
        nvidia.com/gpu: 8
        rdma/rdma_shared_device_a: 8  # resource name comes from the device plugin config and varies by cluster
8 VFs (one per rail) attached as net1–net8, 8 GPUs requested. The container has the same NIC visibility it would on bare metal — but inside a pod, with k8s scheduling and isolation around it.
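Writing eight near-identical entries by hand invites off-by-one mistakes, so the annotation value is often generated. The following is a minimal sketch of that idea. The function name `rail_networks_annotation` and its parameters are hypothetical; the `sriov-rail-N` network names and `netN` interface names simply follow the example above and must match the network attachment definitions that exist in your cluster.

```python
import json


def rail_networks_annotation(num_rails=8, name_prefix="sriov-rail-"):
    """Build the JSON value for the k8s.v1.cni.cncf.io/networks annotation.

    Rail i (counting from 0) is attached as interface net(i+1), matching the
    naming used in the pod spec above.
    """
    attachments = [
        {"name": f"{name_prefix}{rail}", "interface": f"net{rail + 1}"}
        for rail in range(num_rails)
    ]
    return json.dumps(attachments, indent=2)


if __name__ == "__main__":
    print(rail_networks_annotation())
```

The returned string can be placed under the annotation key as a block scalar, exactly where the hand-written list sits in the example.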
If your stack differs
The same curriculum content still applies, with different wiring:
- Bare-metal only → Skip Multus / SR-IOV CNI. RDMA + GPU concepts unchanged.
- VM-based (KVM with SR-IOV passthrough) → Same VFs, attached at the hypervisor instead of by k8s.
- Cloud-managed (AWS EFA, Azure HPC SKU, GCP A3) → The cloud built the cluster. You don't see PFC tuning or rail topology, but the concepts (verbs, QPs, AllReduce, tail latency) still rule. One caveat: AWS EFA is a custom NIC that runs `libfabric` + its own SRD transport, not standard RoCE v2 verbs, so the wire protocol differs even though the AllReduce-and-tail-latency mental model carries straight over.
- OpenShift → Adds OpenShift Network Operator and SR-IOV resources, but underlying mechanics unchanged.
The principle: once you know the bare-metal model, every other deployment model is "the same plus an isolation layer."
Why this combination is the production pick
- k8s + Multus + SR-IOV is among the most widely deployed stacks for AI training at scale (Azure, Oracle, NVIDIA reference designs).
- GPU Operator / Network Operator codify hard-won knowledge — they handle the driver / VF / device-plugin chain that used to take a week to get right.
- The skills transfer. A network engineer who understands this stack can debug Microsoft AKS, Oracle OKE, CoreWeave, and many on-prem setups with the same mental model.
For bare metal: you'll see it in HPC, in academic clusters, and in the first stages of a new on-prem deployment before k8s comes in.
💡 What you should remember
| # | | Concept | Why it matters |
|---|---|---|---|
| 1 | 🧠 | Learn on bare metal, | ship on Kubernetes. |
| 2 | 🧩 | SR-IOV creates the VFs, | Multus attaches them, CNI plugin moves them into the pod. |
| 3 | 🎛️ | GPU Operator + Network Operator | are not optional at scale — they automate the driver and inventory chain. |
| 4 | 🏷️ | The pod-spec annotation k8s.v1.cni.cncf.io/networks | is the entry point for multi-NIC pods. Knowing it well is most of the operational mastery. |
| 5 | ☁️ | If you're on cloud, the same concepts apply | the provider runs a comparable stack under the hood, though the wire transport can differ (for example, EFA's SRD). |
Next: Inference Networking → — why serving a trained model is a different network problem than training it: latency-critical request/response, KV-cache movement, and a fabric design that diverges from the training fabric.