What This Curriculum Picks

Two layers. Bare metal as the teaching baseline — simpler, fewer moving parts, lets the reader see how RDMA + GPU + switch actually behave without orchestration noise. Then Kubernetes + Multus + SR-IOV + GPU Operator as the production variant — what you'll actually deploy.

Layer 1: Bare metal foundation

Why teach this first: RDMA and the AI fabric exist independent of Kubernetes. If you don't first understand the wire and the host, k8s networking becomes mystifying.

On a bare-metal node:

The host kernel runs the RDMA driver (mlx5_core with mlx5_ib, irdma, etc.) directly against the NIC's physical function (PF).
Applications call the RDMA verbs API through libibverbs against the PF.
One application has the whole NIC. Simple, fast, debuggable.
rdma, ibv_devinfo, ibstat, perftest (ib_write_bw, ib_read_bw) all work directly.
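As a rough illustration of what "work directly" means, the sketch below reads the same sysfs tree that tools such as ibv_devinfo and ibstat report from. It assumes a Linux host where the RDMA driver exposes devices under /sys/class/infiniband. The function name list_rdma_ports is hypothetical, and the returned fields are simply the raw file contents, not a stable API.

```python
from pathlib import Path


def _read(path):
    """Return the stripped contents of a sysfs file, or None if it is absent."""
    return path.read_text().strip() if path.is_file() else None


def list_rdma_ports(sysfs_root="/sys/class/infiniband"):
    """Return one dict per RDMA port found under sysfs_root.

    Assumption: the usual <device>/ports/<n>/{state,link_layer,rate} layout.
    """
    root = Path(sysfs_root)
    ports = []
    if not root.is_dir():
        return ports  # no RDMA driver loaded, or not a Linux host
    for dev in sorted(root.iterdir()):
        ports_dir = dev / "ports"
        if not ports_dir.is_dir():
            continue
        for port in sorted(ports_dir.iterdir()):
            ports.append({
                "device": dev.name,
                "port": port.name,
                "state": _read(port / "state"),
                "link_layer": _read(port / "link_layer"),
                "rate": _read(port / "rate"),
            })
    return ports


if __name__ == "__main__":
    for entry in list_rdma_ports():
        print(entry)
```

On a bare-metal node this lists the PF ports themselves. Run inside a pod, the same code would list only the devices that were moved into that pod.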

This is the model for HPC and for learning. Every concept this curriculum teaches — verbs, QPs, MRs, PFC, DCQCN — you can observe and tune on a single bare-metal node with two NICs and a switch.

Layer 2: Kubernetes production variant

Why this is the production answer: operators running AI training at scale generally want multi-tenant sharing, scheduled workloads, declarative config, and rolling upgrades. Kubernetes provides all four. The cost is operational complexity.

The production stack:

Layer	Component	What it does
Hardware	RDMA NICs with SR-IOV enabled (64+ VFs)	Hardware-isolates each pod's NIC traffic
Driver	NVIDIA OFED (via Network Operator) / GPU Operator	Loads RDMA + GPU drivers consistently
VF management	SR-IOV Network Operator	Provisions and tracks VFs across nodes
Multi-NIC	Multus CNI	Attaches multiple network interfaces to one pod
NIC attach	SR-IOV CNI plugin	Moves a VF into a pod's network namespace
GPU attach	NVIDIA Container Toolkit (runtime hook)	Exposes GPUs to containers through the container runtime
Scheduling	K8s scheduler + Node Feature Discovery (NFD) + (optionally) Volcano / Kueue	Places GPU + NIC + memory affinity together
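To make the "NIC attach" row more concrete, here is a minimal sketch of the object Multus resolves when a pod names a network: a NetworkAttachmentDefinition whose embedded CNI config selects the sriov plugin. The helper name sriov_net_attach_def and the resource name example.com/sriov_rail_0 are hypothetical. In a cluster managed by the SR-IOV Network Operator, these objects are normally generated for you rather than written by hand.

```python
import json


def sriov_net_attach_def(name, resource_name, namespace="default", ipam=None):
    """Build a NetworkAttachmentDefinition manifest as a plain dict.

    resource_name is whatever extended resource the device plugin
    advertises for this rail's VFs (an assumption here, not a fixed name).
    """
    cni_config = {
        "cniVersion": "0.3.1",
        "name": name,
        "type": "sriov",  # tells Multus to invoke the SR-IOV CNI plugin
    }
    if ipam is not None:
        cni_config["ipam"] = ipam  # optional IP address management section
    return {
        "apiVersion": "k8s.cni.cncf.io/v1",
        "kind": "NetworkAttachmentDefinition",
        "metadata": {
            "name": name,
            "namespace": namespace,
            "annotations": {
                # Ties this network to the VF pool it draws from.
                "k8s.v1.cni.cncf.io/resourceName": resource_name,
            },
        },
        # The CNI config is stored as a JSON string, not a nested object.
        "spec": {"config": json.dumps(cni_config)},
    }


if __name__ == "__main__":
    manifest = sriov_net_attach_def("sriov-rail-0", "example.com/sriov_rail_0")
    print(json.dumps(manifest, indent=2))
```

The printed JSON is a valid manifest shape, so it could be piped to kubectl apply -f - on a cluster where Multus and the SR-IOV CNI plugin are installed.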

The pod spec ends up looking like:

metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks: |
      [
        {"name": "sriov-rail-0", "interface": "net1"},
        {"name": "sriov-rail-1", "interface": "net2"},
        ...
        {"name": "sriov-rail-7", "interface": "net8"}
      ]
spec:
  containers:
    - resources:
        limits:
          nvidia.com/gpu: 8
          rdma/rdma_shared_device_a: 8

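Eight VFs (one per rail) are attached as net1 through net8, and eight GPUs are requested. The extended resource name under limits depends on how the device plugin is configured in your cluster: rdma/rdma_shared_device_a is the kind of name the RDMA shared device plugin advertises, while VFs handed out by the SR-IOV device plugin appear under the resource names defined in your SR-IOV policies. The container has the same NIC visibility it would have on bare metal, but inside a pod, with k8s scheduling and isolation around it.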
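Because the annotation value is just a JSON list, it can be generated rather than typed by hand. The sketch below follows the naming pattern used in the pod spec above (sriov-rail-N mapped to net(N+1)). The function name rail_networks_annotation is hypothetical, and the rail count is a parameter, not something the stack dictates.

```python
import json


def rail_networks_annotation(num_rails=8, name_prefix="sriov-rail-", iface_prefix="net"):
    """Return the JSON string for the k8s.v1.cni.cncf.io/networks annotation.

    Rail i (0-based) is attached as interface net<i+1>, matching the
    pattern in the pod spec above.
    """
    networks = [
        {"name": f"{name_prefix}{i}", "interface": f"{iface_prefix}{i + 1}"}
        for i in range(num_rails)
    ]
    return json.dumps(networks, indent=2)


def pod_metadata(num_rails=8):
    """Wrap the annotation in a metadata dict ready to merge into a pod spec."""
    return {
        "annotations": {
            "k8s.v1.cni.cncf.io/networks": rail_networks_annotation(num_rails),
        }
    }


if __name__ == "__main__":
    print(pod_metadata()["annotations"]["k8s.v1.cni.cncf.io/networks"])
```

Generating the list keeps the rail index and the interface name in step, which is an easy thing to get wrong when editing eight near-identical lines.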

If your stack differs

The same curriculum content still applies, with different wiring:

Bare-metal only → Skip Multus / SR-IOV CNI. RDMA + GPU concepts unchanged.
VM-based (KVM with SR-IOV passthrough) → Same VFs, attached at the hypervisor instead of by k8s.
Cloud-managed (AWS EFA, Azure HPC SKU, GCP A3) → The cloud built the cluster. You don't see PFC tuning or rail topology, but the concepts (verbs, QPs, AllReduce, tail latency) still apply.
OpenShift → Adds OpenShift's own network operators and SR-IOV resources, but the underlying mechanics are unchanged.

The principle: once you know the bare-metal model, every other deployment model is "the same plus an isolation layer."

Why this combination is the production pick

k8s + Multus + SR-IOV is among the most widely deployed stacks for self-managed AI training at scale, and the same pattern shows up in Azure, Oracle, and NVIDIA reference designs.
GPU Operator / Network Operator codify hard-won knowledge: they handle the driver / VF / device-plugin chain that used to take a week to get right.
The skills transfer. A network engineer who understands this stack can debug Microsoft AKS, Oracle OKE, CoreWeave, and many on-prem setups with the same mental model.

For bare metal: you'll see it in HPC, in academic clusters, and in the first stages of a new on-prem deployment before k8s comes in.

What you should remember

Learn on bare metal, ship on Kubernetes.
SR-IOV creates the VFs, Multus attaches the extra networks, and the SR-IOV CNI plugin moves each VF into the pod's network namespace.
GPU Operator + Network Operator are not optional at scale — they automate the driver and inventory chain.
The pod-spec annotation k8s.v1.cni.cncf.io/networks is the entry point for multi-NIC pods. Knowing it well is most of the operational mastery.
If you're on cloud, the same concepts apply, even though the provider runs its own fabric and hides most of the wiring from you.
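For the multi-NIC annotation in particular, a common first debugging step is to compare what was requested with what was actually attached. Multus records the result in the pod's k8s.v1.cni.cncf.io/network-status annotation as a JSON list. The sketch below assumes you have already fetched that annotation as a string. The function name missing_interfaces is hypothetical, and the sample data is made up for illustration.

```python
import json


def missing_interfaces(network_status_json, expected):
    """Return the expected interface names absent from the network-status list."""
    attached = {entry.get("interface") for entry in json.loads(network_status_json)}
    return [name for name in expected if name not in attached]


if __name__ == "__main__":
    # Illustrative sample only: a pod where rails 0 and 1 attached but rail 2 did not.
    sample_status = json.dumps([
        {"name": "cluster-default", "interface": "eth0", "default": True},
        {"name": "default/sriov-rail-0", "interface": "net1"},
        {"name": "default/sriov-rail-1", "interface": "net2"},
    ])
    print(missing_interfaces(sample_status, ["net1", "net2", "net3"]))
```

An empty result means every requested rail has a matching interface entry. It says nothing about link state or RDMA health, which still need the bare-metal tools.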

Next: more sections incoming — Switch QoS (PFC / ECN / DCQCN configuration), Host Networking deep dive, Inference Networking, Production Operations. For now, head back to the curriculum index.
