
What This Curriculum Picks

Two layers. Bare metal as the teaching baseline — simpler, fewer moving parts, lets the reader see how RDMA + GPU + switch actually behave without orchestration noise. Then Kubernetes + Multus + SR-IOV + GPU Operator as the production variant — what you'll actually deploy.


Layer 1: Bare metal foundation

Why teach this first: RDMA and the AI fabric exist independently of Kubernetes. If you don't first understand the wire and the host, k8s networking becomes mystifying.

On a bare-metal node:

  • The host kernel runs the RDMA driver (mlx5_core, irdma, etc.) against the PF directly.
  • Applications open RDMA verbs via libibverbs against the PF.
  • One application has the whole NIC. Simple, fast, debuggable.
  • rdma, ibv_devinfo, ibstat, perftest (ib_write_bw, ib_read_bw) all work directly.

This is the model for HPC and for learning. Every concept this curriculum teaches — verbs, QPs, MRs, PFC, DCQCN — you can observe and tune on a single bare-metal node with two NICs and a switch.
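The claim above — that everything is observable on a single bare-metal node — can be exercised with the stock tools already named. A minimal sketch; the device name mlx5_0 and the IP 10.0.0.1 are placeholders for your setup:

```shell
# Enumerate RDMA devices and check port state (mlx5_0 is a placeholder name).
ibv_devinfo -l
ibv_devinfo -d mlx5_0 | grep -E 'state|active_mtu|link_layer'

# perftest bandwidth smoke test: start the server side first, then point
# the client at the server's out-of-band IP (10.0.0.1 is a placeholder).
ib_write_bw -d mlx5_0 --report_gbits              # node A: waits for a client
ib_write_bw -d mlx5_0 --report_gbits 10.0.0.1     # node B: connects to node A
```

Run `ib_read_bw` the same way to measure the read path; both report line-rate-class numbers when the link and PFC/ECN settings are healthy.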


Layer 2: Kubernetes production variant

Why this is the production answer: every operator running AI training at scale wants multi-tenant sharing, scheduled workloads, declarative config, and rolling upgrades. Kubernetes provides all four. The cost is operational complexity.

The production stack:

Layer           Component                                     What it does
Hardware        RDMA NICs with SR-IOV enabled (64+ VFs)       Hardware-isolates each pod's NIC traffic
Driver          NVIDIA OFED / GPU Operator                    Loads RDMA + GPU drivers consistently
VF management   SR-IOV Network Operator                       Provisions and tracks VFs across nodes
Multi-NIC       Multus CNI                                    Attaches multiple network interfaces to one pod
NIC attach      SR-IOV CNI plugin                             Moves a VF into a pod's network namespace
GPU attach      NVIDIA container hook                         Exposes GPUs to the container runtime
Scheduling      K8s + NFD + (optionally) Volcano / Kueue      Places GPU + NIC + memory affinity together
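The "VF management" row is worth demystifying: in production the SR-IOV Network Operator provisions VFs declaratively, but on a lab node you can do the same step by hand through sysfs. A sketch; the interface name ens1f0 is a placeholder, and the 15b3 vendor filter assumes a Mellanox/NVIDIA NIC:

```shell
# Create VFs on a PF through sysfs — the step the SR-IOV Network Operator
# automates per node. ens1f0 is a placeholder interface name.
echo 0 > /sys/class/net/ens1f0/device/sriov_numvfs   # VF count must be reset before changing it
echo 8 > /sys/class/net/ens1f0/device/sriov_numvfs   # create 8 VFs, e.g. one per rail
lspci -d 15b3: | grep -i 'virtual function'          # confirm the VFs enumerated (15b3 = Mellanox vendor ID)
```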

The pod spec ends up looking like:

metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks: |
      [
        {"name": "sriov-rail-0", "interface": "net1"},
        {"name": "sriov-rail-1", "interface": "net2"},
        ...
        {"name": "sriov-rail-7", "interface": "net8"}
      ]
spec:
  containers:
  - resources:
      limits:
        nvidia.com/gpu: 8
        rdma/rdma_shared_device_a: 8

8 VFs (one per rail) attached as net1 through net8, and 8 GPUs requested. The container has the same NIC visibility it would on bare metal — but inside a pod, with k8s scheduling and isolation around it.
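One way to confirm that equivalence from the outside, assuming a running pod and an image that ships the RDMA userspace tools; the pod name train-worker-0 is hypothetical:

```shell
# Inside a pod with 8 VFs attached, the network view matches bare metal.
# train-worker-0 is a hypothetical pod name.
kubectl exec -it train-worker-0 -- ip -br link show    # expect net1..net8 alongside eth0
kubectl exec -it train-worker-0 -- ibv_devinfo -l      # one RDMA device per attached VF
```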


If your stack differs

The same curriculum content still applies, with different wiring:

  • Bare-metal only → Skip Multus / SR-IOV CNI. RDMA + GPU concepts unchanged.
  • VM-based (KVM with SR-IOV passthrough) → Same VFs, attached at the hypervisor instead of by k8s.
  • Cloud-managed (AWS EFA, Azure HPC SKU, GCP A3) → The cloud built the cluster. You don't see PFC tuning or rail topology, but the concepts (verbs, QPs, AllReduce, tail latency) still rule.
  • OpenShift → Adds OpenShift Network Operator and SR-IOV resources, but underlying mechanics unchanged.

The principle: once you know the bare-metal model, every other deployment model is "the same plus an isolation layer."


Why this combination is the production pick

  1. k8s + Multus + SR-IOV is the most-deployed stack for on-prem AI training at scale (Azure, Oracle, NVIDIA reference designs).
  2. GPU Operator / Network Operator codify hard-won knowledge — they handle the driver / VF / device-plugin chain that used to take a week to get right.
  3. The skills transfer. A network engineer who understands this stack can debug Microsoft AKS, Oracle OKE, CoreWeave, and many on-prem setups with the same mental model.

For bare metal: you'll see it in HPC, in academic clusters, and in the first stages of a new on-prem deployment before k8s comes in.


What you should remember

  • Learn on bare metal, ship on Kubernetes.
  • SR-IOV creates the VFs, Multus attaches them, CNI plugin moves them into the pod.
  • GPU Operator + Network Operator are not optional at scale — they automate the driver and inventory chain.
  • The pod-spec annotation k8s.v1.cni.cncf.io/networks is the entry point for multi-NIC pods. Knowing it well is most of the operational mastery.
  • If you're on cloud, the same concepts apply — the provider runs this same stack under the hood.

Next: more sections incoming — Switch QoS (PFC / ECN / DCQCN configuration), Host Networking deep dive, Inference Networking, Production Operations. For now, head back to the curriculum index.