What This Curriculum Picks

Two layers. Bare metal as the teaching baseline — simpler, fewer moving parts, lets the reader see how RDMA + GPU + switch actually behave without orchestration noise. Then Kubernetes + Multus + SR-IOV + GPU Operator as the production variant — what you'll actually deploy.

After this page, you'll be able to
  1. Justify the bare-metal-first baseline: the host kernel runs mlx5_core against the PF, applications use libibverbs, and you can observe verbs, QPs, MRs, PFC, and DCQCN with ibv_devinfo, ibstat, and perftest on one node.
  2. Read the production stack top to bottom — SR-IOV NICs, NVIDIA OFED / GPU Operator, SR-IOV Network Operator, Multus, the SR-IOV CNI, and NFD / Volcano / Kueue scheduling, plus the k8s.v1.cni.cncf.io/networks pod spec wiring 8 VFs to 8 GPUs.
  3. Remap the same concepts to a different stack — bare-metal-only, KVM SR-IOV passthrough, cloud-managed (EFA, Azure HPC, A3), or OpenShift — treating each as "the bare-metal model plus an isolation layer."

Layer 1: Bare metal foundation

Why teach this first: RDMA and the AI fabric exist independent of Kubernetes. If you don't first understand the wire and the host, k8s networking becomes mystifying.

On a bare-metal node:

  • The host kernel runs the RDMA driver against the PF directly: mlx5_core with mlx5_ib (NVIDIA/Mellanox ConnectX), bnxt_re (Broadcom Thor), or irdma (Intel E810).
  • Applications open RDMA verbs via libibverbs against the PF.
  • One application has the whole NIC. Simple, fast, debuggable.
  • rdma, ibv_devinfo, ibstat, perftest (ib_write_bw, ib_read_bw) all work directly.
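The inventory that ibv_devinfo and ibstat print can also be read straight from sysfs. The following is a minimal sketch, assuming a Linux host that exposes RDMA devices under /sys/class/infiniband; the function name list_rdma_devices and the shape of the returned dictionaries are illustrative choices, not part of any library. Note that the driver sysfs reports is the one bound to the PCI function, so on ConnectX it typically shows mlx5_core.

```python
from pathlib import Path


def _read(path):
    """Return the stripped contents of a sysfs attribute, or None if absent."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def list_rdma_devices(sysfs_root="/sys/class/infiniband"):
    """Return one dict per RDMA device found under sysfs_root.

    list_rdma_devices is a hypothetical helper name; the sysfs layout it
    walks (device/driver, device/net, ports/<n>/state) is the standard one.
    """
    root = Path(sysfs_root)
    if not root.is_dir():
        return []  # RDMA stack not loaded, or not a Linux host
    devices = []
    for dev in sorted(root.iterdir()):
        # 'device/driver' is a symlink to the bound PCI driver's directory.
        driver_link = dev / "device" / "driver"
        driver = driver_link.resolve().name if driver_link.exists() else None
        # 'device/net' lists the netdev(s) that share this PCI function.
        net_dir = dev / "device" / "net"
        netdevs = sorted(p.name for p in net_dir.iterdir()) if net_dir.is_dir() else []
        ports = {}
        ports_dir = dev / "ports"
        if ports_dir.is_dir():
            for port in sorted(ports_dir.iterdir()):
                ports[port.name] = {
                    "state": _read(port / "state"),            # e.g. "4: ACTIVE"
                    "link_layer": _read(port / "link_layer"),  # "Ethernet" or "InfiniBand"
                }
        devices.append(
            {"name": dev.name, "driver": driver, "netdevs": netdevs, "ports": ports}
        )
    return devices


if __name__ == "__main__":
    for d in list_rdma_devices():
        print(d)
```

On a host with no RDMA devices the function returns an empty list, which makes it safe to run as a first sanity check before reaching for perftest.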

This is the model for HPC and for learning. Every concept this curriculum teaches — verbs, QPs, MRs, PFC, DCQCN — you can observe and tune on a single bare-metal node with two NICs and a switch.
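As a concrete starting point for that kind of observation, here is a hedged sketch that builds the ib_write_bw command lines for a server/client pair. The helper name ib_write_bw_cmd, the device name mlx5_0, the address 10.0.0.1, and the GID index 3 are all placeholders chosen for illustration; which GID index maps to RoCE v2 depends on the host. The flags used (-d, -x, -a, --report_gbits) are standard perftest options.

```python
import shlex


def ib_write_bw_cmd(device, gid_index=None, server_host=None, all_sizes=True):
    """Build an argv list for perftest's ib_write_bw.

    With server_host=None the command starts the listening (server) side;
    with a host it starts the client side that connects to that server.
    """
    argv = ["ib_write_bw", "-d", device, "--report_gbits"]
    if gid_index is not None:
        argv += ["-x", str(gid_index)]  # GID index; selects the RoCE address entry
    if all_sizes:
        argv.append("-a")               # sweep message sizes instead of a single size
    if server_host is not None:
        argv.append(server_host)        # positional argument: server to connect to
    return argv


if __name__ == "__main__":
    # Placeholder device, GID index, and address; adjust for the lab at hand.
    server = ib_write_bw_cmd("mlx5_0", gid_index=3)
    client = ib_write_bw_cmd("mlx5_0", gid_index=3, server_host="10.0.0.1")
    print("server:", shlex.join(server))
    print("client:", shlex.join(client))
```

Run the server command on one NIC and the client command against it from the other; the returned lists can be passed directly to subprocess.run if you want to script the sweep.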


Layer 2: Kubernetes production variant

Why this is the production answer: every operator running AI training at scale wants multi-tenant sharing, scheduled workloads, declarative config, and rolling upgrades. Kubernetes provides all four. The cost is operational complexity.

The production stack:

Layer | Component | What it does
Hardware | RDMA NICs with SR-IOV enabled (64+ VFs) | Hardware-isolates each pod's NIC traffic
Driver | NVIDIA OFED / GPU Operator | Loads RDMA + GPU drivers consistently
VF management | SR-IOV Network Operator | Provisions and tracks VFs across nodes
Multi-NIC | Multus CNI | Attaches multiple network interfaces to one pod
NIC attach | SR-IOV CNI plugin | Moves a VF into a pod's network namespace
GPU attach | NVIDIA container hook | Exposes GPUs to the container runtime
Scheduling | K8s + NFD + (optionally) Volcano / Kueue | Places GPU + NIC + memory affinity together
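What the hardware and VF-management layers produce is visible on the node itself. The sketch below, assuming a Linux host, reads the standard sysfs attributes sriov_totalvfs, sriov_numvfs, and numa_node for a PF and resolves its virtfnN links to VF PCI addresses. The helper name sriov_status and the returned dictionary layout are hypothetical.

```python
from pathlib import Path


def _read_int(path):
    """Read an integer sysfs attribute; return None if missing or malformed."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def sriov_status(pf_netdev, sysfs_root="/sys/class/net"):
    """Summarise SR-IOV state for one PF netdev (hypothetical helper).

    total_vfs : how many VFs the device can expose
    num_vfs   : how many are currently created
    numa_node : NUMA node of the PF (-1 means the platform reports none)
    """
    dev = Path(sysfs_root) / pf_netdev / "device"
    # Each created VF appears as a 'virtfnN' symlink to its PCI device.
    # Sort numerically so virtfn10 follows virtfn2.
    links = sorted(dev.glob("virtfn*"), key=lambda p: int(p.name[len("virtfn"):]))
    return {
        "pf": pf_netdev,
        "total_vfs": _read_int(dev / "sriov_totalvfs"),
        "num_vfs": _read_int(dev / "sriov_numvfs"),
        "numa_node": _read_int(dev / "numa_node"),
        "vf_pci_addrs": [p.resolve().name for p in links],
    }


if __name__ == "__main__":
    # "eth0" is a placeholder PF name.
    print(sriov_status("eth0"))
```

Comparing numa_node across a PF and the GPU it is paired with is a quick way to confirm the affinity that the scheduling layer is supposed to preserve.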

The pod spec ends up looking like:

metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks: |
      [
        {"name": "sriov-rail-0", "interface": "net1"},
        {"name": "sriov-rail-1", "interface": "net2"},
        ...
        {"name": "sriov-rail-7", "interface": "net8"}
      ]
spec:
  containers:
  - resources:
      limits:
        nvidia.com/gpu: 8
        rdma/rdma_shared_device_a: 8

8 VFs (one per rail) attached as net1 through net8, 8 GPUs requested. The container has the same NIC visibility it would on bare metal, but inside a pod, with k8s scheduling and isolation around it.
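Writing eight near-identical attachment entries by hand invites off-by-one mistakes, so the annotation value is often generated. A minimal sketch follows, assuming the naming pattern shown in the pod spec above (sriov-rail-N mapped to net(N+1)); the function name rail_networks_annotation and its parameters are illustrative, and the NetworkAttachmentDefinition names must match whatever is actually defined in your cluster.

```python
import json


def rail_networks_annotation(num_rails=8, name_prefix="sriov-rail-"):
    """Return the JSON string for the k8s.v1.cni.cncf.io/networks annotation.

    Rail i (0-based) is attached as interface net(i+1), matching the
    sriov-rail-0 -> net1 ... sriov-rail-7 -> net8 pattern in the pod spec.
    """
    attachments = [
        {"name": f"{name_prefix}{i}", "interface": f"net{i + 1}"}
        for i in range(num_rails)
    ]
    return json.dumps(attachments, indent=2)


if __name__ == "__main__":
    print(rail_networks_annotation())
```

The returned string can be placed under metadata.annotations in a pod template; keeping the rail count as a parameter makes the same helper usable for 4-rail or 8-rail nodes.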


If your stack differs

The same curriculum content still applies, with different wiring:

  • Bare-metal only → Skip Multus / SR-IOV CNI. RDMA + GPU concepts unchanged.
  • VM-based (KVM with SR-IOV passthrough) → Same VFs, attached at the hypervisor instead of by k8s.
  • Cloud-managed (AWS EFA, Azure HPC SKU, GCP A3) → The cloud built the cluster. You don't see PFC tuning or rail topology, but the concepts (verbs, QPs, AllReduce, tail latency) still rule. One caveat: AWS EFA is a custom NIC that runs libfabric + its own SRD transport, not standard RoCE v2 verbs — so the wire protocol differs even though the AllReduce-and-tail-latency mental model carries straight over.
  • OpenShift → Adds OpenShift Network Operator and SR-IOV resources, but underlying mechanics unchanged.

The principle: once you know the bare-metal model, every other deployment model is "the same plus an isolation layer."


Why this combination is the production pick

  1. k8s + Multus + SR-IOV is among the most widely deployed stacks for AI training at scale, both on-prem and in provider reference designs (Azure, Oracle, NVIDIA).
  2. GPU Operator / Network Operator codify hard-won knowledge — they handle the driver / VF / device-plugin chain that used to take a week to get right.
  3. The skills transfer. A network engineer who understands this stack can debug Microsoft AKS, Oracle OKE, CoreWeave, and many on-prem setups with the same mental model.

For bare metal: you'll see it in HPC, in academic clusters, and in the first stages of a new on-prem deployment before k8s comes in.


💡 What you should remember

# | Concept | Why it matters
1 | 🧠 Learn on bare metal, | ship on Kubernetes.
2 | 🧩 SR-IOV creates the VFs, | Multus attaches them, CNI plugin moves them into the pod.
3 | 🎛️ GPU Operator + Network Operator | are not optional at scale: they automate the driver and inventory chain.
4 | 🏷️ The pod-spec annotation k8s.v1.cni.cncf.io/networks | is the entry point for multi-NIC pods. Knowing it well is most of the operational mastery.
5 | ☁️ If you're on cloud, the same concepts apply | the provider runs a comparable stack under the hood.

Next: Inference Networking → — why serving a trained model is a different network problem than training it: latency-critical request/response, KV-cache movement, and a fabric design that diverges from the training fabric.