Deployment Models
You have the hardware. You have the protocols. You have the topology. The remaining question: how do the GPUs and NICs actually become available to the training application?
There are five answers, ranging from "physical server with nothing in the way" to "click a button in the AWS console." Each has tradeoffs. Most large operators use two or three of them simultaneously.
The five deployment models
1. Bare metal — nothing in the way
The simplest. The training framework runs directly on the host OS (Linux). The NIC and GPU are exposed via kernel drivers (RDMA core, NVIDIA driver). No virtualization, no containers, no orchestration.
Pros:
- Lowest overhead — no virt tax, no container abstraction
- Easiest to debug — no extra layer between the application and the hardware
- Proven and well understood at HPC sites and academic clusters
Cons:
- Hard to share — one job per server, full reservation
- Hard to update — kernel upgrades require taking servers down
- Not multi-tenant — you can't safely run two jobs on one server
Where you see it: HPC, academic research clusters, very small teams.
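On bare metal, the quickest sanity check that the RDMA stack is actually loaded is to look at sysfs. A minimal sketch (the sysfs path is the standard one exposed by rdma-core drivers; the parameter exists only so the function can be tested against a fake directory):

```python
from pathlib import Path

def list_rdma_devices(sysfs_root: str = "/sys/class/infiniband") -> list[str]:
    """Return the RDMA device names the kernel has registered.

    On a bare-metal host, each entry under /sys/class/infiniband
    (e.g. mlx5_0) is an RDMA-capable NIC exposed by the kernel driver.
    An empty list usually means the driver is not loaded.
    """
    root = Path(sysfs_root)
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.iterdir())
```

If this returns nothing on a host that should have RDMA NICs, check that the rdma-core modules and the vendor driver are loaded before debugging anything higher in the stack.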
2. VM with SR-IOV passthrough
The hypervisor (KVM, VMware, Hyper-V) runs the host. The VM gets direct access to a virtual function of the NIC and to the GPU via PCIe passthrough. From inside the VM, it looks like a bare-metal server.
Pros:
- Standard cloud-style isolation
- VMs can be snapshotted and migrated, though passthrough devices usually have to be detached first, so live migration is limited
- Easy to share hardware across tenants
Cons:
- VM tax — even with SR-IOV, there's some overhead
- More moving parts (hypervisor, VM, guest kernel, all running RDMA driver)
- Setup complexity — IOMMU, ACS, BIOS settings, hugepages
Where you see it: Most public clouds (AWS, Azure, GCP) under the hood. Some on-prem HPC providers.
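Two of the prerequisites above are easy to verify from `/proc` before blaming the hypervisor. A sketch that parses the kernel command line and meminfo text (the exact IOMMU flags vary by platform and BIOS; these are the common ones, and the functions take strings so they can be tested without a live host):

```python
def iommu_enabled(cmdline: str) -> bool:
    """Check kernel command-line text (contents of /proc/cmdline) for
    flags commonly used to enable the IOMMU, which PCIe/VF passthrough
    requires. Not exhaustive: some platforms enable it by default."""
    params = cmdline.split()
    return ("intel_iommu=on" in params
            or "amd_iommu=on" in params
            or "iommu=pt" in params)

def hugepages_reserved(meminfo: str) -> int:
    """Parse /proc/meminfo text and return HugePages_Total, the number
    of reserved hugepages that DPDK/RDMA-heavy guests often need."""
    for line in meminfo.splitlines():
        if line.startswith("HugePages_Total:"):
            return int(line.split()[1])
    return 0
```

On a real host you would feed these `open("/proc/cmdline").read()` and `open("/proc/meminfo").read()`; ACS and BIOS settings still have to be checked out-of-band.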
3. Container on bare metal
Docker / Podman / containerd directly on the host. The container runs in the host's network namespace (or has its own with Multus). NIC is accessed via host kernel drivers.
Pros:
- Faster than VMs (no hypervisor)
- Easier sharing than bare metal
- Familiar ops model for cloud-native teams
Cons:
- Less isolation than VMs (shared kernel)
- Network namespace handling can be tricky for RDMA
- Not orchestrated by default — you manage scheduling yourself
Where you see it: Smaller orgs that don't need k8s; some HPC sites running containerized MPI jobs.
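The tricky parts of this model show up in the container invocation itself: the RDMA character devices must be passed in, the container needs IPC_LOCK to pin memory for registration, and host networking sidesteps the namespace issues. A sketch that assembles a typical `docker run` command (the device paths are the usual mlx5-style ones and may differ on your host):

```python
def rdma_container_cmd(image: str, host_net: bool = True) -> list[str]:
    """Build a docker run invocation for an RDMA-capable container.

    --cap-add=IPC_LOCK lets the RDMA library pin (mlock) memory for
    registration; the /dev/infiniband devices expose verbs and
    connection management to the container.
    """
    cmd = [
        "docker", "run", "--rm",
        "--cap-add=IPC_LOCK",
        "--device=/dev/infiniband/uverbs0",
        "--device=/dev/infiniband/rdma_cm",
    ]
    if host_net:
        # Run in the host network namespace; with an isolated
        # namespace you would need Multus-style plumbing instead.
        cmd.append("--net=host")
    cmd.append(image)
    return cmd
```

With `host_net=False` you are back to the namespace problems mentioned above, which is exactly what the Kubernetes model below exists to manage.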
4. Kubernetes
The dominant pattern for AI training in 2026. Containers run in pods, scheduled by Kubernetes, with networking handled by CNI plugins.
For RDMA specifically, the pod needs:
- A second network interface for the RDMA traffic (the first is for k8s control plane / Pod CIDR).
- An SR-IOV Virtual Function (VF) passed through to the pod, via the SR-IOV CNI plugin.
- Multus to attach multiple network interfaces to one pod.
- NVIDIA GPU Operator + Network Operator to manage the drivers and the VF inventory.
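The four requirements above meet in the pod spec. A sketch that builds one as a plain dict: the Multus annotation key is standard, but the attachment name and the VF resource name are placeholders that must match your NetworkAttachmentDefinition and your SR-IOV device plugin configuration.

```python
def rdma_pod_manifest(name: str, image: str,
                      rdma_net: str = "sriov-rdma-net",
                      vf_resource: str = "nvidia.com/sriov_rdma") -> dict:
    """Sketch of a pod spec for an RDMA training job.

    rdma_net and vf_resource are placeholder names, not defaults that
    exist in any cluster; set them to match your own operator config.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": name,
            # Multus attaches this secondary interface for RDMA traffic;
            # eth0 stays on the default cluster (Pod CIDR) network.
            "annotations": {"k8s.v1.cni.cncf.io/networks": rdma_net},
        },
        "spec": {
            "containers": [{
                "name": "trainer",
                "image": image,
                "resources": {"limits": {
                    "nvidia.com/gpu": 8,  # advertised by the GPU Operator
                    vf_resource: 1,       # one SR-IOV VF for this pod
                }},
                # RDMA libraries pin memory for registration.
                "securityContext": {"capabilities": {"add": ["IPC_LOCK"]}},
            }],
        },
    }
```

The scheduler only places the pod on a node whose device plugins report a free GPU and a free VF, which is how the VF inventory managed by the Network Operator becomes a schedulable resource.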
Pros:
- The closest thing to a standard for AI workloads
- Multi-tenant, multi-job, GPU sharing
- Huge ecosystem (operators, schedulers, queueing systems)
Cons:
- A lot of moving parts — Operator, CNI, Multus, SR-IOV, all have to be configured correctly
- Network debugging is genuinely hard
- Kernel + Operator + driver versions all have to align
Where you see it: Azure, GCP, Oracle (OKE), most enterprise AI clusters, every NVIDIA reference architecture (DGX BasePOD/SuperPOD).
5. Cloud-managed (EFA, A3, Azure HPC SKUs)
You don't deploy anything. You rent the GPUs from a cloud provider, who has already built the cluster and exposed it via their managed service.
- AWS EFA (Elastic Fabric Adapter) — SRD-based, libfabric API. Runs on EC2 P4/P5/Trn1.
- Google A3 / TPU pods — Falcon-based on A3 VMs; ICI on TPU pods.
- Azure HPC SKUs — InfiniBand-based, with RoCE-based variants, for the ND-series.
- Oracle / IBM / Lambda / CoreWeave — varies by provider; usually IB or RoCE v2.
Pros:
- Zero infrastructure work — somebody else built it
- Elastic — scale up for one training run, scale down afterward
- Newest hardware first — clouds often have H200 / B100 before on-prem
Cons:
- Expensive at sustained utilization (>50% of the time)
- Network is the provider's design — you don't tune PFC, you don't pick QoS classes
- Vendor lock-in on the fabric API
Where you see it: Startups, research orgs without infra teams, burst capacity for large enterprises.
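The "expensive at sustained utilization" claim is just break-even arithmetic: owning has a roughly fixed hourly cost (capex amortized over the hardware's life, plus opex), while renting scales with how many hours you actually run. A sketch with hypothetical prices, not quotes:

```python
def breakeven_utilization(cloud_hourly: float,
                          onprem_capex: float,
                          onprem_hourly_opex: float,
                          amort_years: float = 3.0) -> float:
    """Utilization fraction above which owning beats renting.

    Spreads capex evenly over amort_years, adds hourly opex (power,
    cooling, staff), and compares against the cloud's hourly rate.
    All inputs are hypothetical; real pricing has discounts, egress,
    and residual value that this ignores.
    """
    hours = amort_years * 365 * 24
    onprem_hourly = onprem_capex / hours + onprem_hourly_opex
    # Cloud cost = utilization * hours * cloud_hourly;
    # on-prem cost accrues regardless of utilization.
    return onprem_hourly / cloud_hourly
```

For example, a GPU that rents at $4/hr against $42,048 of capex (amortized over 3 years) plus $0.40/hr opex breaks even at 50% utilization, which is where the >50% rule of thumb above comes from under these assumed numbers.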
How to pick
For most teams, the question is between bare metal, Kubernetes on-prem, and cloud-managed:
| Scale | Sustained utilization | Likely pick |
|---|---|---|
| <100 GPUs | Low (research, prototyping) | Cloud-managed |
| <100 GPUs | High | K8s on-prem or bare metal |
| 100–10K GPUs | High | K8s on-prem (the sweet spot) |
| 10K–100K GPUs | Very high (frontier training) | K8s on-prem, often co-designed with the hardware vendor |
| 10K+ GPUs | Bursty | Cloud or hybrid |
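The table above can be encoded as a rough heuristic. A sketch only: it mirrors the table rows and treats "bursty" as favoring cloud at any scale above 100 GPUs, while real decisions also weigh team skills, data gravity, and procurement timelines.

```python
def likely_pick(gpus: int, utilization: str) -> str:
    """Rough deployment-model heuristic from the decision table.

    utilization: "low", "high", "very high", or "bursty".
    """
    if gpus < 100:
        return "cloud-managed" if utilization == "low" else "k8s on-prem or bare metal"
    if utilization == "bursty":
        return "cloud or hybrid"
    if gpus <= 10_000:
        return "k8s on-prem"
    # Frontier scale: the fabric is usually co-designed with the vendor.
    return "k8s on-prem, co-designed with the hardware vendor"
```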
What this curriculum picks
The next two pages cover the on-prem Kubernetes stack in detail because:
- It's where the network engineer's job is most visible.
- It's the dominant pattern at the scales where this curriculum's audience operates.
- Cloud-managed clusters hide the network — they're real but they don't teach you anything about fabric design.
If you're on cloud, the concepts still apply — the cloud provider runs Multus / SR-IOV under the hood. You just don't see it directly.
What you should remember
- Bare metal = lowest overhead, hardest to share. HPC and small teams.
- VM with SR-IOV = cloud-style isolation with near-bare-metal NIC performance. Public clouds use this.
- K8s on bare metal = the dominant pattern. SR-IOV CNI + Multus + GPU Operator.
- Cloud-managed = no fabric work for you; the provider built it.
- The right pick depends on scale and utilization — not on which sounds best.
Next: Host Networking → — PF vs VF, SR-IOV, Multus, GPU Operator. How RDMA reaches the application inside a pod.