
Deployment Models

You have the hardware. You have the protocols. You have the topology. The remaining question: how do the GPUs and NICs actually become available to the training application?

There are five answers, ranging from "physical server with nothing in the way" to "click a button in the AWS console." Each has tradeoffs. Most large operators use two or three of them simultaneously.


The five deployment models

1. Bare metal — nothing in the way

The simplest. The training framework runs directly on the host OS (Linux). The NIC and GPU are exposed via kernel drivers (rdma-core, the NVIDIA driver). No virtualization, no containers, no orchestration.
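
As a concrete check, the sketch below shows what "nothing in the way" means in practice: the application can see the hardware directly through sysfs and the vendor CLIs. It assumes a Linux host with rdma-core and the NVIDIA driver installed; adjust the paths if your stack differs.

```python
#!/usr/bin/env python3
"""Sketch: enumerate the RDMA NICs and GPUs visible on a bare-metal host.

Assumes rdma-core (which populates /sys/class/infiniband) and the NVIDIA
driver (which ships nvidia-smi) are installed.
"""
import os
import subprocess

# RDMA devices appear under /sys/class/infiniband once the NIC driver loads.
ib_sysfs = "/sys/class/infiniband"
if os.path.isdir(ib_sysfs):
    for dev in sorted(os.listdir(ib_sysfs)):
        print(f"RDMA device: {dev}")
else:
    print("No RDMA devices found (is rdma-core / the NIC driver loaded?)")

# GPUs: nvidia-smi -L prints one line per GPU with its UUID.
try:
    out = subprocess.run(["nvidia-smi", "-L"],
                         capture_output=True, text=True, check=True)
    print(out.stdout.strip())
except (FileNotFoundError, subprocess.CalledProcessError):
    print("nvidia-smi not available (is the NVIDIA driver installed?)")
```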

Pros:

  • Lowest overhead — no virt tax, no container abstraction
  • Easiest to debug — no extra layer between the application and the hardware
  • Proven — the traditional model at HPC sites and academic clusters

Cons:

  • Hard to share — one job per server, full reservation
  • Hard to update — kernel upgrades require taking servers down
  • Not multi-tenant — you can't safely run two jobs on one server

Where you see it: HPC, academic research clusters, very small teams.


2. VM with SR-IOV passthrough

A hypervisor (KVM, VMware, Hyper-V) runs on the host. The VM gets direct access to a virtual function (VF) of the NIC, and to the GPU via PCIe passthrough. From inside the VM, it looks like a bare-metal server.
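
Before a VF can be handed to a VM, the host needs SR-IOV enabled on the NIC and the IOMMU active. The sketch below checks both through standard Linux sysfs paths; the interface name eth2 is a placeholder for your PF's netdev.

```python
#!/usr/bin/env python3
"""Sketch: host-side checks before handing an SR-IOV VF to a VM.

Assumes a Linux hypervisor host; "eth2" is a placeholder for the
RDMA-capable NIC's physical function (PF) netdev.
"""
from pathlib import Path

nic = "eth2"  # placeholder: your PF's netdev name
dev = Path(f"/sys/class/net/{nic}/device")

total = dev / "sriov_totalvfs"   # max VFs the PF supports
numvfs = dev / "sriov_numvfs"    # VFs currently enabled
if total.exists():
    print(f"{nic}: {numvfs.read_text().strip()} of "
          f"{total.read_text().strip()} VFs enabled")
    # Enabling VFs (as root) is a write to the same file, e.g.:
    #   echo 8 > /sys/class/net/eth2/device/sriov_numvfs
else:
    print(f"{nic}: no SR-IOV capability exposed (driver, firmware, or BIOS)")

# Passthrough also needs the IOMMU up: VT-d / AMD-Vi enabled in BIOS and
# intel_iommu=on (or amd_iommu=on) on the kernel command line. If the
# kernel built no IOMMU groups, VFIO passthrough cannot work.
devices = sorted(Path("/sys/kernel/iommu_groups").glob("*/devices/*"))
if devices:
    print(f"IOMMU active: {len(devices)} devices in groups")
else:
    print("No IOMMU groups: enable IOMMU in BIOS and on the kernel cmdline")
```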

Pros:

  • Standard cloud-style isolation
  • VMs can be migrated (with effort)
  • Easy to share hardware across tenants

Cons:

  • VM tax — even with SR-IOV, there's some overhead
  • More moving parts (hypervisor, VM, guest kernel, each running its own RDMA driver stack)
  • Setup complexity — IOMMU, ACS, BIOS settings, hugepages

Where you see it: Most public clouds (AWS, Azure, GCP) under the hood. Some on-prem HPC providers.


3. Container on bare metal

Docker / Podman / containerd directly on the host. The container runs in the host's network namespace (or has its own with Multus). NIC is accessed via host kernel drivers.
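
As a sketch of what launching such a job looks like, assuming Docker with the NVIDIA Container Toolkit (the image name my-training-image is a placeholder):

```python
#!/usr/bin/env python3
"""Sketch: run a containerized RDMA job on a bare-metal Docker host.

Assumes Docker plus the NVIDIA Container Toolkit; "my-training-image"
is a placeholder.
"""
import subprocess

cmd = [
    "docker", "run", "--rm",
    "--network=host",              # share the host network namespace
    "--device=/dev/infiniband",    # expose the RDMA device nodes (uverbs)
    "--cap-add=IPC_LOCK",          # RDMA needs to pin (lock) memory
    "--ulimit", "memlock=-1",      # lift the locked-memory limit
    "--gpus", "all",               # NVIDIA Container Toolkit GPU access
    "my-training-image",           # placeholder image
    "ibv_devinfo",                 # quick check: list RDMA devices inside
]
subprocess.run(cmd, check=True)
```

The --device and memlock settings are the parts people forget: without the device nodes the container can't open the verbs interface, and without IPC_LOCK and a raised memlock limit, RDMA memory registration fails.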

Pros:

  • Lower overhead than VMs (no hypervisor)
  • Easier sharing than bare metal
  • Familiar ops model for cloud-native teams

Cons:

  • Less isolation than VMs (shared kernel)
  • Network namespace handling can be tricky for RDMA
  • Not orchestrated by default — you manage scheduling yourself

Where you see it: Smaller orgs that don't need k8s; some HPC sites running containerized MPI jobs.


4. Kubernetes

The dominant pattern for AI training in 2026. Containers run in pods, scheduled by Kubernetes, with networking handled by CNI plugins.

For RDMA specifically, the pod needs all of the following (a minimal manifest sketch follows the list):

  • A second network interface for the RDMA traffic (the first is for k8s control plane / Pod CIDR).
  • An SR-IOV Virtual Function (VF) passed through to the pod, via the SR-IOV CNI plugin.
  • Multus to attach multiple network interfaces to one pod.
  • NVIDIA GPU Operator + Network Operator to manage the drivers and the VF inventory.
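
Here is that manifest sketch, tying the pieces together. The network attachment name (rdma-net) and the SR-IOV resource name (example.com/sriov_rdma_vf) are placeholders; both are defined by your cluster's Network Operator and device-plugin configuration. The Multus annotation key and the nvidia.com/gpu resource are the standard ones.

```python
#!/usr/bin/env python3
"""Sketch: a pod requesting GPUs plus an SR-IOV VF via Multus.

"rdma-net" and "example.com/sriov_rdma_vf" are placeholders set by your
cluster config. kubectl accepts JSON manifests, so stdlib json suffices.
"""
import json

pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {
        "name": "train-worker-0",
        "annotations": {
            # Multus: attach a second interface backed by the SR-IOV CNI.
            "k8s.v1.cni.cncf.io/networks": "rdma-net",
        },
    },
    "spec": {
        "containers": [{
            "name": "trainer",
            "image": "my-training-image",  # placeholder
            "resources": {
                "limits": {
                    "nvidia.com/gpu": "8",             # GPU device plugin
                    "example.com/sriov_rdma_vf": "1",  # placeholder VF resource
                },
            },
            "securityContext": {
                # RDMA pins memory; IPC_LOCK lifts the memlock restriction.
                "capabilities": {"add": ["IPC_LOCK"]},
            },
        }],
    },
}

print(json.dumps(pod, indent=2))  # pipe into: kubectl apply -f -
```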

Pros:

  • The closest thing to a standard for AI workloads
  • Multi-tenant, multi-job, GPU sharing
  • Huge ecosystem (operators, schedulers, queueing systems)

Cons:

  • A lot of moving parts — Operator, CNI, Multus, SR-IOV, all have to be configured correctly
  • Networking debugging is genuinely hard
  • Kernel + Operator + driver versions all have to align

Where you see it: Azure, GCP, Oracle (OKE), most enterprise AI clusters, every NVIDIA reference architecture (DGX BasePOD/SuperPOD).


5. Cloud-managed (EFA, A3, Azure HPC SKUs)

You don't deploy anything. You rent the GPUs from a cloud provider, who has already built the cluster and exposed it via their managed service.

  • AWS EFA (Elastic Fabric Adapter) — SRD-based, libfabric API. Runs on EC2 P4/P5/Trn1 (verification sketch after this list).
  • Google A3 / TPU pods — Falcon-based on A3 VMs; ICI on TPU pods.
  • Azure HPC SKUs — InfiniBand-based, with RoCE-based variants, for the ND-series.
  • Oracle / IBM / Lambda / CoreWeave — varies by provider; usually IB or RoCE v2.
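
On a managed fabric, about the only verification left to you is confirming the provider's transport is visible inside the VM. A sketch for the EFA case, assuming the AWS EFA software stack is installed (it ships libfabric's fi_info utility):

```python
#!/usr/bin/env python3
"""Sketch: verify the managed fabric from inside a cloud VM.

Assumes an AWS instance with the EFA software stack installed, which
provides libfabric's fi_info utility.
"""
import subprocess

# fi_info lists libfabric providers; "-p efa" filters to the EFA provider.
result = subprocess.run(["fi_info", "-p", "efa"],
                        capture_output=True, text=True)
if result.returncode == 0 and "efa" in result.stdout:
    print("EFA provider available:\n" + result.stdout.strip())
else:
    print("EFA not visible: check the instance type, the EFA drivers, "
          "and that the interface was attached at launch.")
```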

Pros:

  • Zero infrastructure work — somebody else built it
  • Elastic — scale up for one training run, scale down afterward
  • Newest hardware first — clouds often have H200 / B100 before on-prem

Cons:

  • Expensive at sustained utilization (running more than roughly 50% of the time)
  • Network is the provider's design — you don't tune PFC, you don't pick QoS classes
  • Vendor lock-in on the fabric API

Where you see it: Startups, research orgs without infra teams, burst capacity for large enterprises.


How to pick

For most teams, the question is between bare metal, Kubernetes on-prem, and cloud-managed. The table below gives the rule of thumb, and the sketch after it encodes the same logic:

Scale          | Sustained utilization         | Likely pick
<100 GPUs      | Low (research, prototyping)   | Cloud-managed
<100 GPUs      | High                          | K8s on-prem or bare metal
100–10K GPUs   | High                          | K8s on-prem (the sweet spot)
10K–100K GPUs  | Very high (frontier training) | K8s on-prem, often co-designed with the hardware vendor
10K+ GPUs      | Bursty                        | Cloud or hybrid
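
The same rule of thumb as code. A sketch only: the thresholds mirror the table, and the boundary between "high" and "bursty" utilization is a judgment call, not a hard number.

```python
#!/usr/bin/env python3
"""Sketch: the deployment-model table above, encoded as a function."""

def likely_pick(gpus: int, utilization: str) -> str:
    """utilization is one of: 'low', 'high', 'very-high', 'bursty'."""
    if utilization == "bursty":
        # The table lists this row for 10K+ GPUs, but the logic holds
        # at smaller scales too: bursty demand favors renting.
        return "Cloud or hybrid"
    if gpus < 100:
        return ("Cloud-managed" if utilization == "low"
                else "K8s on-prem or bare metal")
    if gpus <= 10_000:
        return "K8s on-prem (the sweet spot)"
    return "K8s on-prem, often co-designed with the hardware vendor"

# Example: a 512-GPU cluster kept busy most of the year.
print(likely_pick(512, "high"))  # -> K8s on-prem (the sweet spot)
```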

What this curriculum picks

The next two pages cover the on-prem Kubernetes stack in detail because:

  1. It's where the network engineer's job is most visible.
  2. It's the dominant pattern at the scales where this curriculum's audience operates.
  3. Cloud-managed clusters hide the network — they're a legitimate option, but they teach you nothing about fabric design.

If you're on cloud, the concepts still apply — the cloud provider runs the same kind of SR-IOV / Multus plumbing under the hood. You just don't see it directly.


What you should remember

  • Bare metal = lowest overhead, hardest to share. HPC and small teams.
  • VM with SR-IOV = cloud-style isolation with near-bare-metal NIC performance. Public clouds use this.
  • K8s on bare metal = the dominant pattern. SR-IOV CNI + Multus + GPU Operator.
  • Cloud-managed = no fabric work for you; the provider built it.
  • The right pick depends on scale and utilization — not on which sounds best.

Next: Host Networking → PF vs VF, SR-IOV, Multus, GPU Operator. How RDMA reaches the application inside a pod.