Provisioning the GPU Host with Config Management
Everything in this section so far — SR-IOV, Multus, NCCL, host-side lossless, multi-rail — assumes the host is already configured. The driver is loaded, OFED is installed, nvidia_peermem is in the kernel, PCIe peer-to-peer works. But someone had to put all that there, on every node, identically, idempotently, and re-converge it after every reboot and reimage.
That "someone" is a config-management module. This page walks one — a Puppet module for GPU hosts — as a concrete worked example, and maps each piece back to the fabric concept it implements.
If you've configured switches with a NOS templating system, this is the same idea applied to the endpoint: declarative desired state, applied by an agent, converging the box to spec.
Watch it happen
Before the file-by-file detail, watch the whole thing run end to end on an 8×H100 box: detection, driver, GPUDirect, NVLink fabric, OFED, the safe reboot, and the verification that proves the host is fabric-ready.
A guided interactive replay covers the same end-to-end provisioning run with a step-by-step panel that follows along in sync with the terminal, giving each step's purpose, why it's needed, and its network analogy.
New to GPU hosts? Why each step exists
If the commands above are unfamiliar, here's the purpose of each step in plain language — with the network analogy. A fresh GPU server is like a switch that booted with no NOS, no config, and its ports admin-down: the silicon is there, but nothing knows how to use it yet.
| # | Step | Why it exists | Network analogy |
|---|---|---|---|
| 0 | Detect | Find out which GPU/NIC is present so the right driver is chosen | show inventory / LLDP before templating |
| 1 | Prepare kernel (nvidia_peermem) | Lets the NIC write directly into GPU memory (GPUDirect RDMA). Skip it and bandwidth ~halves | cut-through vs slow store-and-forward |
| 2 | Driver + CUDA | The GPU's operating system — without it the card is a brick | flashing the NOS image |
| 3 | DCGM, toolkit, udev | Lets containers see the GPU + sets device permissions | mgmt plane + ACLs for the interfaces |
| 4 | Disable PCIe ACS | Removes a PCIe "hairpin" that blocks direct GPU↔NIC traffic | disabling split-horizon so two ports talk directly |
| 5 | Core services | Telemetry, keep-driver-warm, health watchdog | SNMP/streaming telemetry + a watchdog |
| 6 | NVSwitch Fabric Manager | Control plane for the scale-up NVLink fabric inside the box | routing daemon for the leaf switch inside the chassis |
| 7 | MIG | Slice one GPU into isolated tenants | VRFs/VLANs on one physical switch |
| 8 | OFED | Brings up the RDMA NIC — the host half of lossless | the other end of the PFC/ECN handshake |
| 9 | Safe reboot | OFED/MIG need a reboot; do it once, after config settles | reload only after write mem, never mid-write |
| ✓ | Verify | Prove data actually moves at full speed, end to end | post-change ping/iperf validation |
:::tip The two big ones for a network engineer
Step 1 (nvidia_peermem) and Step 8 (OFED) are the data-path steps. Together with Step 4 (ACS off) they decide whether your lossless fabric is actually lossless end-to-end — or silently running at half speed because the host side never agreed to the contract.
:::
The mental model: detect → dispatch → converge
A GPU host module is fact-driven. It does not hardcode "this is an H100 box." It discovers the hardware at runtime and applies the matching stack:
```text
lspci / nvidia-smi         init.pp                vendor pipeline
──────────────────   ──►   ──────────────   ──►   ─────────────────────────────
custom facts               routing logic          drivers · OFED · services
(what GPU? IB?)            (which path?)          (make it fabric-ready)
```
This is exactly LLDP-style discovery + a policy decision + applying config — just on a server instead of a switch.
Stage 1 — Detection (custom facts)
Ruby facts run first and answer three questions:
| Fact | Method | Networking analogy |
|---|---|---|
| What GPU is installed? | parse lspci, match against a known device list | inventory / "what's in this slot" |
| Is it SXM or MIG-capable? | nvidia-smi + device map | port capability / speed-tier detection |
| Is there an RDMA NIC? | detect InfiniBand/Mellanox hardware | "is this an RDMA-capable port" |
The GPU fact returns a structured result the rest of the module keys off:
```ruby
{
  'vendor' => 'nvidia',        # or 'amd'
  'nvidia' => {
    'present'     => true,
    'gpu_family'  => 'H100',
    'sxm_capable' => true,     # has NVSwitch → needs fabric manager
    'mig_capable' => true,     # can be sliced
  },
  'error'  => nil,
}
```
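As a rough illustration of the detection stage, the sketch below builds that structure from the text of `lspci -nn`. It is plain Ruby rather than a real Facter fact, and the names `gpu_fact`, `display_devices`, and `KNOWN_NVIDIA`, the two device-map entries, and the shape returned for an absent or unsupported GPU are all assumptions made for this example, not the module's actual code.

```ruby
# Illustrative device map (PCI device ID => capabilities). These entries are
# assumptions for the sketch; a real module would carry its own list.
KNOWN_NVIDIA = {
  '2330' => { 'gpu_family' => 'H100', 'sxm_capable' => true,  'mig_capable' => true },
  '2331' => { 'gpu_family' => 'H100', 'sxm_capable' => false, 'mig_capable' => true },
}.freeze

NVIDIA_VENDOR_ID = '10de'
AMD_VENDOR_ID    = '1002'

# Return [vendor_id, device_id] pairs for PCI functions whose class is
# display (03xx) or processing accelerator (12xx).
def display_devices(lspci_nn_output)
  lspci_nn_output.each_line.map do |line|
    m = line.match(/\[(?:03|12)\h{2}\]:.*\[(\h{4}):(\h{4})\]/)
    [m[1].downcase, m[2].downcase] if m
  end.compact
end

# Build the structured GPU fact from `lspci -nn` text.
def gpu_fact(lspci_nn_output)
  devices = display_devices(lspci_nn_output)
  nvidia  = devices.find { |vendor, device| vendor == NVIDIA_VENDOR_ID && KNOWN_NVIDIA.key?(device) }

  if nvidia
    { 'vendor' => 'nvidia',
      'nvidia' => { 'present' => true }.merge(KNOWN_NVIDIA[nvidia[1]]),
      'error'  => nil }
  elsif devices.any? { |vendor, _| vendor == AMD_VENDOR_ID }
    { 'vendor' => 'amd', 'error' => nil }
  elsif devices.any? { |vendor, _| vendor == NVIDIA_VENDOR_ID }
    # Hardware is there but unknown to the map: report it, don't raise.
    { 'vendor' => nil, 'error' => 'NVIDIA GPU present but not in the known device list' }
  else
    { 'vendor' => nil, 'error' => nil }
  end
end
```

Keeping the parser a pure function of the `lspci` text means it can be exercised with captured output from any box, without touching real hardware.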
Stage 2 — Dispatch (the entry class)
The entry manifest reads those facts and routes. Critically, it never hard-fails on hardware: an unsupported or absent GPU emits a notice and the node converges to a no-op, exactly like a switch template that skips a feature the platform doesn't support:
```puppet
# Fact names ('gpu', 'has_infiniband') are illustrative.
$gpu = $facts['gpu']

if $facts['has_infiniband'] { include gpu::ofed }   # RDMA NIC present → enable RoCE/IB

case $gpu['vendor'] {
  'nvidia': { contain gpu::nvidia }                  # NVIDIA driver pipeline
  'amd':    { contain gpu::amd }                     # ROCm pipeline
  default:  { notify { 'no supported GPU': } }       # report, don't fail
}
```
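The same routing decision can be written as a pure function, which makes the "report, don't fail" behavior easy to see and to test. This is a minimal Ruby sketch; the `dispatch` helper and its return shape are hypothetical and only mirror the class names used in the manifest.

```ruby
# Decide which classes to apply from the detected facts.
# Returns [classes, notices]; it never raises on missing hardware.
def dispatch(gpu, has_infiniband)
  classes = []
  notices = []

  classes << 'gpu::ofed' if has_infiniband   # RDMA NIC present

  case gpu && gpu['vendor']
  when 'nvidia' then classes << 'gpu::nvidia'
  when 'amd'    then classes << 'gpu::amd'
  else               notices << 'no supported GPU'
  end

  [classes, notices]
end
```

A node with no GPU fact at all, `dispatch(nil, false)`, comes back with an empty class list and a single notice, which is the no-op convergence described above.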
Stage 3 — Converge (the vendor pipeline)
For NVIDIA, the pipeline is a strictly ordered chain: each link must finish before the next starts, the same way you'd never enable PFC before the queue exists:
params → prepare → cuda → install → services
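To make the ordering contract concrete, here is a small Ruby sketch of a chain in which a failed stage causes everything downstream to be skipped rather than attempted. The `converge` helper is hypothetical; in the module itself the ordering would be expressed with Puppet's own relationship syntax.

```ruby
STAGES = %w[params prepare cuda install services].freeze

# Run each stage in order. The block does the work for one stage and
# returns true on success; after the first failure, later stages are skipped.
def converge(stages)
  results = {}
  failed  = false
  stages.each do |name|
    if failed
      results[name] = :skipped
    else
      ok = yield(name)
      results[name] = ok ? :ok : :failed
      failed = !ok
    end
  end
  results
end
```

Calling `converge(STAGES) { |stage| stage != 'cuda' }` marks `params` and `prepare` as `:ok`, `cuda` as `:failed`, and `install` and `services` as `:skipped`.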
The manifest map
This is the heart of the page: each manifest, what it does, and why a network engineer should care.
| Manifest | What it configures | Curriculum link | Why it's really a network concern |
|---|---|---|---|
| facts/gpu.rb | GPU family / SXM / MIG detection | GPU & Server | Endpoint discovery: you can't configure a port you haven't identified |
| params.pp | All OS-specific package names & versions, one place | Linux for NetEng | The "platform abstraction", like per-NOS template branches |
| nvidia/cuda.pp | CUDA driver + libraries | GPU & Server | The base "NOS" of the accelerator |
| nvidia/prepare.pp | modprobe config, dracut, nvidia_peermem | NCCL & GPUDirect | Enables GPUDirect RDMA: the NIC DMAs straight into GPU memory |
| nvidia/services/disable_acs.pp | Disables PCIe ACS | GPU & Server (PCIe) | Opens the PCIe peer-to-peer path GPUDirect rides on |
| ofed.pp | Installs OFED (RDMA NIC stack) | RDMA · RoCE v2 | The host half of "lossless": no OFED, no RoCE |
| nvidia/services/fabric.pp | NVSwitch Fabric Manager (SXM only) | GPU & Server (NVLink) | Control plane for the scale-up fabric inside the box |
| nvidia/services/mig.pp | MIG slicing | Inference Networking | VRF-style tenant isolation of one GPU |
| nvidia/services/dcgm.pp + health.pp | DCGM agent + health watcher | Production Operations | Endpoint telemetry: your "is the link healthy", for the GPU |
| nvidia/services/persistence.pp | nvidia-persistenced | Production Operations | Keeps the driver warm so first-job latency isn't a cold start |
| reboot.pp | Safe, lock-aware reboot handler | Production Operations | Reboots (OFED/MIG need them) without colliding with the config agent |
The three pieces that are network config in disguise
If you read nothing else, read this. Three manifests directly determine whether your lossless fabric is actually lossless end-to-end.
1. OFED — the host half of lossless
You can configure perfect PFC, ECN, and DCQCN on every switch (see Switch QoS). It buys you nothing if the NIC isn't brought up to honor it. OFED is the driver + userspace stack (rdma-core, verbs, the mlx5 driver) that makes the NIC a participant in the lossless contract.
```text
Switch side (you):  PFC · ECN marking (the DCQCN congestion point)
     the wire       ◄──────────────────────────────────────────────►
Host side (OFED):   NIC honors PFC pause · turns ECN marks into CNPs and rate-limits on them · sets DSCP/SL
```
Lossless is a two-ended agreement. OFED is the host signing it.
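One quick way to confirm the host has signed is to check that every RDMA link reports ACTIVE. The sketch below parses the text of `rdma link show`; the helper name `inactive_rdma_links` is hypothetical, and the line layout it expects is an assumption, since the exact output varies between iproute2 versions.

```ruby
# Return the names of RDMA links whose state is anything other than ACTIVE.
# Accepts both "link mlx5_0/1 state ACTIVE ..." and "1/1: mlx5_0/1: state ACTIVE ..."
def inactive_rdma_links(rdma_link_output)
  rdma_link_output.each_line.map do |line|
    m = line.match(/(\S+?):?\s+state\s+(\S+)/)
    m[1] if m && m[2] != 'ACTIVE'
  end.compact
end
```

An empty result means every listed port is up; an empty input also yields an empty result, so pair this with a check that `ibv_devices` lists the NICs you expect.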
2. nvidia_peermem + ACS disable — the GPUDirect data path
Covered in depth in NCCL & GPUDirect, but here's where it gets installed:
- prepare.pp loads nvidia_peermem, the kernel bridge between the NVIDIA driver and rdma-core. Without it, the NIC can't DMA into GPU memory and every transfer bounces through host DRAM (≈ half bandwidth at 400 G+).
- disable_acs.pp turns off PCIe Access Control Services. ACS is a PCIe "split-horizon" that forces peer-to-peer traffic up to the root complex and back. With it on, GPU↔NIC peer-to-peer DMA is broken even if nvidia_peermem is loaded.
Both must be right or GPUDirect silently degrades. This is pure data-path config — it just happens to live in a host manifest instead of a switch template.
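Because the failure is silent, it helps to test both conditions explicitly. The following Ruby sketch inspects captured `lsmod` and `lspci -vvv` text; the helper names are hypothetical, and it assumes the usual `ACSCtl:` line in which a trailing `+` marks an enabled ACS feature.

```ruby
# True when the nvidia_peermem module is listed in `lsmod` output.
# Only the first column is compared, so "Used by" mentions don't count.
def peermem_loaded?(lsmod_output)
  lsmod_output.each_line.any? { |line| line.split.first == 'nvidia_peermem' }
end

# PCI addresses whose ACS control register still has a feature enabled,
# taken from `lspci -vvv` text (device headers start at column 0).
def acs_enabled_devices(lspci_vvv_output)
  current = nil
  enabled = []
  lspci_vvv_output.each_line do |line|
    if (m = line.match(/^((?:\h{4}:)?\h{2}:\h{2}\.\h)\s/))
      current = m[1]
    elsif line =~ /ACSCtl:.*\w\+/
      enabled << current
    end
  end
  enabled
end

# Both conditions must hold for the direct GPU↔NIC path.
def gpudirect_ready?(lsmod_output, lspci_vvv_output)
  peermem_loaded?(lsmod_output) && acs_enabled_devices(lspci_vvv_output).empty?
end
```

Returning the offending PCI addresses, rather than a bare boolean, points straight at the bridge that still needs ACS cleared.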
3. Fabric Manager — the other fabric
There are two fabrics in an AI cluster:
```text
┌──────────────────────────────────────────────┐
│ GPU server                                   │
│   GPU ── NVLink ── NVSwitch ── NVLink ── GPU │  ◄── scale-UP fabric
│   │                                          │      (fabric.pp owns this)
│   NIC ── PCIe                                │
└───┼──────────────────────────────────────────┘
    │
  RoCE/IB Ethernet fabric   ◄── scale-OUT fabric (you design this)
```
The scale-out fabric is your spine-leaf RoCE network. The scale-up fabric is NVLink + NVSwitch inside the chassis. NVSwitch is a switch; Fabric Manager is its control plane — it discovers the NVLink topology, programs the routing, and heals around failed links. fabric.pp brings it up, but only on SXM boxes (PCIe cards have no NVSwitch). If you understand a leaf switch, you already understand NVSwitch — this manifest is its ZTP.
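The "SXM only" condition is just a capability check against the detected fact. A minimal Ruby sketch of that selection follows; the `nvidia_services` helper, the service labels, and the opt-in `mig:` flag are assumptions for illustration, not the module's real interface.

```ruby
# Services every NVIDIA host gets in this sketch (labels are illustrative).
BASE_SERVICES = %w[dcgm persistence health].freeze

# Pick services from the detected capabilities.
# Fabric Manager is added only when the box has NVSwitch (SXM);
# MIG is treated as opt-in and only honored on MIG-capable GPUs.
def nvidia_services(nvidia_fact, mig: false)
  services = BASE_SERVICES.dup
  services << 'fabric' if nvidia_fact['sxm_capable']
  services << 'mig'    if mig && nvidia_fact['mig_capable']
  services
end
```

A PCIe card, with `sxm_capable` false, never gets `fabric` in its list, which matches the point that there is no NVSwitch to manage.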
Why the params/dispatch split matters
One design point worth calling out, because it mirrors good NOS templating. All platform differences are funneled into one file (params.pp) behind a case on the OS. Everything else reads abstract variables:
```puppet
# params.pp: the ONLY place OS knowledge lives
case $facts['os']['name'] {
  'OS-A':  { $cuda_pkg = 'cuda' }
  'OS-B':  { $cuda_pkg = 'cuda-open' }
  default: { fail("unsupported OS") }
}

# everywhere else: platform-agnostic
package { $gpu::params::cuda_pkg: ensure => installed }
```
This is the same discipline as keeping per-vendor differences in template variables instead of scattering if junos / if eos through your configs. Add a new OS or GPU in one place; the pipeline doesn't change.
How to verify the host is fabric-ready
The config-management run is "done" — but is the host actually a working RoCE endpoint? The checks, in order:
```shell
# 1. Driver + GPU present
nvidia-smi                            # GPUs enumerate, driver loaded

# 2. GPUDirect bridge loaded
lsmod | grep nvidia_peermem           # must appear

# 3. RDMA stack up (OFED)
ibv_devices                           # NICs listed
rdma link show                        # link state ACTIVE

# 4. PCIe peer-to-peer path is direct
nvidia-smi topo -m                    # GPU↔NIC pairs show PIX/PXB, not SYS

# 5. Scale-up fabric (SXM only)
nvidia-smi -q | grep -i -A 2 fabric   # Fabric section should report Success

# 6. Telemetry
dcgmi discovery -l                    # DCGM sees the GPUs

# 7. The real test, end to end
# Run across two or more nodes to exercise the NICs (near line-rate × NIC count);
# on a single node this measures the NVLink path only.
all_reduce_perf -b 8 -e 1G -g 8
```
If step 7 is half what you expect, walk back up: almost always it's step 2 (nvidia_peermem) or step 4 (ACS not disabled).
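The "walk back up" habit can be automated by running the cheap checks in order and stopping at the first one that fails. This Ruby sketch is illustrative only: the `CHECKS` table, its pass conditions, and the injectable `runner` are assumptions, and the predicates are deliberately crude string tests rather than a full parser.

```ruby
require 'open3'

# [label, command, predicate on the command's output], in the order to check.
CHECKS = [
  ['driver',         'nvidia-smi',         ->(out) { !out.strip.empty? }],
  ['nvidia_peermem', 'lsmod',              ->(out) { out.include?('nvidia_peermem') }],
  ['rdma link',      'rdma link show',     ->(out) { out.include?('state ACTIVE') }],
  ['dcgm',           'dcgmi discovery -l', ->(out) { !out.strip.empty? }],
].freeze

# Run a command and return its combined output, or '' if it fails or is missing.
def run_command(cmd)
  out, status = Open3.capture2e(cmd)
  status.success? ? out : ''
rescue Errno::ENOENT
  ''
end

# Return the label of the first failing check, or nil when all pass.
# The runner is injectable so the logic can be tested without real hardware.
def first_failure(checks = CHECKS, runner = method(:run_command))
  checks.each do |label, cmd, ok|
    return label unless ok.call(runner.call(cmd))
  end
  nil
end
```

On a real host, `first_failure` with no arguments shells out to each tool; in a test, pass a lambda that maps each command to canned output.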
What you should remember
- A GPU host doesn't configure itself. A fact-driven config-management module detects the hardware and converges the box to a fabric-ready state — same shape as NOS templating, applied to the endpoint.
- OFED is the host half of lossless. Your switch QoS is one end of a two-ended agreement; the NIC stack is the other.
- GPUDirect needs two things set up correctly: nvidia_peermem loaded and PCIe ACS disabled. Get either wrong and bandwidth silently halves.
- There are two fabrics. You design the scale-out RoCE network; Fabric Manager runs the scale-up NVLink fabric inside the chassis.
- Push platform differences into one params file. Everything else stays platform-agnostic — the same discipline as good config templating.
Next: Production Operations — what to monitor on these hosts once they're live, and the 3 AM playbooks. Or revisit NCCL & GPUDirect to see the data path this provisioning enables.