
Provisioning the GPU Host with Config Management

Everything in this section so far — SR-IOV, Multus, NCCL, host-side lossless, multi-rail — assumes the host is already configured. The driver is loaded, OFED is installed, nvidia_peermem is in the kernel, PCIe peer-to-peer works. But someone had to put all that there, on every node, identically, idempotently, and re-converge it after every reboot and reimage.

That "someone" is a config-management module. This page walks one — a Puppet module for GPU hosts — as a concrete worked example, and maps each piece back to the fabric concept it implements.

If you've configured switches with a NOS templating system, this is the same idea applied to the endpoint: declarative desired state, applied by an agent, converging the box to spec.

Watch it happen

Before the file-by-file detail, watch the whole thing run end-to-end on an 8×H100 box — detection, driver, GPUDirect, NVLink fabric, OFED, the safe reboot, and the verification that proves it's fabric-ready:

MODULE 12 · LAB 6: Watch the recording (every command, every counter, every output).

Open the guided interactive replay: the same end-to-end provisioning run, with a step-by-step panel that follows along, showing each step's purpose, why it's needed, and the network analogy in sync with the terminal. Space = play/pause, click the scrub bar to jump, click any step card to skip to it.

New to GPU hosts? Why each step exists

If the commands above are unfamiliar, here's the purpose of each step in plain language — with the network analogy. A fresh GPU server is like a switch that booted with no NOS, no config, and its ports admin-down: the silicon is there, but nothing knows how to use it yet.

| # | Step | Why it exists | Network analogy |
|---|------|---------------|-----------------|
| 0 | Detect | Find out which GPU/NIC is present so the right driver is chosen | show inventory / LLDP before templating |
| 1 | Prepare kernel (nvidia_peermem) | Lets the NIC write directly into GPU memory (GPUDirect RDMA). Skip it and bandwidth ~halves | cut-through vs slow store-and-forward |
| 2 | Driver + CUDA | The GPU's operating system; without it the card is a brick | flashing the NOS image |
| 3 | DCGM, toolkit, udev | Lets containers see the GPU + sets device permissions | mgmt plane + ACLs for the interfaces |
| 4 | Disable PCIe ACS | Removes a PCIe "hairpin" that blocks direct GPU↔NIC traffic | disabling split-horizon so two ports talk directly |
| 5 | Core services | Telemetry, keep-driver-warm, health watchdog | SNMP/streaming telemetry + a watchdog |
| 6 | NVSwitch Fabric Manager | Control plane for the scale-up NVLink fabric inside the box | routing daemon for the leaf switch inside the chassis |
| 7 | MIG | Slice one GPU into isolated tenants | VRFs/VLANs on one physical switch |
| 8 | OFED | Brings up the RDMA NIC, the host half of lossless | the other end of the PFC/ECN handshake |
| 9 | Safe reboot | OFED/MIG need a reboot; do it once, after config settles | reload only after write mem, never mid-write |
|   | Verify | Prove data actually moves at full speed, end to end | post-change ping/iperf validation |

:::tip The two big ones for a network engineer
Step 1 (nvidia_peermem) and Step 8 (OFED) are the data-path steps. Together with Step 4 (ACS off) they decide whether your lossless fabric is actually lossless end-to-end, or silently running at half speed because the host side never agreed to the contract.
:::


The mental model: detect → dispatch → converge

A GPU host module is fact-driven. It does not hardcode "this is an H100 box." It discovers the hardware at runtime and applies the matching stack:

lspci / nvidia-smi        init.pp               vendor pipeline
────────────────    ──►   ──────────────  ──►   ─────────────────────────────
custom facts              routing logic         drivers · OFED · services
(what GPU? IB?)           (which path?)         (make it fabric-ready)

This is exactly LLDP-style discovery + a policy decision + applying config — just on a server instead of a switch.

Stage 1 — Detection (custom facts)

Ruby facts run first and answer three questions:

| Fact | Method | Networking analogy |
|------|--------|--------------------|
| What GPU is installed? | parse lspci, match against a known device list | inventory / "what's in this slot" |
| Is it SXM or MIG-capable? | nvidia-smi + device map | port capability / speed-tier detection |
| Is there an RDMA NIC? | detect InfiniBand/Mellanox hardware | "is this an RDMA-capable port" |

The GPU fact returns a structured result the rest of the module keys off:

{
  'vendor' => 'nvidia',        # or 'amd'
  'nvidia' => {
    'present'     => true,
    'gpu_family'  => 'H100',
    'sxm_capable' => true,     # has NVSwitch → needs Fabric Manager
    'mig_capable' => true,     # can be sliced
  },
  'error'  => nil,
}
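As a rough sketch of how such a fact might classify hardware, the function below turns `lspci -nn` text into a hash of the shape shown above. The function name, the device table, and the PCI device IDs in it are illustrative assumptions rather than the module's actual code; check the IDs against the PCI ID database before relying on them. A real custom fact would wrap this logic in a `Facter.add` block and feed it live command output.

```ruby
# Illustrative device map (assumption): PCI device ID => attributes.
KNOWN_NVIDIA_DEVICES = {
  '2330' => { 'gpu_family' => 'H100', 'sxm_capable' => true,  'mig_capable' => true },
  '2331' => { 'gpu_family' => 'H100', 'sxm_capable' => false, 'mig_capable' => true },
}.freeze

# PCI vendor IDs for the two GPU vendors the dispatch stage knows about.
PCI_VENDORS = { '10de' => 'nvidia', '1002' => 'amd' }.freeze

# Classify the GPUs found in `lspci -nn` output.
def classify_gpu(lspci_nn)
  result = { 'vendor' => nil, 'nvidia' => { 'present' => false }, 'error' => nil }
  lspci_nn.each_line do |line|
    # Only display-class devices (PCI class 03xx) are candidates.
    next unless line.match?(/\[03\h{2}\]:/)

    match = line.match(/\[(10de|1002):(\h{4})\]/i)
    next unless match

    vendor = PCI_VENDORS[match[1].downcase]
    result['vendor'] ||= vendor
    next unless vendor == 'nvidia'

    attrs = KNOWN_NVIDIA_DEVICES[match[2].downcase]
    if attrs
      result['nvidia'] = { 'present' => true }.merge(attrs)
    else
      # Unknown device: record it instead of raising, so the node can no-op.
      result['error'] ||= "unknown NVIDIA device #{match[2]}"
    end
  end
  result
end
```

Keeping the parser a pure function of text means it can be unit-tested with captured `lspci` output, without a GPU in the test machine.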

Stage 2 — Dispatch (the entry class)

The entry manifest reads those facts and routes. Critically, it never hard-fails — an unsupported or absent GPU emits a notice and the node converges to a no-op, exactly like a switch template that skips a feature the platform doesn't support:

$gpu = $facts['gpu']                                # structured fact from Stage 1 (fact name illustrative)
if $facts['has_infiniband'] { include gpu::ofed }   # RDMA NIC present → enable RoCE/IB (fact name illustrative)
case $gpu['vendor'] {
  'nvidia': { contain gpu::nvidia }                 # NVIDIA driver pipeline
  'amd':    { contain gpu::amd }                    # ROCm pipeline
  default:  { notify { 'no supported GPU': } }      # report, don't fail
}
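The RDMA-NIC check that gates gpu::ofed in the dispatch above could be backed by a fact along these lines. This is a minimal sketch under assumptions: the function name is hypothetical, and it only looks for Mellanox (PCI vendor 15b3) network-class devices in `lspci -nn` text. An Ethernet-class match means the port could carry RoCE, not that it is configured for it, so a production fact might also confirm that /sys/class/infiniband is populated once the driver is loaded.

```ruby
MELLANOX_VENDOR_ID = '15b3'

# True when `lspci -nn` output lists a Mellanox network device:
# class 0207 (InfiniBand controller) or 0200 (Ethernet controller).
def rdma_nic_present?(lspci_nn)
  lspci_nn.each_line.any? do |line|
    line.match?(/\[02(?:00|07)\]:/) &&
      line.match?(/\[#{MELLANOX_VENDOR_ID}:\h{4}\]/i)
  end
end
```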

Stage 3 — Converge (the vendor pipeline)

For NVIDIA, a strictly ordered chain — each link must finish before the next, the same way you'd never enable PFC before the queue exists:

params → prepare → cuda → install → services

The manifest map

This is the heart of the page: each manifest, what it does, and why a network engineer should care.

| Manifest | What it configures | Curriculum link | Why it's really a network concern |
|----------|--------------------|-----------------|-----------------------------------|
| facts/gpu.rb | GPU family / SXM / MIG detection | GPU & Server | Endpoint discovery: you can't configure a port you haven't identified |
| params.pp | All OS-specific package names & versions, one place | Linux for NetEng | The "platform abstraction", like per-NOS template branches |
| nvidia/cuda.pp | CUDA driver + libraries | GPU & Server | The base "NOS" of the accelerator |
| nvidia/prepare.pp | modprobe config, dracut, nvidia_peermem | NCCL & GPUDirect | Enables GPUDirect RDMA: NIC DMAs straight into GPU memory |
| nvidia/services/disable_acs.pp | Disables PCIe ACS | GPU & Server (PCIe) | Opens the PCIe peer-to-peer path GPUDirect rides on |
| ofed.pp | Installs OFED (RDMA NIC stack) | RDMA · RoCE v2 | The host half of "lossless": no OFED, no RoCE |
| nvidia/services/fabric.pp | NVSwitch Fabric Manager (SXM only) | GPU & Server (NVLink) | Control plane for the scale-up fabric inside the box |
| nvidia/services/mig.pp | MIG slicing | Inference Networking | VRF-style tenant isolation of one GPU |
| nvidia/services/dcgm.pp + health.pp | DCGM agent + health watcher | Production Operations | Endpoint telemetry: your "is the link healthy", for the GPU |
| nvidia/services/persistence.pp | nvidia-persistenced | Production Operations | Keeps the driver warm so first-job latency isn't a cold start |
| reboot.pp | Safe, lock-aware reboot handler | Production Operations | Reboots (OFED/MIG need them) without colliding with the config agent |

The three pieces that are network config in disguise

If you read nothing else, read this. Three manifests directly determine whether your lossless fabric is actually lossless end-to-end.

1. OFED — the host half of lossless

You can configure perfect PFC, ECN, and DCQCN on every switch (see Switch QoS). It buys you nothing if the NIC isn't brought up to honor it. OFED is the driver + userspace stack (rdma-core, verbs, the mlx5 driver) that makes the NIC a participant in the lossless contract.

Switch side (you):  PFC · ECN marking (the DCQCN congestion point)
the wire            ◄──────────────────────────────────────────────►
Host side (OFED):   NIC honors PFC pause · sends and reacts to CNPs (DCQCN notification/reaction points) · sets DSCP/SL

Lossless is a two-ended agreement. OFED is the host signing it.
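A first sanity check that the host has at least shown up to sign: parse `rdma link show` and list any RDMA port that is not ACTIVE. The sketch below assumes the usual iproute2 line format (`link <dev>/<port> ... state <STATE> physical_state ...`), and the function name is hypothetical. It checks link state only; it says nothing about whether PFC, ECN, or DSCP settings on the NIC match the switch.

```ruby
# Return the "<device>/<port>" names of RDMA links whose state is not ACTIVE,
# given the text output of `rdma link show`.
def inactive_rdma_links(rdma_link_output)
  rdma_link_output.each_line.map do |line|
    # \bstate skips "physical_state", because "_" counts as a word character.
    m = line.match(/^link (\S+) .*?\bstate (\S+)/)
    m[1] if m && m[2] != 'ACTIVE'
  end.compact
end
```

An empty result means every listed port is ACTIVE; an empty input also returns an empty list, so pair this with a check that at least one RDMA device exists.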

2. nvidia_peermem + ACS disable — the GPUDirect data path

Covered in depth in NCCL & GPUDirect, but here's where it gets installed:

  • prepare.pp loads nvidia_peermem, the kernel module that bridges the NVIDIA driver and the kernel RDMA stack. Without it, the NIC can't DMA into GPU memory and every transfer bounces through host DRAM (≈ half bandwidth at 400 G+).
  • disable_acs.pp turns off PCIe Access Control Services. ACS acts like a PCIe "split-horizon" rule: it forces peer-to-peer traffic up to the root complex and back. With it on, GPU↔NIC traffic takes that detour even if nvidia_peermem is loaded, and on some platforms peer-to-peer transfers can hang.

Both must be right or GPUDirect silently degrades. This is pure data-path config — it just happens to live in a host manifest instead of a switch template.
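To confirm the ACS half of that pair actually took effect, one option is to scan `lspci -vvv` (run as root so capability registers are readable) for bridges whose ACSCtl line still has a flag set. The sketch below assumes the standard lspci layout, where device headers start in column 0 and enabled flags end in `+`; the function name is hypothetical.

```ruby
# Return PCI addresses of devices whose ACS control register still has at
# least one feature enabled, given the text output of `lspci -vvv`.
def acs_enabled_devices(lspci_vvv)
  enabled = []
  current = nil
  lspci_vvv.each_line do |line|
    if (header = line.match(/^((?:\h{4}:)?\h{2}:\h{2}\.\h)\s/))
      # Device header, e.g. "17:00.0 PCI bridge: ..." (optionally with a domain).
      current = header[1]
    elsif (ctl = line.match(/ACSCtl:(.*)/))
      # A '+' suffix on any flag (e.g. "SrcValid+") means that feature is on.
      enabled << current if ctl[1].include?('+')
    end
  end
  enabled
end
```

An empty list is the goal on a GPUDirect host. ACSCap lines are deliberately ignored, since they describe what the bridge supports rather than what is switched on.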

3. Fabric Manager — the other fabric

There are two fabrics in an AI cluster:

┌──────────────────────────────────────────────┐
│                  GPU server                  │
│  GPU ── NVLink ── NVSwitch ── NVLink ── GPU  │  ◄── scale-UP fabric
│   │                                          │      (fabric.pp owns this)
│  NIC ── PCIe                                 │
└────────────┼─────────────────────────────────┘
             │
RoCE/IB Ethernet fabric                           ◄── scale-OUT fabric (you design this)

The scale-out fabric is your spine-leaf RoCE network. The scale-up fabric is NVLink + NVSwitch inside the chassis. NVSwitch is a switch; Fabric Manager is its control plane — it discovers the NVLink topology, programs the routing, and heals around failed links. fabric.pp brings it up, but only on SXM boxes (PCIe cards have no NVSwitch). If you understand a leaf switch, you already understand NVSwitch — this manifest is its ZTP.


Why the params/dispatch split matters

One design point worth calling out, because it mirrors good NOS templating. All platform differences are funneled into one file (params.pp) behind a case on the OS. Everything else reads abstract variables:

# params.pp: the ONLY place OS knowledge lives
case $facts['os']['name'] {
  'OS-A':  { $cuda_pkg = 'cuda' }
  'OS-B':  { $cuda_pkg = 'cuda-open' }
  default: { fail("unsupported OS") }
}

# everywhere else: platform-agnostic
package { $gpu::params::cuda_pkg: ensure => installed }

This is the same discipline as keeping per-vendor differences in template variables instead of scattering if junos / if eos through your configs. Add a new OS or GPU in one place; the pipeline doesn't change.


How to verify the host is fabric-ready

The config-management run is "done" — but is the host actually a working RoCE endpoint? The checks, in order:

# 1. Driver + GPU present
nvidia-smi # GPUs enumerate, driver loaded

# 2. GPUDirect bridge loaded
lsmod | grep nvidia_peermem # must appear

# 3. RDMA stack up (OFED)
ibv_devices # NICs listed
rdma link show # link state ACTIVE

# 4. PCIe peer-to-peer path is direct
nvidia-smi topo -m # GPU↔NIC pairs show PIX/PXB, not SYS

# 5. Scale-up fabric (SXM only)
nvidia-smi -q | grep -i -A 2 fabric # Fabric section: expect State Completed, Status Success

# 6. Telemetry
dcgmi discovery -l # DCGM sees the GPUs

# 7. The real test — end to end
all_reduce_perf -b 8 -e 1G -f 2 -g 8 # one node exercises NVLink only; run across nodes (e.g. under mpirun) and expect near line-rate × NIC count

If step 7 is half what you expect, walk back up: almost always it's step 2 (nvidia_peermem) or step 4 (ACS not disabled).
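When walking back to step 4, reading the topology matrix by eye gets tedious on an 8-GPU box. The sketch below flags every GPU that has no NIC at PIX or PXB distance. It assumes the matrix layout printed by recent `nvidia-smi topo -m` builds, with NIC columns labelled NIC0, NIC1, and so on (older builds may label them by device name); the function name is hypothetical.

```ruby
# Link types that keep GPU↔NIC traffic below the CPU (direct PCIe path).
LOCAL_LINKS = %w[PIX PXB].freeze

# Return the GPUs that have no NIC reachable over PIX/PXB, given the text
# output of `nvidia-smi topo -m`.
def gpus_without_local_nic(topo_output)
  # Strip ANSI styling some versions put on the header, then tokenize rows.
  rows = topo_output.gsub(/\e\[[0-9;]*m/, '').lines.map(&:split).reject(&:empty?)
  return [] if rows.empty?

  header = rows.shift
  nic_columns = header.each_index.select { |i| header[i].start_with?('NIC') }

  rows.select { |cols| cols.first.match?(/\AGPU\d+\z/) }
      # Data rows carry a row label first, so header column i is cols[i + 1].
      .reject { |cols| nic_columns.any? { |i| LOCAL_LINKS.include?(cols[i + 1]) } }
      .map(&:first)
end
```

Any GPU this returns reaches every NIC through the CPU or across sockets, which is worth ruling out alongside the nvidia_peermem and ACS checks when bandwidth is lower than expected.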


What you should remember

  • A GPU host doesn't configure itself. A fact-driven config-management module detects the hardware and converges the box to a fabric-ready state — same shape as NOS templating, applied to the endpoint.
  • OFED is the host half of lossless. Your switch QoS is one end of a two-ended agreement; the NIC stack is the other.
  • GPUDirect needs two things set up correctly: nvidia_peermem loaded and PCIe ACS disabled. Get either wrong and bandwidth silently halves.
  • There are two fabrics. You design the scale-out RoCE network; Fabric Manager runs the scale-up NVLink fabric inside the chassis.
  • Push platform differences into one params file. Everything else stays platform-agnostic — the same discipline as good config templating.

Next: Production Operations — what to monitor on these hosts once they're live, and the 3 AM playbooks. Or revisit NCCL & GPUDirect to see the data path this provisioning enables.