Provisioning the GPU Host with Config Management

Everything in this section so far — SR-IOV, Multus, NCCL, host-side lossless, multi-rail — assumes the host is already configured. The driver is loaded, OFED is installed, nvidia_peermem is in the kernel, PCIe peer-to-peer works. But someone had to put all that there, on every node, identically, idempotently, and re-converge it after every reboot and reimage.

That "someone" is a config-management module. This page walks one — a Puppet module for GPU hosts — as a concrete worked example, and maps each piece back to the fabric concept it implements.

If you've configured switches with a NOS templating system, this is the same idea applied to the endpoint: declarative desired state, applied by an agent, converging the box to spec.

After this page, you'll be able to:

Read a fact-driven GPU module — the detect → dispatch → converge flow where Ruby facts (facts/gpu.rb) discover GPU family / SXM / RDMA NIC and the entry class routes without ever hard-failing.
Name the three data-path manifests (prepare.pp for nvidia_peermem, disable_acs.pp for PCIe ACS off, and ofed.pp for the RDMA NIC stack) and explain why getting any one of them wrong makes the fabric silently run at half speed.
Distinguish the two fabrics — the scale-out RoCE network you design versus the scale-up NVLink/NVSwitch fabric fabric.pp runs via Fabric Manager (SXM only).
Verify fabric-readiness end to end — nvidia-smi, lsmod | grep nvidia_peermem, ibv_devices/rdma link show, nvidia-smi topo -m (PIX/PXB not SYS), and all_reduce_perf as the real test.

Watch it happen

Before the file-by-file detail, watch the whole thing run end-to-end on an 8×H100 box — detection, driver, GPUDirect, NVLink fabric, OFED, the safe reboot, and the verification that proves it's fabric-ready:

MODULE 12 · LAB 6 · Watch the recording: every command, every counter, every output.

🖥 Open the guided interactive replay → — the same end-to-end provisioning run, but with a step-by-step panel that follows along: each step's purpose, why it's needed, and the network analogy, all in sync with the terminal. Space = play/pause, click the scrub bar to jump, click any step card to skip to it.

New to GPU hosts? Why each step exists

If the commands above are unfamiliar, here's the purpose of each step in plain language — with the network analogy. A fresh GPU server is like a switch that booted with no NOS, no config, and its ports admin-down: the silicon is there, but nothing knows how to use it yet.

#	Step	Why it exists	Network analogy
0	Detect	Find out which GPU/NIC is present so the right driver is chosen	`show inventory` / LLDP before templating
1	Prepare kernel (`nvidia_peermem`)	Lets the NIC write directly into GPU memory (GPUDirect RDMA). Skip it and bandwidth ~halves	cut-through vs slow store-and-forward
2	Driver + CUDA	The GPU's operating system — without it the card is a brick	flashing the NOS image
3	DCGM, toolkit, udev	Lets containers see the GPU + sets device permissions	mgmt plane + ACLs for the interfaces
4	Disable PCIe ACS	Removes a PCIe "hairpin" that blocks direct GPU↔NIC traffic	disabling split-horizon so two ports talk directly
5	Core services	Telemetry, keep-driver-warm, health watchdog	SNMP/streaming telemetry + a watchdog
6	NVSwitch Fabric Manager	Control plane for the scale-up NVLink fabric inside the box	routing daemon for the leaf switch inside the chassis
7	MIG	Slice one GPU into isolated tenants	VRFs/VLANs on one physical switch
8	OFED	Brings up the RDMA NIC — the host half of lossless	the other end of the PFC/ECN handshake
9	Safe reboot	OFED/MIG need a reboot; do it once, after config settles	`reload` only after `write mem`, never mid-write
✓	Verify	Prove data actually moves at full speed, end to end	post-change ping/iperf validation

:::tip The two big ones for a network engineer
Step 1 (nvidia_peermem) and Step 8 (OFED) are the data-path steps. Together with Step 4 (ACS off) they decide whether your lossless fabric is actually lossless end-to-end, or silently running at half speed because the host side never agreed to the contract.
:::

The mental model: detect → dispatch → converge

A GPU host module is fact-driven. It does not hardcode "this is an H100 box." It discovers the hardware at runtime and applies the matching stack:

  lspci / nvidia-smi          init.pp                 vendor pipeline
  ────────────────  ──►  ──────────────  ──►  ─────────────────────────────
  custom facts           routing logic         drivers · OFED · services
  (what GPU? IB?)        (which path?)          (make it fabric-ready)

This is exactly LLDP-style discovery + a policy decision + applying config — just on a server instead of a switch.

Stage 1 — Detection (custom facts)

Ruby facts run first and answer three questions:

Fact	Method	Networking analogy
What GPU is installed?	parse `lspci`, match against a known device list	inventory / "what's in this slot"
Is it SXM or MIG-capable?	`nvidia-smi` + device map	port capability / speed-tier detection
Is there an RDMA NIC?	detect InfiniBand/Mellanox hardware	"is this an RDMA-capable port"

The GPU fact returns a structured result the rest of the module keys off:

{
  vendor      => 'nvidia',         # or 'amd'
  nvidia      => {
    present     => true,
    gpu_family  => 'H100',
    sxm_capable => true,           # has NVSwitch → needs fabric manager
    mig_capable => true,           # can be sliced
  },
  error       => nil,
}
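
As a minimal sketch of how a fact like this could be computed, the Ruby below parses `lspci -nn` text into the structure shown above. It is not the module's actual `facts/gpu.rb`: the function names, the two-entry device table, and the fact name `gpu` are assumptions made for illustration, and the device IDs should be checked against `pci.ids` for your fleet.

```ruby
# Illustrative device table: PCI device ID => traits. These two IDs are meant
# to be the H100 SXM5 and H100 PCIe parts; verify them against pci.ids.
NVIDIA_DEVICES = {
  '2330' => { 'gpu_family' => 'H100', 'sxm_capable' => true,  'mig_capable' => true },
  '2331' => { 'gpu_family' => 'H100', 'sxm_capable' => false, 'mig_capable' => true }
}.freeze

# Matches display-class devices (PCI class 03xx) in `lspci -nn` output and
# captures the trailing [vendor:device] pair.
GPU_LINE = /\[03[0-9a-f]{2}\]:.*\[([0-9a-f]{4}):([0-9a-f]{4})\]/i

# Builds the structured GPU result from the text of `lspci -nn`.
def detect_gpu(lspci_text)
  lspci_text.each_line do |line|
    m = GPU_LINE.match(line)
    next unless m

    vendor_id = m[1].downcase
    device_id = m[2].downcase
    if vendor_id == '10de' # NVIDIA
      traits = NVIDIA_DEVICES[device_id]
      return { 'vendor' => 'nvidia', 'nvidia' => { 'present' => true }.merge(traits), 'error' => nil } if traits
      return { 'vendor' => nil, 'nvidia' => { 'present' => true }, 'error' => "unlisted NVIDIA device #{device_id}" }
    elsif vendor_id == '1002' # AMD
      return { 'vendor' => 'amd', 'nvidia' => { 'present' => false }, 'error' => nil }
    end
  end
  # No GPU found: report absence instead of raising, so dispatch can no-op.
  { 'vendor' => nil, 'nvidia' => { 'present' => false }, 'error' => nil }
end

# True when any Mellanox/NVIDIA networking device (PCI vendor 15b3) is present.
def rdma_nic?(lspci_text)
  lspci_text.match?(/\[15b3:[0-9a-f]{4}\]/i)
end

# Registered only when this file is loaded by Facter; plain Ruby skips it.
if defined?(Facter)
  Facter.add(:gpu) do
    confine kernel: 'Linux'
    setcode { detect_gpu(Facter::Core::Execution.execute('lspci -nn')) }
  end
end
```

Keeping the parsing in plain functions that take text means the detection logic can be unit-tested without hardware, and an unlisted device comes back as data (`error` set, `vendor` nil) rather than as an exception.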

Stage 2 — Dispatch (the entry class)

The entry manifest reads those facts and routes. Critically, it never hard-fails — an unsupported or absent GPU emits a notice and the node converges to a no-op, exactly like a switch template that skips a feature the platform doesn't support:

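# Fact names ('has_infiniband', 'gpu') are illustrative; use the names your facts export.
if $facts['has_infiniband'] { include gpu::ofed }         # RDMA NIC present → enable RoCE/IB
case $facts.dig('gpu', 'vendor') {                        # undef (no GPU fact) falls to default
  'nvidia': { contain gpu::nvidia }                       # NVIDIA driver pipeline
  'amd':    { contain gpu::amd }                          # ROCm pipeline
  default:  { notify { 'no supported GPU': } }            # report, don't fail
}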

Stage 3 — Converge (the vendor pipeline)

For NVIDIA, a strictly ordered chain — each link must finish before the next, the same way you'd never enable PFC before the queue exists:

params → prepare → cuda → install → services

The AMD path (gpu::amd) mirrors this shape, just with the ROCm stack swapped in for CUDA: it installs the ROCm driver (the amdgpu DKMS module), the ROCm runtime + libraries, RCCL (AMD's NCCL-equivalent collective library), and amd_peer_mem — the GPUDirect bridge that lets the NIC DMA straight into GPU memory, exactly the role nvidia_peermem plays on the NVIDIA side. The OFED / rdma-core step (ofed.pp) is shared — the RDMA NIC stack is vendor-agnostic, so it runs identically whichever GPU vendor dispatched. Only the GPU half of the pipeline branches.

The manifest map

This is the heart of the page: each manifest, what it does, and why a network engineer should care.

Manifest	What it configures	Curriculum link	Why it's really a network concern
`facts/gpu.rb`	GPU family / SXM / MIG detection	GPU & Server	Endpoint discovery — you can't configure a port you haven't identified
`params.pp`	All OS-specific package names & versions, one place	Linux for NetEng	The "platform abstraction" — like per-NOS template branches
`nvidia/cuda.pp`	CUDA driver + libraries	GPU & Server	The base "NOS" of the accelerator
`nvidia/prepare.pp`	modprobe config, dracut, `nvidia_peermem`	NCCL & GPUDirect	Enables GPUDirect RDMA — NIC DMAs straight into GPU memory
`nvidia/services/disable_acs.pp`	Disables PCIe ACS	GPU & Server (PCIe)	Opens the PCIe peer-to-peer path GPUDirect rides on
`ofed.pp`	Installs OFED (RDMA NIC stack)	RDMA · RoCE v2	The host half of "lossless" — no OFED, no RoCE
`nvidia/services/fabric.pp`	NVSwitch Fabric Manager (SXM only)	GPU & Server (NVLink)	Control plane for the scale-up fabric inside the box
`nvidia/services/mig.pp`	MIG slicing	Inference Networking	VRF-style tenant isolation of one GPU
`nvidia/services/dcgm.pp` + `health.pp`	DCGM agent + health watcher	Production Operations	Endpoint telemetry — your "is the link healthy", for the GPU
`nvidia/services/persistence.pp`	`nvidia-persistenced`	Production Operations	Keeps the driver warm so first-job latency isn't a cold start
`reboot.pp`	Safe, lock-aware reboot handler	Production Operations	Reboots (OFED/MIG need them) without colliding with the config agent
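
The table only says the reboot handler is "lock-aware"; it does not show how. One plausible reading, sketched below under that assumption, is a guard that waits for the config agent's run lock to clear before rebooting. The function name is hypothetical, and the lock path is the Puppet agent's usual default on packaged installs (confirm yours with `puppet config print agent_catalog_run_lockfile`).

```ruby
# Hypothetical lock-aware reboot guard; not taken from the module itself.
# Assumed default location of the Puppet agent's catalog-run lock file.
AGENT_RUN_LOCK = '/opt/puppetlabs/puppet/cache/state/agent_catalog_run.lock'

# Polls until the lock file disappears.
# Returns true if the agent went idle before the timeout, false otherwise.
def wait_for_agent_idle(lock_path = AGENT_RUN_LOCK, timeout: 600, poll: 5)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  while File.exist?(lock_path)
    return false if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline

    sleep poll
  end
  true
end
```

A caller would reboot only on a `true` result, for example `system('systemctl', 'reboot') if wait_for_agent_idle`, and leave the node up (to retry later) when the agent is still mid-run.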

The three pieces that are network config in disguise

If you read nothing else, read this. Three manifests directly determine whether your lossless fabric is actually lossless end-to-end.

1. OFED — the host half of lossless

You can configure perfect PFC, ECN, and DCQCN on every switch (see Switch QoS). It buys you nothing if the NIC isn't brought up to honor it. OFED is the driver + userspace stack (rdma-core, verbs, the mlx5 driver) that makes the NIC a participant in the lossless contract.

Switch side (you):  PFC · ECN marking (the DCQCN congestion point)
       the wire  ◄──────────────────────────────────────────────►
Host side (OFED):   NIC honors PFC pause · sends and obeys CNPs for ECN marks · sets DSCP/SL

Lossless is a two-ended agreement. OFED is the host signing it.

2. nvidia_peermem + ACS disable — the GPUDirect data path

Covered in depth in NCCL & GPUDirect, but here's where it gets installed:

prepare.pp loads nvidia_peermem, the kernel bridge between the NVIDIA driver and the kernel RDMA stack (ib_core). Without it, the NIC can't DMA into GPU memory and every transfer bounces through host DRAM (roughly half bandwidth at 400 G+).
disable_acs.pp turns off PCIe Access Control Services. ACS is a PCIe "split-horizon" that forces peer-to-peer traffic up to the root complex and back. With it on, GPU↔NIC peer-to-peer DMA takes that detour, which slows it badly and can even hang it, even if nvidia_peermem is loaded.

Both must be right or GPUDirect silently degrades. This is pure data-path config — it just happens to live in a host manifest instead of a switch template.
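
Because the failure is silent, it helps to check both prerequisites explicitly. The sketch below does that from plain command output; the function names are hypothetical, and the inputs are the text of `/proc/modules` and of `lspci -vvv` (run as root so the capability registers are readable).

```ruby
# True when the nvidia_peermem kernel module appears in /proc/modules text.
def peermem_loaded?(proc_modules_text)
  proc_modules_text.each_line.any? { |line| line.split.first == 'nvidia_peermem' }
end

# Returns the ACSCtl lines that still have any ACS control flag switched on.
# `lspci -vvv` prints each flag as Name+ (on) or Name- (off), so a '+' on an
# ACSCtl line means ACS is still active on that bridge.
def acs_enabled_lines(lspci_vvv_text)
  lspci_vvv_text.each_line
                .map(&:strip)
                .select { |line| line.start_with?('ACSCtl:') && line.include?('+') }
end
```

On a host, `peermem_loaded?(File.read('/proc/modules'))` should be true and `acs_enabled_lines(...)` should come back empty for the bridges between GPU and NIC. Note that ACSCap lines (what the bridge is capable of) are ignored on purpose; only ACSCtl reflects the current setting.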

3. Fabric Manager — the other fabric

There are two fabrics in an AI cluster:

   ┌───────────────────────────────────────────────┐
   │  GPU server                                   │
   │   GPU ── NVLink ── NVSwitch ── NVLink ── GPU  │  ◄── scale-UP fabric
   │            │                                  │      (fabric.pp owns this)
   │           NIC ── PCIe                         │
   └────────────┼──────────────────────────────────┘
                │
         RoCE/IB Ethernet fabric  ◄── scale-OUT fabric (you design this)

The scale-out fabric is your spine-leaf RoCE network. The scale-up fabric is NVLink + NVSwitch inside the chassis. NVSwitch is a switch; Fabric Manager is its control plane — it discovers the NVLink topology, programs the routing, and heals around failed links. fabric.pp brings it up, but only on SXM boxes (PCIe cards have no NVSwitch). If you understand a leaf switch, you already understand NVSwitch — this manifest is its ZTP.

Why the params/dispatch split matters

One design point worth calling out, because it mirrors good NOS templating. All platform differences are funneled into one file (params.pp) behind a case on the OS. Everything else reads abstract variables:

# params.pp — the ONLY place OS knowledge lives
case $facts['os']['name'] {
  'OS-A':  { $cuda_pkg = 'cuda' }
  'OS-B':  { $cuda_pkg = 'cuda-open' }
  default: { fail("unsupported OS") }
}

# everywhere else — platform-agnostic
package { $gpu::params::cuda_pkg: ensure => installed }

This is the same discipline as keeping per-vendor differences in template variables instead of scattering if junos / if eos through your configs. Add a new OS or GPU in one place; the pipeline doesn't change.

How to verify the host is fabric-ready

The config-management run is "done" — but is the host actually a working RoCE endpoint? The checks, in order:

# 1. Driver + GPU present
nvidia-smi                                  # GPUs enumerate, driver loaded

# 2. GPUDirect bridge loaded
lsmod | grep nvidia_peermem                 # must appear

# 3. RDMA stack up (OFED)
ibv_devices                                 # NICs listed
rdma link show                              # link state ACTIVE

# 4. PCIe peer-to-peer path is direct
nvidia-smi topo -m                          # GPU↔NIC pairs show PIX/PXB, not SYS

# 5. Scale-up fabric (SXM only)
nvidia-smi -q | grep -i -A 2 fabric         # Fabric State: Completed, Status: Success

# 6. Telemetry
dcgmi discovery -l                          # DCGM sees the GPUs

# 7. The real test — end to end
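all_reduce_perf -b 8 -e 1G -f 2 -g 8        # single node: exercises NVLink only
# run it across ≥2 nodes (e.g. under mpirun) to load the NICs: near line-rate × NIC count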

If step 7 is half what you expect, walk back up: almost always it's step 2 (nvidia_peermem) or step 4 (ACS not disabled).
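
The "walk back up" habit can be automated. The sketch below runs a subset of the checks above in order and reports the first one that fails. The check table, the function name, and the output patterns are assumptions about typical output, not part of the module; the topology and Fabric Manager checks are left out because their output needs more careful parsing.

```ruby
require 'open3'

# Ordered checks: [label, command, pattern the output must match].
# Patterns are assumptions about typical output; adjust them for your hosts.
CHECKS = [
  ['driver',    'nvidia-smi',         /Driver Version/],
  ['peermem',   'lsmod',              /^nvidia_peermem\b/],
  ['rdma',      'rdma link show',     /state ACTIVE/],
  ['telemetry', 'dcgmi discovery -l', /GPU/]
].freeze

# Default runner: executes the command and returns [combined output, success?].
# A missing binary counts as a failed check rather than a crash.
SHELL_RUNNER = lambda do |cmd|
  begin
    out, status = Open3.capture2e(cmd)
    [out, status.success?]
  rescue Errno::ENOENT
    ['', false]
  end
end

# Runs the checks in order; returns the label of the first failure, or nil
# when every check passes. The runner is injectable so it can be stubbed.
def first_failure(checks = CHECKS, runner = SHELL_RUNNER)
  checks.each do |label, cmd, pattern|
    out, ok = runner.call(cmd)
    return label unless ok && out.match?(pattern)
  end
  nil
end
```

Calling `first_failure` with no arguments runs the real commands on the host; passing a stub runner lets the same logic be tested off-box. Stopping at the first failure mirrors the ordering above, since a later check is not meaningful when an earlier layer is missing.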

💡 What you should remember

#		Concept	Why it matters
1	🧠	A GPU host doesn't configure itself.	A fact-driven config-management module detects the hardware and converges the box to a fabric-ready state — same shape as NOS templating, applied to the endpoint.
2	🔌	OFED is the host half of lossless.	Your switch QoS is one end of a two-ended agreement; the NIC stack is the other.
3	⚡	GPUDirect needs two things set up correctly.	`nvidia_peermem` loaded and PCIe ACS disabled. Get either wrong and bandwidth silently halves.
4	🧩	There are two fabrics.	You design the scale-out RoCE network; Fabric Manager runs the scale-up NVLink fabric inside the chassis.
5	🛠️	Push platform differences into one params file.	Everything else stays platform-agnostic — the same discipline as good config templating.

Next: Production Operations — what to monitor on these hosts once they're live, and the 3 AM playbooks. Or revisit NCCL & GPUDirect to see the data path this provisioning enables.
