
Configure the Hosts + Kubernetes

The fabric is up. Now the hosts. This page is the chain from bare metal to "a pod can request 8 RDMA NICs and get them" — every link in the order it has to be done.

The chain:

BIOS → kernel cmdline → drivers → RDMA core → GPU Operator → Network Operator → Multus → SR-IOV CNI → NADs → pod

Get any link wrong and the next one fails cryptically. Order matters.


Phase 1: BIOS settings

These are usually one-time at server install. Document them — getting access to BIOS at 3 AM is painful.

  • VT-d / AMD-Vi: Enabled. IOMMU at the chipset level; required for any VF passthrough.
  • SR-IOV: Enabled. Some BIOSes expose this as a separate toggle.
  • ACS (Access Control Services): Enabled. Required for per-VF isolation in the IOMMU.
  • AER (Advanced Error Reporting): Enabled. PCIe fault containment.
  • PCIe Hot Plug: Disabled for stability; avoids surprise device resets.
  • NUMA Snooping / Memory Mode: Match server vendor recommendation. Affects GPUDirect performance.

Most server vendors ship a "GPU compute" or "AI training" BIOS profile that sets these correctly. Use it.
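You can confirm the IOMMU settings took effect from the OS without another trip into the BIOS: the kernel populates /sys/kernel/iommu_groups only when the IOMMU is actually on. A minimal sketch — the directory argument is an assumption added so the helper can be exercised against a fake tree; on a real host, call it with no argument:

```shell
#!/usr/bin/env bash
# count_iommu_groups [sysfs_dir]
# Prints the number of IOMMU groups the kernel created. Zero groups
# almost always means VT-d/AMD-Vi is off in BIOS or the kernel cmdline
# is missing intel_iommu=on / amd_iommu=on.
count_iommu_groups() {
  local dir="${1:-/sys/kernel/iommu_groups}"
  [ -d "$dir" ] || { echo 0; return; }
  find "$dir" -mindepth 1 -maxdepth 1 -type d | wc -l
}
```

On a healthy GPU host you should see dozens of groups; `dmesg | grep -e DMAR -e IOMMU` gives the corresponding boot-time evidence.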


Phase 2: Kernel command line

Edit /etc/default/grub (or your distro's equivalent), then update-grub and reboot.

GRUB_CMDLINE_LINUX="intel_iommu=on iommu=pt default_hugepagesz=1G hugepagesz=1G hugepages=64 isolcpus=0-31 nohz_full=0-31"

For AMD: amd_iommu=on iommu=pt.

What each does:

  • intel_iommu=on — enable the IOMMU driver
  • iommu=pt — passthrough mode for host devices (PF stays direct, VFs go through mapping)
  • default_hugepagesz=1G hugepagesz=1G hugepages=64 — reserve 64 × 1 GB hugepages (64 GB) for DMA-able buffers. RDMA at 400 G needs large contiguous regions.
  • isolcpus=0-31 — reserve cores 0-31 for the application (no kernel scheduler interference). Tune for your CPU.
  • nohz_full=0-31 — disable the tick on isolated cores (less jitter for tight RDMA loops)

Verify after reboot:

cat /proc/cmdline # confirm flags applied
dmesg | grep -i iommu # should show IOMMU enabled
cat /proc/meminfo | grep Huge # should show 64 hugepages
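The three checks above can be wrapped into a script you run after every reboot. A sketch for the hugepage check — the file and count arguments are assumptions added so the parser can be tested against canned input; on a real host, call it with no arguments:

```shell
#!/usr/bin/env bash
# check_hugepages [meminfo_file] [expected_count]
# Verifies the 1G hugepage pool from the kernel cmdline actually
# materialized. Reads /proc/meminfo by default.
check_hugepages() {
  local src="${1:-/proc/meminfo}" want="${2:-64}"
  local total
  total=$(awk '/^HugePages_Total:/ {print $2}' "$src")
  if [ "${total:-0}" -ge "$want" ]; then
    echo "OK: $total hugepages"
  else
    echo "FAIL: wanted $want, got ${total:-0}" >&2
    return 1
  fi
}
```

A silent shortfall here (e.g. the kernel could not find 64 contiguous gigabytes at boot) surfaces later as pods stuck in Pending on hugepages-1Gi, so it is worth failing fast.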

Phase 3: NIC driver + RDMA core

The Mellanox / NVIDIA OFED driver is usually installed via the Network Operator (Phase 5), but for manual testing you can install it directly:

# Download from NVIDIA (or use distro packages)
./mlnxofedinstall --upstream-libs --dpdk

# Create VFs on each NIC port
echo 16 > /sys/class/net/enp1s0/device/sriov_numvfs
echo 16 > /sys/class/net/enp2s0/device/sriov_numvfs
# ... repeat for all 8 NIC ports

Verify:

lspci | grep -i mellanox # PFs + VFs visible
rdma link # list all mlx5_N devices
ibv_devinfo # confirm RDMA verbs accessible
ibstat # detailed per-device info

Each port should show 1 PF + N VFs.
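The sriov_numvfs writes do not survive a reboot, so script the creation step rather than typing it eight times. A sketch — the SYSFS_ROOT override is an assumption added so the loop can be dry-run against a fake tree; on a real host, leave it unset:

```shell
#!/usr/bin/env bash
# create_vfs <num_vfs> <iface>... — write sriov_numvfs for each PF.
create_vfs() {
  local n="$1"; shift
  local root="${SYSFS_ROOT:-}" iface
  for iface in "$@"; do
    local f="$root/sys/class/net/$iface/device/sriov_numvfs"
    if [ ! -e "$f" ]; then
      echo "skip: $iface has no sriov_numvfs (driver loaded? SR-IOV enabled?)" >&2
      continue
    fi
    echo 0 > "$f"     # must reset to 0 before changing to a new count
    echo "$n" > "$f"
    echo "$iface: $n VFs"
  done
}
# Example: create_vfs 16 enp1s0 enp2s0 ...  (all 8 NIC ports)
```

Run it from a systemd oneshot unit (or a udev rule) so the VFs come back after every reboot — if you later move VF creation into the SR-IOV Operator, the operator takes over this job.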


Phase 4: Install Kubernetes

This curriculum assumes you have a working k8s cluster. If not, use kubeadm, kubespray, or your provider's installer. For AI training, a few requirements go beyond the defaults:

  • Kubernetes 1.28+ — required for Topology Manager improvements
  • Container runtime: containerd 1.7+ — works best with GPU Operator
  • CNI: Calico or Cilium for the management (eth0) network
  • Topology Manager policy: single-numa-node — ensures pods get GPU + NIC on the same NUMA node

Edit /var/lib/kubelet/config.yaml on each node:

topologyManagerPolicy: single-numa-node
cpuManagerPolicy: static

Then delete /var/lib/kubelet/cpu_manager_state (the kubelet refuses to start if its saved CPU state was written under a different cpuManagerPolicy) and systemctl restart kubelet.


Phase 5: NVIDIA GPU Operator + Network Operator

Install the two operators that bootstrap everything else.

GPU Operator (via Helm):

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait gpu-operator \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set dcgmExporter.enabled=true

This installs:

  • NVIDIA driver (DaemonSet, replaces the host install)
  • Container runtime hook (so containers see GPUs)
  • Device Plugin (registers nvidia.com/gpu resource)
  • DCGM exporter (telemetry)
  • Node Feature Discovery

Network Operator:

helm install --wait network-operator \
  -n nvidia-network-operator --create-namespace \
  nvidia/network-operator \
  --set sriovNetworkOperator.enabled=true \
  --set deployCR=true

Then create a NicClusterPolicy to configure the OFED driver and SR-IOV inventory:

apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
  namespace: nvidia-network-operator
spec:
  ofedDriver:
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: 24.10-0.5.5.0
  rdmaSharedDevicePlugin:
    config: |
      {
        "configList": [
          {"resourceName": "rdma_shared_device_a", "rdmaHcaMax": 64, "selectors": {"vendors": ["15b3"]}}
        ]
      }

Verify both operators are healthy:

kubectl get pods -n gpu-operator
kubectl get pods -n nvidia-network-operator

Wait for everything to reach Running before continuing.


Phase 6: SR-IOV Operator + NetworkAttachmentDefinitions

Configure how many VFs come from each PF, and label them as schedulable resources.

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: rail-0
  namespace: nvidia-network-operator
spec:
  resourceName: rail0_vf
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  numVfs: 8
  nicSelector:
    pfNames: ["enp1s0"]
  deviceType: netdevice
  isRdma: true

Repeat for rail-1 through rail-7, each pointing to a different PF interface name.
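Rather than hand-writing eight near-identical manifests, generate them. A sketch — the PF naming pattern (enp1s0 … enp8s0) is an assumption extrapolated from the first two ports; substitute your actual interface names before piping to kubectl apply -f -:

```shell
#!/usr/bin/env bash
# gen_rail_policies — emit one SriovNetworkNodePolicy per rail on stdout.
gen_rail_policies() {
  local pfs=(enp1s0 enp2s0 enp3s0 enp4s0 enp5s0 enp6s0 enp7s0 enp8s0)
  local i
  for i in 0 1 2 3 4 5 6 7; do
    cat <<EOF
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: rail-$i
  namespace: nvidia-network-operator
spec:
  resourceName: rail${i}_vf
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  numVfs: 8
  nicSelector:
    pfNames: ["${pfs[$i]}"]
  deviceType: netdevice
  isRdma: true
EOF
  done
}
```

Generating the manifests also guarantees the rail-N / railN_vf / PF-name mapping stays consistent — a one-character typo here produces VFs that schedule onto the wrong rail.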

Then create one NetworkAttachmentDefinition (NAD) per rail:

apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: sriov-rail-0
  namespace: ai-training
  annotations:
    k8s.v1.cni.cncf.io/resourceName: nvidia.com/rail0_vf
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "sriov",
      "name": "sriov-rail-0",
      "ipam": {
        "type": "whereabouts",
        "range": "10.50.0.0/16"
      }
    }

One NAD per rail (sriov-rail-0 through sriov-rail-7). The pod will reference these by name.
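The eight NADs can be generated the same way as the node policies. A sketch — splitting 10.50.0.0/16 into one /24 per rail is an assumption (the example above puts rail-0 on the whole /16); per-rail subnets keep whereabouts from handing addresses on one rail that belong to another, but adjust to your addressing plan:

```shell
#!/usr/bin/env bash
# gen_rail_nads — emit one NetworkAttachmentDefinition per rail on stdout.
gen_rail_nads() {
  local i
  for i in 0 1 2 3 4 5 6 7; do
    cat <<EOF
---
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: sriov-rail-$i
  namespace: ai-training
  annotations:
    k8s.v1.cni.cncf.io/resourceName: nvidia.com/rail${i}_vf
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "sriov",
      "name": "sriov-rail-$i",
      "ipam": {
        "type": "whereabouts",
        "range": "10.50.$i.0/24"
      }
    }
EOF
  done
}
```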


Phase 7: The pod spec template

You're done with cluster setup. Now write a training pod spec:

apiVersion: v1
kind: Pod
metadata:
  name: training-worker-0
  namespace: ai-training
  annotations:
    k8s.v1.cni.cncf.io/networks: |
      [
        {"name": "sriov-rail-0", "interface": "net1"},
        {"name": "sriov-rail-1", "interface": "net2"},
        {"name": "sriov-rail-2", "interface": "net3"},
        {"name": "sriov-rail-3", "interface": "net4"},
        {"name": "sriov-rail-4", "interface": "net5"},
        {"name": "sriov-rail-5", "interface": "net6"},
        {"name": "sriov-rail-6", "interface": "net7"},
        {"name": "sriov-rail-7", "interface": "net8"}
      ]
spec:
  containers:
  - name: trainer
    image: nvcr.io/nvidia/pytorch:24.10-py3
    resources:
      limits:
        nvidia.com/gpu: 8
        nvidia.com/rail0_vf: 1
        nvidia.com/rail1_vf: 1
        nvidia.com/rail2_vf: 1
        nvidia.com/rail3_vf: 1
        nvidia.com/rail4_vf: 1
        nvidia.com/rail5_vf: 1
        nvidia.com/rail6_vf: 1
        nvidia.com/rail7_vf: 1
        hugepages-1Gi: 16Gi
        memory: 1500Gi
    securityContext:
      capabilities:
        add: ["IPC_LOCK", "SYS_NICE"]
    env:
    - name: NCCL_IB_HCA
      value: "mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7"
    - name: NCCL_IB_GID_INDEX
      value: "3"
    - name: NCCL_IB_QPS_PER_CONNECTION
      value: "4"
    - name: NCCL_SOCKET_IFNAME
      value: "eth0"

The two things people miss:

  1. IPC_LOCK capability — required for RDMA memory pinning. Without it, ibv_reg_mr fails inside the container.
  2. NCCL env vars — without NCCL_IB_HCA, NCCL only uses the first NIC it finds. With it, you get all 8.
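A quick in-pod sanity check is to count the RDMA devices the container can actually see. A sketch — the helper parses ibv_devinfo output from stdin so it can be tested offline; inside the pod you pipe the real command into it:

```shell
#!/usr/bin/env bash
# count_hcas — reads `ibv_devinfo` output on stdin, prints the number
# of hca_id: entries. In the pod:  ibv_devinfo | count_hcas
# Expect one per rail (8). Fewer means a VF didn't attach.
count_hcas() {
  grep -c '^hca_id:' || true   # grep exits 1 on zero matches; still prints 0
}
```

While you're in the pod, also check `ulimit -l`: RDMA memory registration needs a large (ideally unlimited) memlock limit in addition to the IPC_LOCK capability.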

Tune DCQCN on each host

Once the cluster is up, enable DCQCN on the NICs. This is a one-time per-host config:

mlnx_qos -i enp1s0 --trust dscp
echo 1 > /sys/class/net/enp1s0/ecn/roce_np/enable/3
echo 1 > /sys/class/net/enp1s0/ecn/roce_rp/enable/3
# ... repeat for all 8 NIC interfaces

This tells the NIC to:

  • Trust DSCP for QoS classification (match the switch config)
  • Enable Notification Point (generate CNPs on incoming CE-marked packets)
  • Enable Reaction Point (react to CNPs by dialing back send rate)

Both NP and RP must be enabled — every NIC is both a sender and receiver.
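The per-interface commands above are easy to loop. A sketch — the SYSFS_ROOT override is an assumption added so the sysfs writes can be dry-run against a scratch tree; on a real host, leave it unset:

```shell
#!/usr/bin/env bash
# enable_dcqcn <prio> <iface>... — enable NP and RP for one RoCE priority.
enable_dcqcn() {
  local prio="$1"; shift
  local root="${SYSFS_ROOT:-}" iface role
  for iface in "$@"; do
    # QoS trust setting is a no-op in dry runs where mlnx_qos is absent
    mlnx_qos -i "$iface" --trust dscp 2>/dev/null || true
    for role in roce_np roce_rp; do
      echo 1 > "$root/sys/class/net/$iface/ecn/$role/enable/$prio"
    done
  done
}
# Example: enable_dcqcn 3 enp1s0 enp2s0 ...  (all 8 ports, priority 3)
```

Run it from the same boot-time unit as the VF creation script — like the sriov_numvfs writes, these settings do not persist across reboots.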


What you should remember

  • The chain has one direction: BIOS → kernel → drivers → operators → Multus → CNI → NAD → pod. Don't skip steps.
  • IPC_LOCK capability in the pod is the #1 forgotten setting.
  • NCCL env vars matter — without NCCL_IB_HCA listing all 8 NICs, NCCL uses one and you lose 87.5% of bandwidth.
  • DCQCN must be enabled on every NIC — the switch config alone isn't enough.
  • Capture a working pod spec as a template. Most ops debugging is "diff this against the known-good spec."

Next: Validate & Run the First Training Job → — the validation pyramid. Prove the cluster works at every layer before trusting it.