Configure the Hosts + Kubernetes
The fabric is up. Now the hosts. This page is the chain from bare metal to "a pod can request 8 RDMA NICs and get them" — every link in the order it has to be done.
The chain:
BIOS → kernel cmdline → drivers → RDMA core → GPU Operator → Network Operator → Multus → SR-IOV CNI → NADs → pod
Get any link wrong and the next one fails cryptically. Order matters.
Phase 1: BIOS settings
These are usually one-time at server install. Document them — getting access to BIOS at 3 AM is painful.
| Setting | Value | Why |
|---|---|---|
| VT-d / AMD-Vi | Enabled | IOMMU at the chipset level — required for any VF passthrough |
| SR-IOV | Enabled | Some BIOSes have this as a separate toggle |
| ACS (Access Control Services) | Enabled | Required for per-VF isolation in the IOMMU |
| AER (Advanced Error Reporting) | Enabled | PCIe fault containment |
| PCIe Hot Plug | Disabled | Disable for stability — avoid surprise resets |
| NUMA Snooping / Memory Mode | Match server vendor recommendation | Affects GPUDirect performance |
Most server vendors ship a "GPU compute" or "AI training" BIOS profile that sets these correctly. Use it.
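Most of the table can be sanity-checked from a booted OS without another trip into the BIOS. A rough check, assuming root access (the PCIe address is an example; take yours from lspci):
ls /sys/kernel/iommu_groups | wc -l      # > 0 once the Phase 2 kernel flags are also set
NIC=0000:01:00.0                         # example address; substitute one of your NIC PFs
lspci -vvv -s "$NIC" | grep -i 'Single Root I/O Virtualization'   # SR-IOV capability present
lspci -vvv -s "$NIC" | grep -i 'ACSCtl'                           # ACS control bits present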
Phase 2: Kernel command line
Edit /etc/default/grub (or your distro's equivalent), then update-grub and reboot.
GRUB_CMDLINE_LINUX="intel_iommu=on iommu=pt default_hugepagesz=1G hugepagesz=1G hugepages=64 isolcpus=0-31 nohz_full=0-31"
For AMD: amd_iommu=on iommu=pt.
What each does:
- `intel_iommu=on` — enable the IOMMU driver
- `iommu=pt` — passthrough mode for host devices (PF stays direct, VFs go through mapping)
- `default_hugepagesz=1G hugepagesz=1G hugepages=64` — reserve 64 × 1 GB hugepages (64 GB) for DMA-able buffers. RDMA at 400 G needs large contiguous regions.
- `isolcpus=0-31` — reserve cores 0-31 for the application (no kernel scheduler interference). Tune for your CPU.
- `nohz_full=0-31` — disable the tick on the isolated cores (less jitter for tight RDMA loops)
Verify after reboot:
cat /proc/cmdline # confirm flags applied
dmesg | grep -i iommu # should show IOMMU enabled
cat /proc/meminfo | grep Huge # should show 64 hugepages
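The CPU isolation flags also have a sysfs view that is easier to read than the raw cmdline:
cat /sys/devices/system/cpu/isolated     # should print 0-31
cat /sys/devices/system/cpu/nohz_full    # should print 0-31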
Phase 3: NIC driver + RDMA core
The Mellanox / NVIDIA OFED driver is usually installed via the Network Operator (Phase 5), but for manual testing you can install it directly:
# Download from NVIDIA (or use distro packages)
./mlnxofedinstall --upstream-libs --dpdk
# Create VFs on each NIC port
echo 16 > /sys/class/net/enp1s0/device/sriov_numvfs
echo 16 > /sys/class/net/enp2s0/device/sriov_numvfs
# ... repeat for all 8 NIC ports
Verify:
lspci | grep -i mellanox # PFs + VFs visible
rdma link # list all mlx5_N devices
ibv_devinfo # confirm RDMA verbs accessible
ibstat # detailed per-device info
Each port should show 1 PF + N VFs.
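Note that sriov_numvfs writes do not survive a reboot. If you stay on the manual path (instead of letting the SR-IOV Operator own VF creation in Phase 6), one way to persist them is a small boot-time unit. The script path and interface names below are examples, not fixed names:
#!/usr/bin/env bash
# /usr/local/sbin/create-vfs.sh (example path): recreate VFs on every boot
set -euo pipefail
for pf in enp1s0 enp2s0; do                                # ...extend to all 8 PF ports
    echo 0  > "/sys/class/net/${pf}/device/sriov_numvfs"   # reset before setting a new count
    echo 16 > "/sys/class/net/${pf}/device/sriov_numvfs"
done

# /etc/systemd/system/sriov-vfs.service: oneshot unit that runs the script at boot
[Unit]
Description=Create SR-IOV VFs on RDMA NICs
[Service]
Type=oneshot
ExecStart=/usr/local/sbin/create-vfs.sh
[Install]
WantedBy=multi-user.target
Enable it with systemctl enable sriov-vfs.service.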
Phase 4: Install Kubernetes
This curriculum assumes you have a working k8s cluster. If not, use kubeadm, kubespray, or your provider's installer. For AI training, there are a few requirements above and beyond the defaults:
- Kubernetes 1.28+ — required for Topology Manager improvements
- Container runtime: containerd 1.7+ — works best with GPU Operator
- CNI: Calico or Cilium for the management (eth0) network
- Topology Manager policy: `single-numa-node` — ensures pods get a GPU and NIC on the same NUMA node
Edit /var/lib/kubelet/config.yaml on each node:
topologyManagerPolicy: single-numa-node
cpuManagerPolicy: static
Then systemctl restart kubelet.
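To confirm the kubelet picked the change up, read its live config back through the API server proxy (the node name is a placeholder). One gotcha worth knowing: switching cpuManagerPolicy on a node that already ran with the default usually also requires deleting /var/lib/kubelet/cpu_manager_state before the restart, or the kubelet refuses to start.
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/configz" \
  | grep -oE '"(topologyManagerPolicy|cpuManagerPolicy)" *: *"[^"]*"'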
Phase 5: NVIDIA GPU Operator + Network Operator
Install the two operators that bootstrap everything else.
GPU Operator (via Helm):
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait gpu-operator \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--set driver.enabled=true \
--set toolkit.enabled=true \
--set devicePlugin.enabled=true \
--set dcgmExporter.enabled=true
This installs:
- NVIDIA driver (DaemonSet, replaces the host install)
- Container runtime hook (so containers see GPUs)
- Device Plugin (registers the `nvidia.com/gpu` resource)
- DCGM exporter (telemetry)
- Node Feature Discovery
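Once the driver DaemonSet settles, a quick check that every GPU node actually advertises the resource:
kubectl describe nodes | grep 'nvidia.com/gpu:'
# each GPU node should show nvidia.com/gpu: 8 under both Capacity and Allocatable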
Network Operator:
helm install --wait network-operator \
-n nvidia-network-operator --create-namespace \
nvidia/network-operator \
--set sriovNetworkOperator.enabled=true \
--set deployCR=true
Then create a NicClusterPolicy to configure the OFED driver and SR-IOV inventory:
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
  namespace: nvidia-network-operator
spec:
  ofedDriver:
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: 24.10-0.5.5.0
  rdmaSharedDevicePlugin:
    config: |
      {
        "configList": [
          {"resourceName": "rdma_shared_device_a", "rdmaHcaMax": 64, "selectors": {"vendors": ["15b3"]}}
        ]
      }
Verify both operators are healthy:
kubectl get pods -n gpu-operator
kubectl get pods -n nvidia-network-operator
Wait for everything to reach Running before continuing.
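If the NicClusterPolicy applied cleanly, the shared-device resource from its config should also show up on the nodes (rdma/ is the shared device plugin's default resource prefix):
kubectl describe nodes | grep 'rdma/rdma_shared_device_a'
# a non-zero count means OFED loaded and the RDMA shared device plugin registered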
Phase 6: SR-IOV Operator + NetworkAttachmentDefinitions
Configure how many VFs come from each PF, and label them as schedulable resources.
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: rail-0
  namespace: nvidia-network-operator
spec:
  resourceName: rail0_vf
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  numVfs: 8
  nicSelector:
    pfNames: ["enp1s0"]
  deviceType: netdevice
  isRdma: true
Repeat for rail-1 through rail-7, each pointing to a different PF interface name.
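Rather than hand-editing eight nearly identical manifests, a short loop can stamp them out. The PF naming pattern below (enp1s0, enp2s0, ...) is an assumption; substitute the real interface names from your hosts:
# Generate the rail-0 .. rail-7 node policies
for i in $(seq 0 7); do
  pf="enp$((i + 1))s0"   # assumed naming pattern, replace with your PF names
  cat <<EOF | kubectl apply -f -
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: rail-${i}
  namespace: nvidia-network-operator
spec:
  resourceName: rail${i}_vf
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  numVfs: 8
  nicSelector:
    pfNames: ["${pf}"]
  deviceType: netdevice
  isRdma: true
EOF
done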
Then create one NetworkAttachmentDefinition (NAD) per rail:
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: sriov-rail-0
  namespace: ai-training
  annotations:
    k8s.v1.cni.cncf.io/resourceName: nvidia.com/rail0_vf
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "sriov",
      "name": "sriov-rail-0",
      "ipam": {
        "type": "whereabouts",
        "range": "10.50.0.0/16"
      }
    }
One NAD per rail (sriov-rail-0 through sriov-rail-7). The pod will reference these by name.
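Before moving on to pod specs, confirm the NADs exist and that the nodes advertise the per-rail resources the pods will request:
kubectl get network-attachment-definitions -n ai-training    # sriov-rail-0 .. sriov-rail-7
kubectl describe nodes | grep -E 'nvidia.com/rail[0-7]_vf'   # each rail should list the VF count from numVfs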
Phase 7: The pod spec template
You're done with cluster setup. Now write a training pod spec:
apiVersion: v1
kind: Pod
metadata:
  name: training-worker-0
  namespace: ai-training
  annotations:
    k8s.v1.cni.cncf.io/networks: |
      [
        {"name": "sriov-rail-0", "interface": "net1"},
        {"name": "sriov-rail-1", "interface": "net2"},
        {"name": "sriov-rail-2", "interface": "net3"},
        {"name": "sriov-rail-3", "interface": "net4"},
        {"name": "sriov-rail-4", "interface": "net5"},
        {"name": "sriov-rail-5", "interface": "net6"},
        {"name": "sriov-rail-6", "interface": "net7"},
        {"name": "sriov-rail-7", "interface": "net8"}
      ]
spec:
  containers:
  - name: trainer
    image: nvcr.io/nvidia/pytorch:24.10-py3
    resources:
      limits:
        nvidia.com/gpu: 8
        nvidia.com/rail0_vf: 1
        nvidia.com/rail1_vf: 1
        nvidia.com/rail2_vf: 1
        nvidia.com/rail3_vf: 1
        nvidia.com/rail4_vf: 1
        nvidia.com/rail5_vf: 1
        nvidia.com/rail6_vf: 1
        nvidia.com/rail7_vf: 1
        hugepages-1Gi: 16Gi
        memory: 1500Gi
    securityContext:
      capabilities:
        add: ["IPC_LOCK", "SYS_NICE"]
    env:
    - name: NCCL_IB_HCA
      value: "mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7"
    - name: NCCL_IB_GID_INDEX
      value: "3"
    - name: NCCL_IB_QPS_PER_CONNECTION
      value: "4"
    - name: NCCL_SOCKET_IFNAME
      value: "eth0"
The two things people miss:
- `IPC_LOCK` capability — required for RDMA memory pinning. Without it, `ibv_reg_mr` fails inside the container.
- NCCL env vars — without `NCCL_IB_HCA`, NCCL only uses the first NIC it finds. With it, you get all 8.
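Both are easy to confirm from a running pod. The checks below match the spec above and assume the image ships the rdma-core utilities (the NGC PyTorch images generally include them):
kubectl exec -n ai-training training-worker-0 -- ibv_devinfo | grep -c hca_id   # expect 8 RDMA devices
kubectl exec -n ai-training training-worker-0 -- ls /sys/class/net              # expect eth0, lo, net1..net8
kubectl exec -n ai-training training-worker-0 -- grep CapEff /proc/self/status  # decode with capsh --decode; must include cap_ipc_lock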
Tune DCQCN on each host
Once the cluster is up, enable DCQCN on the NICs. This is a one-time per-host config:
mlnx_qos -i enp1s0 --trust dscp
echo 1 > /sys/class/net/enp1s0/ecn/roce_np/enable/3
echo 1 > /sys/class/net/enp1s0/ecn/roce_rp/enable/3
# ... repeat for all 8 NIC interfaces
This tells the NIC to:
- Trust DSCP for QoS classification (match the switch config)
- Enable Notification Point (generate CNPs on incoming CE-marked packets)
- Enable Reaction Point (react to CNPs by dialing back send rate)
Both NP and RP must be enabled — every NIC is both a sender and receiver.
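Applying this by hand across 8 ports and every host gets tedious; a loop keeps it consistent (interface names are the same example names used earlier, substitute your own):
for ifc in enp1s0 enp2s0 enp3s0 enp4s0 enp5s0 enp6s0 enp7s0 enp8s0; do
    mlnx_qos -i "$ifc" --trust dscp
    echo 1 > "/sys/class/net/${ifc}/ecn/roce_np/enable/3"
    echo 1 > "/sys/class/net/${ifc}/ecn/roce_rp/enable/3"
done
# spot-check one port: both values should read 1
cat /sys/class/net/enp1s0/ecn/roce_np/enable/3 /sys/class/net/enp1s0/ecn/roce_rp/enable/3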
What you should remember
- The chain has one direction: BIOS → kernel → drivers → operators → Multus → CNI → NAD → pod. Don't skip steps.
- `IPC_LOCK` capability in the pod is the #1 forgotten setting.
- NCCL env vars matter — without `NCCL_IB_HCA` listing all 8 NICs, NCCL uses one and you lose 87.5% of bandwidth.
- DCQCN must be enabled on every NIC — the switch config alone isn't enough.
- Capture a working pod spec as a template. Most ops debugging is "diff this against the known-good spec."
Next: Validate & Run the First Training Job → — the validation pyramid. Prove the cluster works at every layer before trusting it.