Kernel Tuning for RDMA
A vanilla Linux install can't do RDMA at 400 G. The pieces required for RDMA — IOMMU, hugepages, a modern NIC driver, the right kernel parameters — have to be configured deliberately. This page is the kernel side of the chain that lets a pod actually use RDMA.
The kernel command line
Edit /etc/default/grub, change GRUB_CMDLINE_LINUX, then run update-grub (Debian/Ubuntu) or grub2-mkconfig -o /boot/grub2/grub.cfg (RHEL family), then reboot.
For an AI training host on Intel:
GRUB_CMDLINE_LINUX="intel_iommu=on iommu=pt default_hugepagesz=1G hugepagesz=1G hugepages=64 isolcpus=0-31 nohz_full=0-31 rcu_nocbs=0-31"
For AMD: amd_iommu=on iommu=pt.
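On the RHEL family you can also append the flags with grubby instead of hand-editing /etc/default/grub (a sketch; grubby is typically installed by default on RHEL 8 and later):
# Append the flags to every installed kernel entry
grubby --update-kernel=ALL --args="intel_iommu=on iommu=pt default_hugepagesz=1G hugepagesz=1G hugepages=64 isolcpus=0-31 nohz_full=0-31 rcu_nocbs=0-31"
# Confirm what the default entry will boot with
grubby --info=DEFAULT | grep args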
What each flag does:
| Flag | Why |
|---|---|
| intel_iommu=on / amd_iommu=on | Enables the IOMMU driver — required for any SR-IOV VF passthrough |
| iommu=pt | Passthrough mode — host PFs go direct, only VFs use IOMMU mappings (faster) |
| default_hugepagesz=1G hugepagesz=1G hugepages=64 | Reserves 64 × 1 GB hugepages (64 GB) at boot for DMA-able buffers |
| isolcpus=0-31 | Reserves cores 0-31 for the application — the kernel scheduler won't put other work there |
| nohz_full=0-31 | Disables the kernel tick on the isolated cores (less jitter for tight RDMA loops) |
| rcu_nocbs=0-31 | Moves RCU callbacks off the isolated cores (more jitter reduction) |
Verify after reboot:
cat /proc/cmdline # confirm flags applied
dmesg | grep -i iommu # should show "IOMMU enabled"
cat /proc/meminfo | grep Huge # should show 64 1G hugepages
If /proc/meminfo shows 0 hugepages, the kernel couldn't reserve them at boot. That usually means a flag is misspelled, or there isn't enough free memory early in boot for the reservation. Adding transparent_hugepage=never as an extra flag sometimes helps as well.
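As a fallback you can try a runtime allocation per NUMA node through sysfs (a sketch assuming two nodes and a 32/32 split; 1 GB pages are often impossible to find once memory has fragmented, so boot-time reservation remains the reliable path):
# Attempt runtime allocation of 1 GB hugepages, split across NUMA nodes
echo 32 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
echo 32 > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
# Check how many the kernel actually managed to allocate
grep HugePages_Total /proc/meminfo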
IOMMU and why ACS matters
The IOMMU (I/O Memory Management Unit) is the chip-level enforcement that says "this PCIe device can DMA into this memory and only this memory." Without it, any device with DMA can write anywhere in RAM.
ACS (Access Control Services) is a PCIe feature that lets the IOMMU isolate downstream devices from each other. Without ACS:
- Multiple VFs end up in the same IOMMU group
- They can't be safely assigned to different VMs/containers
- SR-IOV effectively doesn't work for multi-tenant use
To check your server's PCIe topology:
lspci -t # show the tree
ls /sys/kernel/iommu_groups/ # list IOMMU groups
# Each VF should be in its own group, or grouped only with related PF
If multiple VFs are in one group, ACS is off — check BIOS, enable VT-d/IOMMU, enable ACS, reboot. Some BIOSes hide ACS behind "PCI Express" → "Access Control Services" or similar.
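A quick way to see exactly which devices share a group is to walk sysfs (a minimal sketch; works on any PCIe device, not just NICs):
# Print every PCI device together with its IOMMU group number
for d in /sys/kernel/iommu_groups/*/devices/*; do
    g=${d#/sys/kernel/iommu_groups/}; g=${g%%/*}
    echo "group ${g}: $(lspci -nns ${d##*/})"
done | sort -n -k2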
Hugepages and why
DMA at 400 G needs large contiguous physical memory regions. The default Linux 4 KB page is too small: registering a 1 GB buffer would take 262,144 separate page mappings.
Hugepages are 2 MB or 1 GB physical pages. Reserving 64 × 1 GB at boot means the kernel won't fragment that memory, and the RDMA NIC can DMA into it with way fewer mappings.
The 1 GB hugepages need to be reserved at boot (the kernel can't reliably find 64 contiguous GBs in fragmented memory).
# Verify
cat /proc/meminfo | grep HugePages_Total # should be 64
cat /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
For applications to use them, buffers have to be allocated from a hugetlbfs mount such as /dev/hugepages (normally handled by the runtime; for k8s pods, request hugepages-1Gi in resources.limits).
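Most distributions already mount hugetlbfs at /dev/hugepages; a quick check, plus an explicit 1 GB-page mount if one is missing (the mount point name below is just an example):
# Is hugetlbfs mounted, and with which page size?
mount | grep hugetlbfs
# If not, mount a 1 GB-page instance explicitly
mkdir -p /dev/hugepages-1G
mount -t hugetlbfs -o pagesize=1G none /dev/hugepages-1G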
The NIC driver
Modern RDMA NICs use the mlx5 driver family (NVIDIA / Mellanox) or irdma (Intel) or vendor-specific equivalent. Install via:
# Distro packages (older but easier); mlnx-ofed-kernel-dkms needs the NVIDIA/Mellanox apt repo
apt install rdma-core libibverbs-dev ibverbs-utils mlnx-ofed-kernel-dkms
# Or NVIDIA's bundled OFED (recommended for production)
./mlnxofedinstall --upstream-libs --dpdk
Or — and this is the recommended path for k8s — let the NVIDIA Network Operator install the driver via a DaemonSet. This is what production clusters use because driver-versus-kernel-version compatibility is a pain to manage by hand.
After install, verify:
lsmod | grep -E "mlx5|ib_" # modules loaded
rdma link # RDMA devices visible
ibstat # per-device state
ibv_devinfo # detailed per-device info
Each port should appear as an mlx5_N RDMA device with link state ACTIVE.
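A quick functional test once the devices show ACTIVE is a back-to-back bandwidth run with the perftest tools (assumes the perftest package is installed and a second host is reachable over the RDMA fabric):
# On the server side
ib_write_bw -d mlx5_0 --report_gbits
# On the client side (replace <server-ip> with the server's address on the RDMA network)
ib_write_bw -d mlx5_0 --report_gbits <server-ip>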
SR-IOV — creating VFs
Once the driver is loaded and you've confirmed PFs are working, create VFs:
# 32 VFs on the first NIC port
echo 32 > /sys/class/net/enp1s0/device/sriov_numvfs
# Verify
ls /sys/class/net/enp1s0/device/virtfn* # one entry per VF
lspci | grep -i "Mellanox\|Connect" # PFs + VFs visible
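The parent PF's netdev also lists each VF with its MAC and spoof-check/trust state, which makes for a quick sanity check:
# Each VF shows up as a "vf N ..." line under the parent PF
ip link show enp1s0 | grep "vf " | head -5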
Persistence: the sriov_numvfs setting doesn't survive reboot. You need a systemd unit, udev rule, or boot script:
# /etc/systemd/system/sriov-numvfs.service
[Unit]
Description=Configure SR-IOV VFs
After=network-pre.target
[Service]
Type=oneshot
ExecStart=/bin/bash -c 'echo 32 > /sys/class/net/enp1s0/device/sriov_numvfs'
ExecStart=/bin/bash -c 'echo 32 > /sys/class/net/enp2s0/device/sriov_numvfs'
# ... repeat for all 8 NICs
[Install]
WantedBy=multi-user.target
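After writing the unit, reload systemd and enable it so the VFs come back on every boot:
systemctl daemon-reload
systemctl enable --now sriov-numvfs.service
# Confirm the VFs were created
cat /sys/class/net/enp1s0/device/sriov_numvfs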
In a k8s cluster with the SR-IOV Network Operator, this is handled declaratively via a SriovNetworkNodePolicy.
DCQCN — the RDMA congestion-control config
DCQCN runs on the NIC, but you enable it from Linux via the mlnx_qos tool and sysfs:
# Trust DSCP (so the NIC reads QoS from incoming packets)
mlnx_qos -i enp1s0 --trust dscp
# Enable Notification Point (generate CNPs for incoming CE-marked packets)
echo 1 > /sys/class/net/enp1s0/ecn/roce_np/enable/3
# Enable Reaction Point (react to CNPs by dialing back rate)
echo 1 > /sys/class/net/enp1s0/ecn/roce_rp/enable/3
# Repeat for every NIC interface
Without DCQCN enabled, PFC has to do all the work — and that means head-of-line blocking the moment any flow exceeds capacity.
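Rather than repeating the three commands by hand, a small loop over the RDMA-facing interfaces keeps the configuration consistent (the interface list is an example; adjust it to your rail naming):
# Apply trust-dscp plus DCQCN NP/RP on priority 3 for every backend interface
for ifc in enp1s0 enp2s0 enp3s0 enp4s0; do
    mlnx_qos -i "$ifc" --trust dscp
    echo 1 > /sys/class/net/"$ifc"/ecn/roce_np/enable/3
    echo 1 > /sys/class/net/"$ifc"/ecn/roce_rp/enable/3
done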
Sysctl knobs worth knowing
A few /etc/sysctl.conf settings that AI hosts often tune:
# Buffer sizes — bigger for high-bandwidth
net.core.rmem_max = 268435456
net.core.wmem_max = 268435456
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.core.netdev_max_backlog = 250000
# Larger listen backlog for management/control-plane services
net.core.somaxconn = 65535
# TCP — only matters for the eth0 mgmt path
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_congestion_control = bbr
# Routing — multiple rails
net.ipv4.conf.all.rp_filter = 2 # loose mode (not strict) for multi-rail
Apply with sysctl -p. Most of these don't affect RDMA performance directly (RDMA bypasses the kernel stack) but they do affect the eth0 management path and any TCP traffic.
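These settings are often kept in a drop-in under /etc/sysctl.d/ rather than in /etc/sysctl.conf itself, so package upgrades don't clobber them (the file name below is just an example):
vi /etc/sysctl.d/99-rdma-host.conf    # paste the settings above
sysctl --system                       # reloads /etc/sysctl.d/*.conf as well as /etc/sysctl.conf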
NUMA awareness
Modern AI servers have multiple NUMA nodes (typically 2). GPUs and NICs are wired to specific NUMA nodes. Cross-NUMA RDMA pays a latency tax and reduces throughput.
# Show NUMA topology
numactl --hardware
# Show which NUMA each NIC is on
cat /sys/class/net/enp1s0/device/numa_node
# Show which NUMA each GPU is on
nvidia-smi topo -m
In Kubernetes, the Topology Manager can enforce that a pod's GPU and NIC land on the same NUMA node at admission time. Set topologyManagerPolicy: single-numa-node and cpuManagerPolicy: static in the kubelet config.
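A quick way to eyeball the NIC-to-NUMA mapping across all interfaces (a sketch; a value of -1 means the platform didn't report a node for that device):
# Print the NUMA node of every NIC that exposes one
for n in /sys/class/net/*/device/numa_node; do
    printf "%s -> NUMA %s\n" "$(basename "$(dirname "$(dirname "$n")")")" "$(cat "$n")"
done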
Verifying everything is alive
After all the kernel-side config, run this checklist:
# 1. IOMMU enabled
dmesg | grep -i iommu | head -5
# 2. Hugepages reserved
grep Huge /proc/meminfo
# 3. RDMA driver loaded
lsmod | grep -E "mlx5|ib_"
# 4. RDMA devices visible
rdma link
# 5. VFs created
lspci | grep -i mellanox | head -10
# 6. DCQCN enabled
cat /sys/class/net/enp1s0/ecn/roce_rp/enable/3 # should be 1
# 7. NIC at expected speed
ethtool enp1s0 | grep Speed
Every line should produce a sensible output. If any is empty or wrong, fix that before moving on.
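If you run this checklist often, a small wrapper that prints PASS/FAIL per item saves squinting at raw output (a sketch, assuming the interface naming used above and the 64-page reservation):
#!/bin/bash
# Minimal pre-flight wrapper: one PASS/FAIL line per check
check() { desc=$1; shift; if "$@" >/dev/null 2>&1; then echo "PASS: $desc"; else echo "FAIL: $desc"; fi; }
check "IOMMU enabled"        sh -c 'dmesg | grep -qi iommu'
check "64 x 1G hugepages"    sh -c 'grep -q "HugePages_Total:[[:space:]]*64" /proc/meminfo'
check "mlx5 modules loaded"  sh -c 'lsmod | grep -q mlx5'
check "RDMA links ACTIVE"    sh -c 'rdma link | grep -q ACTIVE'
check "DCQCN RP on prio 3"   sh -c 'grep -q 1 /sys/class/net/enp1s0/ecn/roce_rp/enable/3'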
What you should remember
- Kernel cmdline is the foundation. intel_iommu=on (or amd_iommu=on), iommu=pt, and the hugepage flags are all required.
- IOMMU + ACS = safe SR-IOV. Without ACS, VFs aren't isolated.
- 64 × 1 GB hugepages = reserved at boot for DMA-able RDMA buffers.
- The NIC driver chain (mlx5_core, mlx5_ib, rdma_cm, ib_uverbs) has to be loaded for RDMA to work.
- DCQCN is enabled on every NIC via sysfs — don't forget this in production.
- NUMA awareness matters — GPU + NIC on the same NUMA, enforced by k8s Topology Manager.
Next: Debugging Tools You'll Use → — the ten Linux commands that 90% of host-side debugging boils down to.