NCCL and GPUDirect Configuration

You've configured the NICs, the QoS, the SR-IOV, the Multus attachments. The pod is running. The training script imports PyTorch and calls torch.distributed.all_reduce. Whether that AllReduce runs at 350 Gbps or 50 Gbps comes down to two things:

NCCL picking the right NICs — and especially, the right NIC for each GPU.
GPUDirect RDMA being active — so the NIC writes directly to GPU memory without bouncing through CPU DRAM.

This page covers the host-side last mile: getting the application to actually use the fabric you built.

Figure: GPUDirect RDMA comparison. Left panel, without GPUDirect: GPU HBM data bounces through host DRAM via the CPU before the NIC DMAs it to the wire; two PCIe traversals per direction, ~200 Gbps on a 400 Gbps NIC, CPU caches polluted. Right panel, with GPUDirect: the NIC DMAs straight from GPU HBM over PCIe peer-to-peer, bypassing the CPU and host DRAM; ~395 Gbps line rate, requires the nvidia_peermem kernel module. Same hardware, half or full throughput, depending on whether `nvidia_peermem` is loaded and PCIe peer-to-peer is functional between GPU and NIC.

What NCCL is

NCCL (NVIDIA Collective Communications Library) is the GPU-collective library that PyTorch, JAX, and most training frameworks use under the hood. When you call all_reduce(), NCCL is what plans the algorithm (ring vs tree), picks the NICs to use, and posts the RDMA WRITEs.

For an 8-GPU server in a rail-optimized cluster, NCCL needs to know:

Which NICs are available (8 of them)
Which NIC belongs to which rail
Which NIC is closest (PCIe-wise) to which GPU
How to map "I want to talk to peer rank N" to "I should use NIC M to talk to peer N's NIC M"

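If you don't tell it, NCCL falls back on whatever topology it can detect on its own. That guess is often wrong, especially inside VMs and containers where the PCIe layout is virtualized.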

GPU-to-NIC affinity

In an 8-GPU server, each GPU sits on a specific PCIe root complex / NUMA node, and so does each NIC. When a GPU and a NIC share a root complex, traffic between them takes a direct PCIe path. When they do not, the data has to cross the CPU (and, between sockets, the inter-socket link), which adds latency.

In a DGX-style server:

NUMA node 0:    GPU 0   GPU 1   GPU 2   GPU 3   NIC 0   NIC 1   NIC 2   NIC 3
NUMA node 1:    GPU 4   GPU 5   GPU 6   GPU 7   NIC 4   NIC 5   NIC 6   NIC 7

GPU 0 should use NIC 0 (same root complex, lowest-latency PCIe path). GPU 4 should use NIC 4. Cross-NUMA pairings (e.g., GPU 0 ↔ NIC 5) cross the inter-socket link (UPI on Intel, Infinity Fabric on AMD) and cost latency.

You tell NCCL this affinity via:

export NCCL_IB_GID_INDEX=3              # which GID (IP address) to use on the VF
export NCCL_IB_HCA=mlx5_0,mlx5_1,...    # explicit list of NICs to use
export NCCL_TOPO_FILE=/etc/nccl/topo.xml  # custom topology file

The NCCL_TOPO_FILE is a vendor-provided XML that tells NCCL exactly which NIC pairs with which GPU. NVIDIA publishes these for DGX boxes; for custom HGX / MGX systems you derive it from nvidia-smi topo -m output.
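Before writing or trusting a topology file, it helps to see where the kernel thinks each NIC lives. The following is a minimal sketch, assuming the RDMA devices are exposed under /sys/class/infiniband and that each one has a device/numa_node entry, as on typical Linux hosts. The function name nic_numa_nodes is hypothetical, and the sysfs root is a parameter only so the function can be tried against a fake directory tree.

```shell
#!/bin/sh
# Print "<hca> <numa_node>" for every RDMA device under a sysfs root.
# Assumption: devices appear as <root>/<hca>/device/numa_node.
# A value of -1 means the kernel reports no NUMA affinity (common in VMs).
nic_numa_nodes() (
  root="${1:-/sys/class/infiniband}"
  for dev in "$root"/*; do
    # Skip entries without a readable numa_node file (also covers an empty root).
    [ -r "$dev/device/numa_node" ] || continue
    printf '%s %s\n' "$(basename "$dev")" "$(cat "$dev/device/numa_node")"
  done
)

# Usage on a real host (not run here):
#   nic_numa_nodes        # NIC side
#   nvidia-smi topo -m    # GPU side, including the GPU-to-NIC connection matrix
```

Comparing the two outputs should show each GPU next to a NIC on the same NUMA node. If it does not, that is the mismatch a topology file is meant to correct.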

GPUDirect RDMA — the zero-copy path

Without GPUDirect, when a NIC sends data from GPU memory, the path is:

GPU HBM → PCIe → CPU DRAM (bounce buffer) → PCIe → NIC → wire

The CPU DRAM bounce doubles the PCIe traffic and adds 5–10 μs of latency per message. At 400 Gbps, this is a serious bottleneck.

With GPUDirect RDMA enabled, the NIC and GPU talk over PCIe directly:

GPU HBM → PCIe → NIC → wire

No CPU DRAM bounce. No CPU involvement. The data goes straight from one accelerator to another.

For GPUDirect to work, you need:

NVIDIA driver with GPUDirect support (default in recent drivers).
NIC and GPU on the same PCIe root complex (or at least within a switch that supports peer-to-peer DMA).
nvidia_peermem kernel module loaded — bridges the NVIDIA driver to the RDMA core.
NCCL configured to use it — usually automatic if everything else is in place.

Verify:

lsmod | grep nvidia_peermem      # should appear
modinfo nvidia_peermem           # check the module exists

Without GPUDirect, you'll see roughly half the bandwidth NCCL should achieve. It's a 2× difference, immediately observable in nccl-tests results.
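The module check above can be wrapped so that a node-preparation script fails loudly instead of silently running at half speed. This is a sketch only: the helper name peermem_loaded is hypothetical, and it assumes the usual lsmod layout where the module name is the first column.

```shell
#!/bin/sh
# Succeeds (exit 0) if the lsmod-style listing on stdin has a module
# named exactly nvidia_peermem in its first column.
peermem_loaded() {
  awk '$1 == "nvidia_peermem" { found = 1 } END { exit !found }'
}

# Usage on a real host (not run here):
#   lsmod | peermem_loaded || { echo "nvidia_peermem missing" >&2; exit 1; }
# Loading it typically requires root:
#   modprobe nvidia_peermem
```

Matching on the first column avoids counting lines where nvidia_peermem only appears in another module's "Used by" list.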

The variables that matter

The NCCL_* environment variables are how you tell NCCL what to do. The shortlist for rail-optimized RoCE v2:

Variable	Typical value	What it does
`NCCL_DEBUG`	`INFO` (debug) / `WARN` (prod)	Verbosity
`NCCL_IB_HCA`	`mlx5_0,mlx5_1,...,mlx5_7`	NICs to use
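`NCCL_IB_GID_INDEX`	`3` (commonly the RoCE v2 IPv4 GID)	Which GID table entry (IP) to use on each NIC
`NCCL_IB_TIMEOUT`	`22`	RDMA ACK timeout exponent: 4.096 µs × 2^n (about 17 s at 22); the default is lower and depends on the NCCL version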
`NCCL_IB_RETRY_CNT`	`7`	RDMA retry count
`NCCL_IB_QPS_PER_CONNECTION`	`4`	Parallel QPs per peer (for hash entropy)
`NCCL_SOCKET_IFNAME`	`eth0`	Interface for bootstrap (NOT data)
`NCCL_IB_SL`	`3`	Service level / priority for RDMA traffic
`NCCL_TOPO_FILE`	`/etc/nccl/topo.xml`	Topology file path
`NCCL_NET_GDR_LEVEL`	`PHB`	Maximum GPU-to-NIC topological distance at which GPUDirect RDMA is still used (prefer the string form over legacy integers)
`NCCL_P2P_LEVEL`	`NVL`	NVLink-only for intra-node GPU↔GPU

Most NVIDIA reference container images set sensible defaults. For a custom setup, copy from NVIDIA's nccl-tests README and tune from there.
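As a starting point, the variables can be collected in one file that is sourced before the job launches. The sketch below reuses the typical values from the table. The file name nccl-env.sh and the helper hca_list are hypothetical. Building NCCL_IB_HCA from /sys/class/infiniband is an assumption that only holds if every RDMA device on the host belongs to the training fabric, so trim the list by hand if a management NIC shows up there.

```shell
#!/bin/sh
# nccl-env.sh (hypothetical name): source this before launching training.
# Values are the "typical" ones from the table, not tuned settings.

# Join the names of all RDMA devices under a sysfs root with commas.
hca_list() (
  root="${1:-/sys/class/infiniband}"
  for dev in "$root"/*; do
    [ -e "$dev" ] && basename "$dev"
  done | paste -sd, -
)

export NCCL_DEBUG=WARN                # switch to INFO while debugging
export NCCL_SOCKET_IFNAME=eth0        # bootstrap only, never the RDMA interface
export NCCL_IB_GID_INDEX=3
export NCCL_IB_SL=3
export NCCL_IB_TIMEOUT=22
export NCCL_IB_RETRY_CNT=7
export NCCL_IB_QPS_PER_CONNECTION=4
export NCCL_NET_GDR_LEVEL=PHB
export NCCL_P2P_LEVEL=NVL

# Only set NCCL_IB_HCA when at least one device was found.
hcas="$(hca_list)"
if [ -n "$hcas" ]; then
  export NCCL_IB_HCA="$hcas"
fi
```

Keeping the settings in one sourced file makes it easier to diff what a working node and a slow node were actually launched with.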

How to verify it's working

Three checks:

1. `nccl-tests/all_reduce_perf`

NVIDIA's reference benchmark. Expected throughput on an 8-GPU 8-NIC RoCE v2 node:

Message size	Expected algbw
1 MB	~50 GB/s (limited by latency)
64 MB	~200 GB/s
1 GB	~300+ GB/s (close to NIC line rate × 8)

If you're seeing half these numbers, GPUDirect probably isn't active.
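all_reduce_perf prints both algbw and busbw, and the two are easy to confuse when comparing against a table like the one above. For AllReduce, nccl-tests defines busbw = algbw × 2(n−1)/n, where n is the number of ranks. The sketch below applies that formula; the helper name allreduce_busbw is hypothetical, and the build path in the usage comment assumes a default nccl-tests build.

```shell
#!/bin/sh
# Convert an AllReduce algbw figure (GB/s) to bus bandwidth for n ranks,
# using the nccl-tests definition: busbw = algbw * 2 * (n - 1) / n.
allreduce_busbw() {
  awk -v a="$1" -v n="$2" 'BEGIN { printf "%.1f\n", a * 2 * (n - 1) / n }'
}

# Usage on a real host (not run here), sweeping 1 MB to 1 GB on 8 GPUs:
#   ./build/all_reduce_perf -b 1M -e 1G -f 2 -g 8
# Then, for a row reporting algbw of 200 GB/s across 8 ranks:
#   allreduce_busbw 200 8
```

busbw is the figure to compare against link speed, because it accounts for the extra data an AllReduce moves per rank.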

2. `ib_write_bw` between two pods

Pod A:

ib_write_bw -d mlx5_0 --use_cuda=0 -s 65536 -n 10000

Pod B (connect to A):

ib_write_bw -d mlx5_0 --use_cuda=0 -s 65536 -n 10000 <pod-a-ip>

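Should hit ~95% of line rate on a single rail. Because `--use_cuda=0` places the test buffers in GPU 0's memory, this also exercises the GPUDirect path. If the result falls short, NIC config or QoS is off.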
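The 95% rule of thumb can be turned into a pass/fail check for a per-rail smoke test. This sketch assumes ib_write_bw was run with --report_gbits so the average bandwidth is in Gb/s. The helper name meets_line_rate and the 400 Gb/s figure in the usage comment are illustrative.

```shell
#!/bin/sh
# Succeeds if measured bandwidth is at least pct percent of line rate.
#   $1 = measured Gb/s, $2 = line rate Gb/s, $3 = threshold percent (default 95)
meets_line_rate() {
  awk -v m="$1" -v l="$2" -v p="${3:-95}" 'BEGIN { exit !(m >= l * p / 100) }'
}

# Usage (not run here), with a hypothetical 400 Gb/s NIC:
#   meets_line_rate 385 400 && echo "rail OK" || echo "rail below 95% of line rate"
```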

3. NCCL debug output

export NCCL_DEBUG=INFO

Look for:

NET/IB: Using ... lines (should list 8 HCAs)
NET/Plugin: Using internal P2P plugin (good — using RDMA)
using NCCL_P2P_LEVEL=NVL (good — intra-node uses NVLink)
No WARN lines, in particular no WARN: skipping ... messages
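With eight ranks per node the INFO output gets long, so it can help to write it to a file and count the interesting lines. This is a rough sketch: the helper name nccl_log_summary is hypothetical. The GDRDMA marker is an assumption based on how NCCL commonly tags channels that use GPUDirect RDMA, and the exact wording may differ between NCCL versions.

```shell
#!/bin/sh
# Summarize an NCCL debug log: count lines mentioning the IB transport,
# lines tagged GDRDMA (assumed marker for GPUDirect RDMA), and warnings.
nccl_log_summary() (
  log="$1"
  printf 'net_ib_lines=%s\n' "$(grep -c 'NET/IB' "$log")"
  printf 'gdrdma_lines=%s\n' "$(grep -c 'GDRDMA' "$log")"
  printf 'warn_lines=%s\n'   "$(grep -c 'WARN' "$log")"
)

# Usage on a real host (not run here):
#   export NCCL_DEBUG=INFO NCCL_DEBUG_FILE=/tmp/nccl.%h.%p.log
#   ... run the job, then pick one of the generated files ...
#   nccl_log_summary /tmp/nccl.<host>.<pid>.log
```

Zero gdrdma_lines together with a non-zero net_ib_lines count would suggest the RDMA transport is in use but is bouncing through host memory.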

What can go wrong

The classic failures and their causes:

Symptom	Cause
AllReduce uses only 1 NIC	`NCCL_IB_HCA` not set; NCCL discovered 1 NIC
AllReduce throughput is half expected	GPUDirect inactive — check `nvidia_peermem` module
Some pairs are slow, others fast	NUMA topology mismatch; check topo.xml
Random hangs after many iterations	`NCCL_IB_TIMEOUT` too short for the actual network RTT
"NCCL ... unhandled cuda error"	GPU driver / CUDA / NCCL version mismatch
Bootstrap fails	`NCCL_SOCKET_IFNAME` pointing at the wrong (RDMA) interface
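When chasing the timeout row, it helps to know what a given NCCL_IB_TIMEOUT value means in wall-clock terms. The value is an exponent, not a duration: the ACK timeout is 4.096 µs × 2^n. A small sketch of that arithmetic, with a hypothetical helper name:

```shell
#!/bin/sh
# Convert an NCCL_IB_TIMEOUT exponent n into seconds: 4.096 us * 2^n.
ib_timeout_seconds() {
  awk -v n="$1" 'BEGIN { printf "%.2f\n", 4.096e-6 * 2 ^ n }'
}

# Usage:
#   ib_timeout_seconds 22    # the "typical value" used on this page
#   ib_timeout_seconds 18    # a lower exponent for comparison
```

Each increment doubles the timeout, so going from 18 to 22 is a 16× increase. The retry count (NCCL_IB_RETRY_CNT) then multiplies how long a dead link takes to surface as an error.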

What you should remember

NCCL is what your training framework actually uses for collectives. Configure it right or training is slow.
GPU-to-NIC affinity matters: each GPU should pair with the NIC on its NUMA node. Get this wrong and you pay for it in latency.
GPUDirect RDMA is mandatory at 400 G+ — without it, throughput halves. Verify nvidia_peermem is loaded.
NCCL_IB_HCA is the most important env var — tells NCCL which NICs to use.
NCCL_SOCKET_IFNAME is for bootstrap only — don't accidentally point it at the RDMA interface.
Always verify with nccl-tests before training. Catch performance issues early.

Next: more sections — Inference Networking, Production Operations. For now, head back to the curriculum index or revisit Switch QoS to align with how the host configures DSCP/SL.
