NCCL and GPUDirect Configuration

You've configured the NICs, the QoS, the SR-IOV, the Multus attachments. The pod is running. The training script imports PyTorch and calls torch.distributed.all_reduce. Whether that AllReduce runs at 350 Gbps or 50 Gbps comes down to two things:

  1. NCCL picking the right NICs — and especially, the right NIC for each GPU.
  2. GPUDirect RDMA being active — so the NIC writes directly to GPU memory without bouncing through CPU DRAM.

This page covers the host-side last mile: getting the application to actually see the fabric you built.

Figure: GPUDirect RDMA comparison. Without GPUDirect (left), GPU HBM data bounces through host DRAM via the CPU before the NIC DMAs it to the wire: two PCIe traversals per direction, roughly 200 Gbps on a 400 Gbps NIC, and polluted CPU caches. With GPUDirect (right), the NIC DMAs straight from GPU HBM over PCIe peer-to-peer, bypassing the CPU and host DRAM entirely at ~395 Gbps, provided the nvidia_peermem kernel module is loaded.
Same hardware, half or full throughput — depending on whether `nvidia_peermem` is loaded and PCIe peer-to-peer is functional between GPU and NIC.

What NCCL is

NCCL (NVIDIA Collective Communications Library) is the GPU-collective library that PyTorch, JAX, and most training frameworks use under the hood. When you call all_reduce(), NCCL is what plans the algorithm (ring vs tree), picks the NICs to use, and posts the RDMA WRITEs.

For an 8-GPU server in a rail-optimized cluster, NCCL needs to know:

  • Which NICs are available (8 of them)
  • Which NIC belongs to which rail
  • Which NIC is closest (PCIe-wise) to which GPU
  • How to map "I want to talk to peer rank N" to "I should use NIC M to talk to peer N's NIC M"

Without telling it, NCCL guesses — often wrongly.
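
A quick way to see what NCCL would discover on its own is to list the host's RDMA devices; a minimal sketch using standard rdma-core tooling:

ibv_devices                      # RDMA devices visible to verbs consumers such as NCCL
ls /sys/class/infiniband/        # same list via sysfs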


GPU-to-NIC affinity

In an 8-GPU server, each GPU sits on a specific PCIe root complex / NUMA node. Each NIC also sits on one. Same root complex = direct PCIe path. Different root complex = data hops across the CPU's PCIe switches, adding latency.

In a DGX-style server:

NUMA node 0: GPU 0 GPU 1 GPU 2 GPU 3 NIC 0 NIC 1 NIC 2 NIC 3
NUMA node 1: GPU 4 GPU 5 GPU 6 GPU 7 NIC 4 NIC 5 NIC 6 NIC 7

GPU 0 should use NIC 0 (same root complex, lowest-latency PCIe path). GPU 4 should use NIC 4. Cross-NUMA pairings (e.g., GPU 0 ↔ NIC 5) hop through QPI / Infinity Fabric and cost latency.
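
You can check these placements directly. A sketch using standard tooling (device names are examples; yours will differ):

nvidia-smi topo -m                                     # GPU↔GPU and GPU↔NIC PCIe relationship matrix
nvidia-smi --query-gpu=index,pci.bus_id --format=csv   # PCIe address of each GPU
cat /sys/class/infiniband/mlx5_0/device/numa_node      # NUMA node the NIC hangs off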

You tell NCCL this affinity via:

export NCCL_IB_GID_INDEX=3 # which GID (IP address) to use on the VF
export NCCL_IB_HCA=mlx5_0,mlx5_1,... # explicit list of NICs to use
export NCCL_TOPO_FILE=/etc/nccl/topo.xml # custom topology file
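
Index 3 is a common convention for the RoCE v2 + IPv4 GID, but it's worth verifying rather than assuming. A sketch (show_gids ships with MLNX_OFED; the sysfs reads are the generic alternative):

show_gids mlx5_0                                             # table of GID index → type → IP
cat /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/types/3   # expect "RoCE v2"
cat /sys/class/infiniband/mlx5_0/ports/1/gids/3              # the GID itself (IPv4-mapped for RoCE v2)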

The NCCL_TOPO_FILE is a vendor-provided XML that tells NCCL exactly which NIC pairs with which GPU. NVIDIA publishes these for DGX boxes; for custom HGX / MGX systems you derive it from nvidia-smi topo -m output.


GPUDirect RDMA — the zero-copy path

Without GPUDirect, when a NIC sends data from GPU memory, the path is:

GPU HBM → PCIe → CPU DRAM (bounce buffer) → PCIe → NIC → wire

The CPU DRAM bounce doubles the PCIe traffic and adds 5–10 μs of latency per message. At 400 Gbps, this is a serious bottleneck.
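
Back-of-envelope, using the 400 Gbps per-NIC figure from above:

400 Gbps ≈ 50 GB/s per NIC per direction
Bounce path: two PCIe traversals (~100 GB/s of aggregate PCIe traffic) plus one DRAM write and one DRAM read (~100 GB/s of memory bandwidth) per NIC
× 8 NICs ≈ 800 GB/s of DRAM traffic at line rate, more than a typical two-socket server's memory system can sustain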

With GPUDirect RDMA enabled, the NIC and GPU talk over PCIe directly:

GPU HBM → PCIe → NIC → wire

No CPU DRAM bounce. No CPU involvement. The data goes straight from one accelerator to another.

For GPUDirect to work, you need:

  1. NVIDIA driver with GPUDirect support (default in recent drivers).
  2. NIC and GPU on the same PCIe root complex (or at least within a switch that supports peer-to-peer DMA).
  3. nvidia_peermem kernel module loaded — bridges the NVIDIA driver to the RDMA core.
  4. NCCL configured to use it — usually automatic if everything else is in place.

Verify:

lsmod | grep nvidia_peermem # should appear
modinfo nvidia_peermem # check the module exists
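
If modinfo finds the module but lsmod doesn't list it, loading it is usually all that's needed. A sketch (the modules-load.d path assumes a systemd-based distro):

sudo modprobe nvidia_peermem                                              # load the bridge module now
echo nvidia_peermem | sudo tee /etc/modules-load.d/nvidia-peermem.conf    # load it on every boot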

Without GPUDirect, you'll see roughly half the bandwidth NCCL should achieve. It's a 2× difference, immediately observable in nccl-tests results.


The variables that matter

The NCCL_* environment variables are how you tell NCCL what to do. The shortlist for rail-optimized RoCE v2:

Variable | Typical value | What it does
NCCL_DEBUG | INFO (debug) / WARN (prod) | Verbosity
NCCL_IB_HCA | mlx5_0,mlx5_1,...,mlx5_7 | NICs to use
NCCL_IB_GID_INDEX | 3 (for RoCE v2 RDMA-CM) | Which GID (IP) per NIC
NCCL_IB_TIMEOUT | 22 | RDMA timeout (4.096 µs × 2^n)
NCCL_IB_RETRY_CNT | 7 | RDMA retry count
NCCL_IB_QPS_PER_CONNECTION | 4 | Parallel QPs per peer (for hash entropy)
NCCL_SOCKET_IFNAME | eth0 | Interface for bootstrap (NOT data)
NCCL_IB_SL | 3 | Service level / priority for RDMA traffic
NCCL_TOPO_FILE | /etc/nccl/topo.xml | Topology file path
NCCL_NET_GDR_LEVEL | 5 (PHB) | GPUDirect aggression level
NCCL_P2P_LEVEL | NVL | NVLink-only for intra-node GPU↔GPU

Most NVIDIA reference container images set sensible defaults. For a custom setup, copy from NVIDIA's nccl-tests README and tune from there.
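
Pulling the table together, a minimal sketch of an env block for an 8-NIC rail-optimized RoCE v2 node (the values are starting points, not universal truths):

export NCCL_DEBUG=WARN                        # INFO while debugging
export NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7
export NCCL_IB_GID_INDEX=3                    # verify with show_gids
export NCCL_IB_TIMEOUT=22
export NCCL_IB_RETRY_CNT=7
export NCCL_IB_QPS_PER_CONNECTION=4
export NCCL_IB_SL=3                           # must match the switch QoS mapping
export NCCL_SOCKET_IFNAME=eth0                # bootstrap only, not data
export NCCL_NET_GDR_LEVEL=PHB
export NCCL_P2P_LEVEL=NVL
export NCCL_TOPO_FILE=/etc/nccl/topo.xml      # only if you have a topology file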


How to verify it's working

Three checks:

1. nccl-tests/all_reduce_perf

NVIDIA's reference benchmark. Expected throughput on an 8-GPU 8-NIC RoCE v2 node:

Message size | Expected algbw
1 MB | ~50 GB/s (limited by latency)
64 MB | ~200 GB/s
1 GB | ~300+ GB/s (close to NIC line rate × 8)

If you're seeing half these numbers, GPUDirect probably isn't active.
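
A sketch of launching it, assuming nccl-tests is built at /opt/nccl-tests and an MPI launcher is available (both are assumptions; torchrun-based wrappers work too):

mpirun -np 8 /opt/nccl-tests/build/all_reduce_perf -b 1M -e 1G -f 2 -g 1   # sweep 1 MB → 1 GB, one GPU per rank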

2. ib_write_bw between two pods

Pod A:

ib_write_bw -d mlx5_0 --use_cuda=0 -s 65536 -n 10000

Pod B (connect to A):

ib_write_bw -d mlx5_0 --use_cuda=0 -s 65536 -n 10000 <pod-a-ip>

Should hit ~95% of line rate on a single rail. If not, NIC config or QoS is off.

3. NCCL debug output

export NCCL_DEBUG=INFO

Look for:

  • NET/IB: Using ... lines (should list 8 HCAs)
  • NET/Plugin: Using internal P2P plugin (good — using RDMA)
  • using NCCL_P2P_LEVEL=NVL (good — intra-node uses NVLink)
  • No WARN or WARN: skipping ... messages
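
One more signal worth grepping for (exact wording varies by NCCL version): when GPUDirect RDMA is actually in use, the per-channel transport lines mention GDRDMA.

NCCL_DEBUG=INFO python train.py 2>&1 | grep -E "NET/IB|GDRDMA"   # train.py is a placeholder for your launcher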

What can go wrong

The classic failures and their causes:

Symptom | Cause
AllReduce uses only 1 NIC | NCCL_IB_HCA not set; NCCL discovered 1 NIC
AllReduce throughput is half expected | GPUDirect inactive — check nvidia_peermem module
Some pairs are slow, others fast | NUMA topology mismatch; check topo.xml
Random hangs after many iterations | NCCL_IB_TIMEOUT too short for the actual network RTT
"NCCL ... unhandled cuda error" | GPU driver / CUDA / NCCL version mismatch
Bootstrap fails | NCCL_SOCKET_IFNAME pointing at the wrong (RDMA) interface
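
A quick host-side triage sketch covering the most common rows (ibdev2netdev ships with MLNX_OFED; adjust device names for your system):

lsmod | grep nvidia_peermem     # GPUDirect bridge loaded?
ibdev2netdev                    # HCA ↔ netdev mapping; confirms all 8 NICs are present and up
nvidia-smi topo -m              # GPU ↔ NIC PCIe relationships
env | grep '^NCCL_'             # what NCCL was actually told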

What you should remember

  • NCCL is what your training framework actually uses for collectives. Configure it right or training is slow.
  • GPU-to-NIC affinity matters — each GPU should pair with the NIC on its NUMA node. Get this wrong and every transfer pays a cross-socket latency penalty.
  • GPUDirect RDMA is mandatory at 400 G+ — without it, throughput halves. Verify nvidia_peermem is loaded.
  • NCCL_IB_HCA is the most important env var — tells NCCL which NICs to use.
  • NCCL_SOCKET_IFNAME is for bootstrap only — don't accidentally point it at the RDMA interface.
  • Always verify with nccl-tests before training. Catch performance issues early.

Next: more sections — Inference Networking, Production Operations. For now, head back to the curriculum index or revisit Switch QoS to align with how the host configures DSCP/SL.