NCCL and GPUDirect Configuration
You've configured the NICs, the QoS, the SR-IOV, the Multus attachments. The pod is running. The training script imports PyTorch and calls torch.distributed.all_reduce. Whether that AllReduce runs at 350 Gbps or 50 Gbps comes down to two things:
- NCCL picking the right NICs — and especially, the right NIC for each GPU.
- GPUDirect RDMA being active — so the NIC writes directly to GPU memory without bouncing through CPU DRAM.
This page is the host-side last mile: getting the application to actually see the fabric you built.
What NCCL is
NCCL (NVIDIA Collective Communications Library) is the GPU-collective library that PyTorch, JAX, and most training frameworks use under the hood. When you call all_reduce(), NCCL is what plans the algorithm (ring vs tree), picks the NICs to use, and posts the RDMA WRITEs.
For an 8-GPU server in a rail-optimized cluster, NCCL needs to know:
- Which NICs are available (8 of them)
- Which NIC belongs to which rail
- Which NIC is closest (PCIe-wise) to which GPU
- How to map "I want to talk to peer rank N" to "I should use NIC M to talk to peer N's NIC M"
If you don't tell it, NCCL guesses — often wrongly.
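Before overriding anything, it's worth listing what the host actually exposes. A minimal sketch, assuming the MLNX_OFED / DOCA utilities are installed (the device names shown in the comments are illustrative):
ibv_devices    # lists the RDMA devices; an 8-rail node should show mlx5_0 through mlx5_7
ibdev2netdev   # maps each HCA to its netdev, e.g. "mlx5_0 port 1 ==> eth1 (Up)"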
GPU-to-NIC affinity
In an 8-GPU server, each GPU sits on a specific PCIe root complex / NUMA node. Each NIC also sits on one. Same root complex = direct PCIe path. Different root complex = data hops across the CPU's PCIe switches, adding latency.
In a DGX-style server:
NUMA node 0: GPUs 0–3, NICs 0–3
NUMA node 1: GPUs 4–7, NICs 4–7
GPU 0 should use NIC 0 (same root complex, lowest-latency PCIe path). GPU 4 should use NIC 4. Cross-NUMA pairings (e.g., GPU 0 ↔ NIC 5) hop across the CPU-to-CPU interconnect (UPI / Infinity Fabric) and add latency.
You tell NCCL this affinity via:
export NCCL_IB_GID_INDEX=3 # which GID (IP address) to use on the VF
export NCCL_IB_HCA=mlx5_0,mlx5_1,... # explicit list of NICs to use
export NCCL_TOPO_FILE=/etc/nccl/topo.xml # custom topology file
The NCCL_TOPO_FILE is a vendor-provided XML that tells NCCL exactly which NIC pairs with which GPU. NVIDIA publishes these for DGX boxes; for custom HGX / MGX systems you derive it from nvidia-smi topo -m output.
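As a sketch of what to look for in that output, the interesting cells are the GPU-to-NIC intersections; the codes are NVIDIA's standard PCIe-distance labels:
nvidia-smi topo -m
# For each GPUx / NICx pair in the matrix:
#   PIX       = same PCIe switch             -> ideal pairing for that GPU's rail
#   PXB / PHB = further up the PCIe tree     -> workable, slightly more latency
#   SYS       = crosses the CPU interconnect -> avoid for the data path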
GPUDirect RDMA — the zero-copy path
Without GPUDirect, when a NIC sends data from GPU memory, the path is:
GPU HBM → PCIe → CPU DRAM (bounce buffer) → PCIe → NIC → wire
The CPU DRAM bounce doubles the PCIe traffic and adds 5–10 μs of latency per message. At 400 Gbps, this is a serious bottleneck.
With GPUDirect RDMA enabled, the NIC and GPU talk over PCIe directly:
GPU HBM → PCIe → NIC → wire
No CPU DRAM bounce. No CPU involvement. The data goes straight from one accelerator to another.
For GPUDirect to work, you need:
- NVIDIA driver with GPUDirect support (default in recent drivers).
- NIC and GPU on the same PCIe root complex (or at least within a switch that supports peer-to-peer DMA).
- nvidia_peermem kernel module loaded — bridges the NVIDIA driver to the RDMA core.
- NCCL configured to use it — usually automatic if everything else is in place.
Verify:
lsmod | grep nvidia_peermem # should appear
modinfo nvidia_peermem # check the module exists
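If modinfo finds the module but lsmod does not, loading it is usually enough; the module ships with recent NVIDIA driver packages (packaging details vary by distro):
sudo modprobe nvidia_peermem
lsmod | grep nvidia_peermem   # should now appear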
Without GPUDirect, you'll see roughly half the bandwidth NCCL should achieve. It's a 2× difference, immediately observable in nccl-tests results.
The variables that matter
The NCCL_* environment variables are how you tell NCCL what to do. The shortlist for rail-optimized RoCE v2:
| Variable | Typical value | What it does |
|---|---|---|
| NCCL_DEBUG | INFO (debug) / WARN (prod) | Verbosity |
| NCCL_IB_HCA | mlx5_0,mlx5_1,...,mlx5_7 | NICs to use |
| NCCL_IB_GID_INDEX | 3 (for RoCE v2) | Which GID (IP) per NIC |
| NCCL_IB_TIMEOUT | 22 | RDMA retry timeout (4.096 µs × 2^value) |
| NCCL_IB_RETRY_CNT | 7 | RDMA retry count |
| NCCL_IB_QPS_PER_CONNECTION | 4 | Parallel QPs per peer (more ECMP hash entropy) |
| NCCL_SOCKET_IFNAME | eth0 | Interface for bootstrap (NOT data) |
| NCCL_IB_SL | 3 | Service level / priority for RDMA traffic |
| NCCL_TOPO_FILE | /etc/nccl/topo.xml | Topology file path |
| NCCL_NET_GDR_LEVEL | PHB (5) | Max PCIe distance at which GPUDirect is used (PHB = same host bridge) |
| NCCL_P2P_LEVEL | NVL | Use P2P only over NVLink for intra-node GPU↔GPU |
Most NVIDIA reference container images set sensible defaults. For a custom setup, copy from NVIDIA's nccl-tests README and tune from there.
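Pulled together, a starting point for a rail-optimized RoCE v2 node might look like the sketch below; the values mirror the table above, and the NIC names, interface name, and topology path are assumptions to adapt per host:
export NCCL_DEBUG=WARN                      # INFO while debugging
export NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7
export NCCL_IB_GID_INDEX=3                  # RoCE v2 GID
export NCCL_IB_TIMEOUT=22
export NCCL_IB_RETRY_CNT=7
export NCCL_IB_QPS_PER_CONNECTION=4
export NCCL_IB_SL=3                         # must match the switch QoS / DSCP mapping
export NCCL_SOCKET_IFNAME=eth0              # bootstrap only, never the rails
export NCCL_TOPO_FILE=/etc/nccl/topo.xml
export NCCL_NET_GDR_LEVEL=PHB
export NCCL_P2P_LEVEL=NVL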
How to verify it's working
Three checks:
1. nccl-tests/all_reduce_perf
NVIDIA's reference benchmark; an example invocation follows the table. Expected throughput on an 8-GPU 8-NIC RoCE v2 node:
| Message size | Expected algbw |
|---|---|
| 1 MB | ~50 GB/s (limited by latency) |
| 64 MB | ~200 GB/s |
| 1 GB | ~300+ GB/s (close to NIC line rate × 8) |
If you're seeing half these numbers, GPUDirect probably isn't active.
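A two-node run is the minimum that actually exercises the NICs (on a single node, AllReduce stays on NVLink per NCCL_P2P_LEVEL above). A sketch assuming OpenMPI and nccl-tests built in ./build; hostnames are hypothetical:
# 16 ranks, one per GPU, across two 8-GPU nodes
mpirun -np 16 -H node1:8,node2:8 \
    -x NCCL_IB_HCA -x NCCL_IB_GID_INDEX -x NCCL_SOCKET_IFNAME \
    ./build/all_reduce_perf -b 1M -e 1G -f 2 -g 1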
2. ib_write_bw between two pods
Pod A:
ib_write_bw -d mlx5_0 --use_cuda=0 -s 65536 -n 10000
Pod B (connect to A):
ib_write_bw -d mlx5_0 --use_cuda=0 -s 65536 -n 10000 <pod-a-ip>
Should hit ~95% of line rate on a single rail. If not, NIC config or QoS is off.
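The same test can be driven from outside the pods with kubectl exec; the pod names are hypothetical, and <pod-a-ip> is the address on Pod A's RDMA interface:
kubectl exec -it netperf-a -- ib_write_bw -d mlx5_0 --use_cuda=0 -s 65536 -n 10000
kubectl exec -it netperf-b -- ib_write_bw -d mlx5_0 --use_cuda=0 -s 65536 -n 10000 <pod-a-ip>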
3. NCCL debug output
export NCCL_DEBUG=INFO
Look for:
- NET/IB: Using ... lines (should list 8 HCAs)
- NET/Plugin: Using internal P2P plugin (good — using RDMA)
- using NCCL_P2P_LEVEL=NVL (good — intra-node uses NVLink)
- No WARN or WARN: skipping ... messages
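One way to capture just those lines, assuming the same benchmark binary as above (the subsystem filter is optional but keeps the log readable):
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,NET,ENV \
    ./build/all_reduce_perf -b 8M -e 8M -g 8 2>&1 | grep -E "NET/IB|NET/Plugin|P2P_LEVEL|WARN"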
What can go wrong
The classic failures and their causes:
| Symptom | Cause |
|---|---|
| AllReduce uses only 1 NIC | NCCL_IB_HCA not set; NCCL discovered 1 NIC |
| AllReduce throughput is half expected | GPUDirect inactive — check nvidia_peermem module |
| Some pairs are slow, others fast | NUMA topology mismatch; check topo.xml |
| Random hangs after many iterations | NCCL_IB_TIMEOUT too short for the actual network RTT |
| "NCCL ... unhandled cuda error" | GPU driver / CUDA / NCCL version mismatch |
| Bootstrap fails | NCCL_SOCKET_IFNAME pointing at the wrong (RDMA) interface |
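For the two failure modes that show up most often, the first things to try are quick; the values are illustrative starting points, not guaranteed fixes:
# Random hangs after many iterations: give the QPs a longer retry timeout
export NCCL_IB_TIMEOUT=23      # 4.096 µs × 2^23 ≈ 34 s per retry

# Bootstrap failures: point the bootstrap socket at the management interface,
# never at an RDMA/SR-IOV interface (interface name is illustrative)
export NCCL_SOCKET_IFNAME=eth0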
What you should remember
- NCCL is what your training framework actually uses for collectives. Configure it right or training is slow.
- GPU-to-NIC affinity matters: each GPU should pair with the NIC on its NUMA node. Get this wrong and every transfer pays a cross-NUMA latency penalty.
- GPUDirect RDMA is mandatory at 400 G+ — without it, throughput halves. Verify nvidia_peermem is loaded.
- NCCL_IB_HCA is the most important env var — it tells NCCL which NICs to use.
- NCCL_SOCKET_IFNAME is for bootstrap only — don't accidentally point it at the RDMA interface.
- Always verify with nccl-tests before training. Catch performance issues early.
Next: more sections — Inference Networking, Production Operations. For now, head back to the curriculum index or revisit Switch QoS to align with how the host configures DSCP/SL.