Inside the Libraries — NCCL, UCX & SHARP
Three libraries carry almost all production AI traffic. Here's what each one does, and the knobs you'll actually turn.
- Read NCCL's init: topology detection, ranks and communicators, and the ring-vs-tree algorithm choice.
- Reach for the right NCCL_* knob: HCA selection, interface pinning, GDR level, algorithm, buffer size.
- Say what UCX and SHARP add: transport auto-selection, and AllReduce offloaded onto the switch itself.
NCCL — the one you'll almost certainly run
NCCL is the most widely deployed GPU communication library in production AI. If you're training on NVIDIA GPUs — A100, H100, B200 — you're using it.
At initialisation it:
- Auto-detects the physical topology: NVLink > PCIe > InfiniBand > RoCE > TCP (fallback).
- Assigns each GPU a rank in a communicator group — like a router ID inside an OSPF area.
- Picks an algorithm per collective: ring for AllReduce on large tensors, tree for small tensors or very high node counts.
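
To make the ring choice concrete, here is a toy, single-process simulation of a ring AllReduce (sum): a reduce-scatter pass followed by an all-gather pass. It is a sketch of the general technique, not NCCL's implementation. The function name `ring_allreduce` and the requirement that the buffer length divide evenly by the number of ranks are assumptions made for illustration.

```python
def ring_allreduce(rank_buffers):
    """Simulate a ring AllReduce (sum) over one equal-length list per rank.

    Toy model only: every "send" is a list copy inside one process.
    """
    n = len(rank_buffers)
    length = len(rank_buffers[0])
    if length % n != 0:
        raise ValueError("toy version: buffer length must be a multiple of the rank count")
    chunk = length // n
    bufs = [list(b) for b in rank_buffers]  # copy so the inputs stay untouched

    def exchange(chunk_for_rank, combine):
        # Snapshot every outgoing chunk first, so all ranks "send at once".
        sends = []
        for r in range(n):
            lo = chunk_for_rank(r) * chunk
            sends.append((r, lo, bufs[r][lo:lo + chunk]))
        # Deliver each chunk to the next rank on the ring.
        for r, lo, data in sends:
            dst = bufs[(r + 1) % n]
            for i, value in enumerate(data):
                dst[lo + i] = combine(dst[lo + i], value)

    # Phase 1, reduce-scatter: after n-1 steps, rank r owns the fully
    # summed chunk (r + 1) % n.
    for s in range(n - 1):
        exchange(lambda r: (r - s) % n, lambda old, new: old + new)

    # Phase 2, all-gather: circulate the finished chunks, overwriting
    # stale data, until every rank holds the full result.
    for s in range(n - 1):
        exchange(lambda r: (r + 1 - s) % n, lambda old, new: new)

    return bufs


result = ring_allreduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
print(result[0])  # [111, 222, 333]
```

Each rank sends n-1 chunks in each phase, so per-rank traffic stays close to twice the buffer size however many ranks join the ring. The number of steps grows with n, though, which is the latency cost that makes a tree attractive for small tensors or very large node counts.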
The knobs worth knowing
| Environment variable | Effect |
|---|---|
| NCCL_DEBUG=INFO | Verbose logging, NCCL's equivalent of debug ip ospf events |
| NCCL_IB_HCA=mlx5_0 | Pin a specific InfiniBand HCA |
| NCCL_SOCKET_IFNAME=eth0 | Force the control path onto a specific NIC |
| NCCL_ALGO | Force the algorithm, for example Ring or Tree |
| NCCL_TOPO_FILE=/path/topo.xml | Supply a custom topology (like static routes) |
| NCCL_NET_GDR_LEVEL=5 | GPUDirect RDMA aggressiveness: the maximum NIC-to-GPU topological distance at which it is still used |
| NCCL_BUFFSIZE=4194304 | Communication buffer size in bytes (4 MiB here); tune latency vs throughput |
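
These variables have to be in the process environment before NCCL initialises, so in practice they are set by the launcher. A minimal sketch of that, using only the variables from the table: the helper name `build_nccl_env`, its defaults, and the `train.py` script in the comment are hypothetical placeholders.

```python
import os


def build_nccl_env(hca="mlx5_0", ifname="eth0", algo=None,
                   buffsize=4 * 1024 * 1024, debug="INFO", base=None):
    """Return a copy of an environment with NCCL_* variables applied.

    `base` defaults to the current process environment; pass a dict to
    start from something else (handy for testing).
    """
    env = dict(os.environ if base is None else base)
    env["NCCL_DEBUG"] = debug
    env["NCCL_IB_HCA"] = hca
    env["NCCL_SOCKET_IFNAME"] = ifname
    env["NCCL_BUFFSIZE"] = str(buffsize)  # environment values must be strings
    if algo is not None:
        env["NCCL_ALGO"] = algo  # leave unset to let NCCL choose
    return env


env = build_nccl_env(algo="Ring", base={})
for name in sorted(env):
    print(f"{name}={env[name]}")

# To launch a job with these settings (train.py is a placeholder):
#   import subprocess
#   subprocess.run(["python", "train.py"], env=build_nccl_env(), check=True)
```

Leaving NCCL_ALGO unset by default is deliberate: forcing an algorithm is a debugging tool, and NCCL's own choice is usually the better starting point.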
On AMD, RCCL is ABI-compatible with NCCL and reads the same NCCL_* variables.
NCCL_DEBUG, NCCL_IB_HCA, NCCL_IB_GID_INDEX all work unchanged — your RoCE tuning carries straight over.
Intel's oneCCL is the exception: it uses the CCL_* namespace instead.
NCCL_NET_GDR_LEVEL reaches into GPUDirect RDMA — the host-side config that lets the NIC DMA straight into GPU memory.
UCX — the transport picker underneath
UCX is the transport-abstraction layer beneath MPI, and increasingly beneath NCCL. It automatically selects the fastest available transport for each pair of endpoints:
- Shared memory — same node, fastest
- GPUDirect peer-to-peer (CUDA IPC), GPU to GPU over NVLink or PCIe, within a node
- InfiniBand Verbs — cross-node, near line rate
- RoCE / libfabric — cross-node, Ethernet-based
- TCP/IP — fallback, slow path
UCX behaves like CEF with several adjacency types in its FIB. It pre-computes the best reachability for each destination (every other GPU) and stores the transport method. When data needs to move, it hands off to the right transport immediately, with no per-packet lookup. Swap the fabric (IB → RoCE) and it re-evaluates transparently, like a CEF adjacency change.
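
The pre-compute-then-look-up idea can be sketched in a few lines. This is a toy model of the behaviour described above, not UCX's API: the transport labels, `pick_transport`, `build_transport_table`, and the peer names are all made up for illustration.

```python
# Preference order mirroring the list above: earlier entries win.
TRANSPORT_PRIORITY = ["shm", "gpu_direct", "ib_verbs", "roce", "tcp"]


def pick_transport(available):
    """Return the most preferred transport present in `available`."""
    for transport in TRANSPORT_PRIORITY:
        if transport in available:
            return transport
    raise ValueError("no usable transport for this peer")


def build_transport_table(reachability):
    """Pre-compute one transport per peer, like a FIB built ahead of time.

    `reachability` maps a peer name to the set of transports that can reach it.
    """
    return {peer: pick_transport(avail) for peer, avail in reachability.items()}


reach = {
    "same_node_peer": {"shm", "gpu_direct", "tcp"},
    "remote_peer": {"ib_verbs", "tcp"},
}
table = build_transport_table(reach)
print(table["remote_peer"])  # ib_verbs

# Swap the fabric from InfiniBand to RoCE and rebuild the table.
reach["remote_peer"] = {"roce", "tcp"}
table = build_transport_table(reach)
print(table["remote_peer"])  # roce
```

Sending then becomes a dictionary lookup on the pre-built table, which is the CEF-like property: the decision is made once per destination, not once per message.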
SHARP — AllReduce on the switch
SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) is a feature of NVIDIA/Mellanox InfiniBand switches that performs the AllReduce sum on the switch ASIC, not on the GPU.
- Instead of data hopping GPU → switch → GPU → switch → GPU, the switch reduces in place.
- It receives gradient chunks from every connected GPU, sums them on-chip, and returns the result in a single pass.
- Each GPU sends its data once and receives the result once, so the traffic it injects into the fabric per AllReduce drops to roughly half of what a ring AllReduce sends.
SHARP is like moving routing-table computation from the route processor to the line-card ASIC. Instead of punting every packet to the RP for a lookup, the line card decides at wire speed. SHARP moves the AllReduce sum from the GPU (the RP) to the IB switch ASIC (the line card) — latency drops, and the GPU is freed for actual model compute.
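
A back-of-the-envelope check on the data-volume saving, under an idealised model: a ring AllReduce has each rank send 2(N-1)/N times the buffer size, while in-network reduction has each rank send the buffer once. The function names are hypothetical, and the model ignores protocol overhead and the switch hierarchy, so treat the numbers as arithmetic, not measurements.

```python
def ring_bytes_sent_per_rank(size_bytes, n_ranks):
    """Bytes each rank sends in a ring AllReduce (reduce-scatter + all-gather)."""
    return 2 * (n_ranks - 1) * size_bytes / n_ranks


def in_network_bytes_sent_per_rank(size_bytes):
    """Bytes each rank sends when the switch does the sum (idealised)."""
    return size_bytes


def ring_to_in_network_ratio(size_bytes, n_ranks):
    """How many times more data a rank sends with ring than with in-network reduction."""
    return ring_bytes_sent_per_rank(size_bytes, n_ranks) / in_network_bytes_sent_per_rank(size_bytes)


one_gib = 1 << 30
print(ring_to_in_network_ratio(one_gib, 8))  # 1.75
```

The ratio approaches 2 as the rank count grows, which is where the "roughly half" figure comes from.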
💡 What you should remember
| # | | Concept | Why it matters |
|---|---|---|---|
| 1 | 🧭 | NCCL auto-detects topology and ranks at init | NVLink > PCIe > IB > RoCE > TCP, chosen for you |
| 2 | 🎛️ | NCCL_* is your config surface | HCA, interface, algorithm, GDR level, buffer size |
| 3 | 🧱 | UCX picks the transport; SHARP offloads the sum | One abstracts the fabric; the other moves AllReduce onto the switch |
Next: Topology Awareness → how NCCL treats NVLink, PCIe, and InfiniBand like routes with different administrative distances, and why it builds rack-local rings on a fat-tree.