
Inside the Libraries — NCCL, UCX & SHARP

Three libraries carry almost all production AI traffic. Here's what each one does, and the knobs you'll actually turn.

After this page, you'll be able to
  1. Read NCCL's init — topology detection, ranks and communicators, and the ring-vs-tree algorithm choice.
  2. Reach for the right NCCL_* knob — HCA selection, interface pinning, GDR level, algorithm, buffer size.
  3. Say what UCX and SHARP add — transport auto-selection, and AllReduce offloaded onto the switch itself.

NCCL — the one you'll almost certainly run

NCCL is the most widely deployed GPU communication library in production AI. If you're training on NVIDIA GPUs (A100, H100, B200), you're almost certainly using it.

At initialisation it:

  • Auto-detects the physical topology and prefers the fastest path available: NVLink > PCIe > InfiniBand > RoCE > TCP (fallback).
  • Assigns each GPU a rank in a communicator group — like a router ID inside an OSPF area.
  • Picks an algorithm per collective: ring for AllReduce on large tensors, tree for small tensors or very high node counts.
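Why message size and node count tip the choice can be seen with a deliberately simplified latency/bandwidth cost model. This is a sketch, not NCCL's actual tuning logic: the function names, the per-step latency `alpha`, and the per-byte cost `beta` are all assumptions made for illustration, and the tree here is unpipelined, unlike a production implementation.

```python
import math

def ring_allreduce_time(n, size_bytes, alpha, beta):
    """Ring: 2*(n-1) steps, each moving 1/n of the buffer."""
    steps = 2 * (n - 1)
    return steps * (alpha + (size_bytes / n) * beta)

def tree_allreduce_time(n, size_bytes, alpha, beta):
    """Unpipelined binary tree: reduce up, broadcast down,
    2*ceil(log2 n) steps, each moving the whole buffer."""
    steps = 2 * math.ceil(math.log2(n))
    return steps * (alpha + size_bytes * beta)

def pick_algorithm(n, size_bytes, alpha=5e-6, beta=1 / 25e9):
    # alpha: assumed 5 us per step; beta: assumed 25 GB/s link.
    ring = ring_allreduce_time(n, size_bytes, alpha, beta)
    tree = tree_allreduce_time(n, size_bytes, alpha, beta)
    return "ring" if ring <= tree else "tree"

small = pick_algorithm(n=64, size_bytes=1024)      # latency-bound message
large = pick_algorithm(n=64, size_bytes=1 << 30)   # bandwidth-bound message
print(small, large)
```

In this model the ring pays a latency cost that grows linearly with the number of ranks but moves only a slice of the buffer per step, while the tree pays a logarithmic number of steps; that is the intuition behind tree for small tensors or many nodes and ring for large tensors.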

The knobs worth knowing

Environment variable | Effect
NCCL_DEBUG=INFO | Verbose logging: the "debug ip ospf events" of NCCL
NCCL_IB_HCA=mlx5_0 | Pin a specific InfiniBand HCA
NCCL_SOCKET_IFNAME=eth0 | Force the control path (bootstrap and socket traffic) onto a specific NIC
NCCL_ALGO | Force the algorithm, for example Ring or Tree
NCCL_TOPO_FILE=/path/topo.xml | Supply a custom topology (like static routes)
NCCL_NET_GDR_LEVEL=5 | How far apart the NIC and GPU may sit in the topology for GPUDirect RDMA to still be used; higher values are more permissive
NCCL_BUFFSIZE=4194304 | Communication buffer size in bytes (4194304 = 4 MiB): tune latency vs throughput

RCCL reuses every NCCL_* name

On AMD, RCCL is API-compatible with NCCL and reads the same NCCL_* variables. NCCL_DEBUG, NCCL_IB_HCA, and NCCL_IB_GID_INDEX all work unchanged, so your RoCE tuning carries straight over. Intel's oneCCL is the exception: it uses the CCL_* namespace instead.

NCCL_NET_GDR_LEVEL reaches into GPUDirect RDMA — the host-side config that lets the NIC DMA straight into GPU memory.
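Because these knobs are plain environment variables, they have to be in place before the training framework creates its communicator. A minimal sketch of one way to assemble them in Python follows; the helper `nccl_env` is hypothetical (it is not part of NCCL or any framework), and the device names `mlx5_0` and `eth0` are just the example values from the table.

```python
import os

def nccl_env(hca="mlx5_0", ifname="eth0", algo=None,
             gdr_level=None, buffsize=None, debug=False):
    """Hypothetical helper: build a dict of NCCL_* settings as strings."""
    env = {"NCCL_IB_HCA": hca, "NCCL_SOCKET_IFNAME": ifname}
    if debug:
        env["NCCL_DEBUG"] = "INFO"
    if algo is not None:
        # Only the two values discussed on this page are accepted here.
        if algo not in ("Ring", "Tree"):
            raise ValueError(f"unexpected algorithm: {algo}")
        env["NCCL_ALGO"] = algo
    if gdr_level is not None:
        env["NCCL_NET_GDR_LEVEL"] = str(gdr_level)
    if buffsize is not None:
        env["NCCL_BUFFSIZE"] = str(int(buffsize))
    return env

settings = nccl_env(algo="Ring", buffsize=4 * 1024 * 1024, debug=True)
# Apply before the framework initialises its communicator.
os.environ.update(settings)
for name in sorted(settings):
    print(f"{name}={settings[name]}")
```

Start with NCCL_DEBUG=INFO alone and read what NCCL chose on its own; force an algorithm or buffer size only once the log shows a choice you want to override.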


UCX — the transport picker underneath

UCX is the transport-abstraction layer beneath MPI, and increasingly beneath NCCL. It automatically selects the fastest available transport for each pair of endpoints:

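  • Shared memory: same node, fastest
  • CUDA IPC (GPUDirect peer-to-peer): GPU to GPU over NVLink or PCIe, within a node
  • InfiniBand Verbs: cross-node, near line rate, with GPUDirect RDMA when the NIC can reach GPU memory directly
  • RoCE: cross-node, Ethernet-based, driven through the same verbs transports
  • TCP/IP: fallback, slow path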
📡 Analogy — CEF with multiple adjacency types

UCX behaves like CEF with several adjacency types in its FIB. It pre-computes the best reachability for each destination (every other GPU) and stores the transport method. When data needs to move, it uses the stored transport immediately, with no per-message transport negotiation. Swap the fabric (IB → RoCE) and it re-evaluates transparently, like a CEF adjacency change.
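The "pre-compute once, look up afterwards" behaviour described above can be sketched as a toy table builder. This is not UCX's API or its real selection logic: the transport labels (`shm`, `gpu_p2p`, `ib_verbs`, `roce`, `tcp`), the endpoint dictionaries, and both functions are assumptions chosen to mirror the preference list, with `gpu_p2p` standing in for the intra-node NVLink/PCIe path.

```python
from itertools import permutations

# Fastest first. Illustrative labels only, not UCX transport names.
PREFERENCE = ["shm", "gpu_p2p", "ib_verbs", "roce", "tcp"]
INTRA_NODE_ONLY = {"shm", "gpu_p2p"}

def select_transport(a, b):
    """Return the most preferred transport both endpoints can use."""
    for transport in PREFERENCE:
        if transport in INTRA_NODE_ONLY and a["node"] != b["node"]:
            continue  # these paths never leave the node
        if transport in a["caps"] and transport in b["caps"]:
            return transport
    raise RuntimeError("no common transport")

def build_table(endpoints):
    """Pre-compute one entry per ordered (source, destination) pair."""
    return {
        (src, dst): select_transport(endpoints[src], endpoints[dst])
        for src, dst in permutations(endpoints, 2)
    }

ib_caps = {"shm", "gpu_p2p", "ib_verbs", "tcp"}
endpoints = {
    "gpu0": {"node": "A", "caps": set(ib_caps)},
    "gpu1": {"node": "A", "caps": set(ib_caps)},
    "gpu2": {"node": "B", "caps": set(ib_caps)},
}
table = build_table(endpoints)

# Swap the fabric from InfiniBand to RoCE and rebuild the table.
for ep in endpoints.values():
    ep["caps"] = (ep["caps"] - {"ib_verbs"}) | {"roce"}
table_after_swap = build_table(endpoints)
print(table[("gpu0", "gpu2")], "->", table_after_swap[("gpu0", "gpu2")])
```

Sending data is then a dictionary lookup rather than a fresh decision, and a fabric change only requires rebuilding the table, which is the adjacency-change behaviour the analogy points at.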


SHARP — AllReduce on the switch

SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) is a feature of NVIDIA/Mellanox InfiniBand switches that performs the AllReduce sum on the switch ASIC, not on the GPU.

  • Instead of data hopping GPU → switch → GPU → switch → GPU, the switch reduces in place.
  • It receives gradient chunks from every connected GPU, sums them on-chip, and returns the result in a single pass.
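  • Each GPU then sends its buffer into the fabric once, where a ring AllReduce sends about 2(N-1)/N times the buffer size per GPU, so per-GPU traffic for every AllReduce is roughly halved.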
📡 Analogy — moving compute to the line card

SHARP is like moving routing-table computation from the route processor to the line-card ASIC. Instead of punting every packet to the RP for a lookup, the line card decides at wire speed. SHARP moves the AllReduce sum from the GPU (the RP) to the IB switch ASIC (the line card) — latency drops, and the GPU is freed for actual model compute.
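The traffic saving from reducing on the switch can be checked with a little arithmetic and a toy reduction. The sketch below assumes a single switch with every GPU attached directly and uses hypothetical function names; a real SHARP deployment builds an aggregation tree across several switches, which this does not model.

```python
def ring_bytes_sent_per_gpu(n, size_bytes):
    """Ring AllReduce: each GPU sends 2*(n-1)/n of its buffer in total."""
    return 2 * (n - 1) / n * size_bytes

def in_switch_bytes_sent_per_gpu(size_bytes):
    """In-switch reduction: each GPU sends its buffer up exactly once."""
    return size_bytes

def switch_reduce(chunks):
    """Sum the chunks element-wise, as the switch would, and
    return one copy of the result for every contributing GPU."""
    total = [sum(values) for values in zip(*chunks)]
    return [list(total) for _ in chunks]

n_gpus, buffer_bytes = 8, 1000
ring_sent = ring_bytes_sent_per_gpu(n_gpus, buffer_bytes)
switch_sent = in_switch_bytes_sent_per_gpu(buffer_bytes)
results = switch_reduce([[1, 2], [3, 4], [5, 6]])
print(ring_sent, switch_sent, results[0])
```

As the GPU count grows, the ring figure approaches twice the buffer size while the in-switch figure stays at one buffer, which is where the "roughly half the traffic" estimate comes from.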


💡 What you should remember

# | Concept | Why it matters
1 | 🧭 NCCL auto-detects topology and ranks at init | NVLink > PCIe > IB > RoCE > TCP, chosen for you
2 | 🎛️ NCCL_* is your config surface | HCA, interface, algorithm, GDR level, buffer size
3 | 🧱 UCX picks the transport; SHARP offloads the sum | One abstracts the fabric; the other moves AllReduce onto the switch

Next: Topology Awareness → — how NCCL treats NVLink, PCIe, and InfiniBand like routes with different administrative distances, and why it builds rack-local rings on a fat-tree.