Inside the Libraries — NCCL, UCX & SHARP

Three libraries carry almost all production AI traffic. Here's what each one does, and the knobs you'll actually turn.

After this page, you'll be able to

Read NCCL's init — topology detection, ranks and communicators, and the ring-vs-tree algorithm choice.
Reach for the right NCCL_* knob — HCA selection, interface pinning, GDR level, algorithm, buffer size.
Say what UCX and SHARP add — transport auto-selection, and AllReduce offloaded onto the switch itself.

NCCL — the one you'll almost certainly run

NCCL is the most widely deployed GPU communication library in production AI. If you're training on NVIDIA GPUs — A100, H100, B200 — you're using it.

At initialisation it:

Auto-detects the physical topology: NVLink > PCIe > InfiniBand > RoCE > TCP (fallback).
Assigns each GPU a rank in a communicator group — like a router ID inside an OSPF area.
Picks an algorithm per collective: ring for AllReduce on large tensors, tree for small tensors or very high node counts.
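The ring-versus-tree trade-off above can be sketched with a textbook alpha-beta cost model. This is only an illustration, not NCCL's internal tuner: the function names, the 5 µs per-step latency, and the 25 GB/s link bandwidth are all assumptions picked for the example.

```python
import math


def ring_allreduce_time(n_ranks, size_bytes, alpha_s, bw_bytes_per_s):
    """Ring estimate: 2*(N-1) steps, each moving a chunk of S/N bytes."""
    if n_ranks < 2:
        return 0.0
    steps = 2 * (n_ranks - 1)
    return steps * alpha_s + steps * (size_bytes / n_ranks) / bw_bytes_per_s


def tree_allreduce_time(n_ranks, size_bytes, alpha_s, bw_bytes_per_s):
    """Pipelined tree estimate: reduce up the tree, then broadcast back down."""
    if n_ranks < 2:
        return 0.0
    depth = math.ceil(math.log2(n_ranks))
    return 2 * depth * alpha_s + 2 * size_bytes / bw_bytes_per_s


def pick_algorithm(n_ranks, size_bytes, alpha_s=5e-6, bw_bytes_per_s=25e9):
    """Return the cheaper algorithm under this toy model (assumed constants)."""
    ring = ring_allreduce_time(n_ranks, size_bytes, alpha_s, bw_bytes_per_s)
    tree = tree_allreduce_time(n_ranks, size_bytes, alpha_s, bw_bytes_per_s)
    return "Ring" if ring <= tree else "Tree"


print(pick_algorithm(8, 1e9))      # large tensor, few ranks
print(pick_algorithm(1024, 1024))  # tiny tensor, many ranks
```

In this model the ring's latency term grows linearly with the rank count while the tree's grows logarithmically, which is why small messages at scale favour the tree.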

The knobs worth knowing

Environment variable	Effect
`NCCL_DEBUG=INFO`	Verbose logging — the `debug ip ospf events` of NCCL
`NCCL_IB_HCA=mlx5_0`	Pin a specific RDMA HCA (InfiniBand or RoCE)
`NCCL_SOCKET_IFNAME=eth0`	Force bootstrap traffic, and data if NCCL falls back to sockets, onto a specific IP interface
`NCCL_ALGO`	Force the algorithm — `Ring` or `Tree`
`NCCL_TOPO_FILE=/path/topo.xml`	Supply a custom topology (like static routes)
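`NCCL_NET_GDR_LEVEL=5`	Maximum topological distance between NIC and GPU at which GPUDirect RDMA is still used; higher values allow more distant pairs
`NCCL_BUFFSIZE=4194304`	Communication buffer size in bytes (4 MiB shown) — tune latency vs throughput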
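NCCL reads these variables when the communicator is created, so they have to be in the environment before the training process initialises it. A minimal sketch of building that environment from Python follows; the helper name `build_nccl_env` and the `mlx5_0` / `eth0` values are placeholders for this example, not recommendations.

```python
import os


def build_nccl_env(hca, ifname, debug="INFO"):
    """Return a copy of the current environment with NCCL settings applied."""
    env = dict(os.environ)
    env.update({
        "NCCL_DEBUG": debug,           # verbose init logging
        "NCCL_IB_HCA": hca,            # which HCA NCCL may use
        "NCCL_SOCKET_IFNAME": ifname,  # which IP interface to use
    })
    return env


env = build_nccl_env(hca="mlx5_0", ifname="eth0")

# Show only the NCCL-related entries that a launcher would inherit.
print({k: v for k, v in env.items() if k.startswith("NCCL_")})
```

The resulting dictionary can be handed to whatever launches the workers, for example through the `env=` argument of `subprocess.run`.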

RCCL reuses every NCCL_* name

On AMD, RCCL is API-compatible with NCCL and reads the same NCCL_* variables. NCCL_DEBUG, NCCL_IB_HCA, NCCL_IB_GID_INDEX all work unchanged, so your RoCE tuning carries straight over. Intel's oneCCL is the exception: it uses the CCL_* namespace instead.

NCCL_NET_GDR_LEVEL governs GPUDirect RDMA, the mechanism that lets the NIC DMA straight into GPU memory instead of staging data through host memory.

UCX — the transport picker underneath

UCX is the transport-abstraction layer beneath MPI, and increasingly beneath NCCL. It automatically selects the fastest available transport for each pair of endpoints:

Shared memory — same node, fastest
CUDA IPC (GPUDirect peer-to-peer) — NVLink or PCIe, GPU to GPU within a node
InfiniBand Verbs — cross-node, near line rate
RoCE — cross-node, Ethernet-based, driven through the same verbs interface
TCP/IP — fallback, slow path
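The selection logic in the list above can be sketched as a priority walk per endpoint pair. This is a toy model rather than UCX's actual implementation: the transport labels are illustrative, not UCX identifiers, and the function names are hypothetical.

```python
# Preference order from the list above, fastest first (labels are illustrative).
PREFERENCE = ["shared_memory", "intra_node_gpu", "ib_verbs", "roce", "tcp"]
INTRA_NODE_ONLY = {"shared_memory", "intra_node_gpu"}


def select_transport(same_node, available):
    """Pick the first preferred transport that is available and applicable."""
    for transport in PREFERENCE:
        if transport not in available:
            continue
        if transport in INTRA_NODE_ONLY and not same_node:
            continue
        return transport
    raise RuntimeError("no usable transport for this endpoint pair")


def build_transport_table(node_of, available):
    """Precompute the transport for every ordered pair of endpoints."""
    return {
        (a, b): select_transport(node_of[a] == node_of[b], available)
        for a in node_of
        for b in node_of
        if a != b
    }


node_of = {"gpu0": "nodeA", "gpu1": "nodeA", "gpu2": "nodeB"}
table = build_transport_table(node_of, {"shared_memory", "roce", "tcp"})
print(table[("gpu0", "gpu1")])
print(table[("gpu0", "gpu2")])
```

Because the table is built once up front, sending data later is a dictionary lookup rather than a fresh decision, and changing the `available` set and rebuilding is all it takes to model a fabric swap.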

📡 Analogy — CEF with multiple adjacency types

UCX behaves like CEF with several adjacency types in its FIB. It pre-computes the best reachability for each destination (every other GPU) and stores the transport method. When data needs to move, it hands it to the right transport immediately, with no per-packet lookup and no punt to a slow path. Swap the fabric (IB → RoCE) and it re-evaluates transparently, like a CEF adjacency change.

SHARP — AllReduce on the switch

SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) is a feature of NVIDIA/Mellanox InfiniBand switches that performs the AllReduce sum on the switch ASIC, not on the GPU.

Instead of data hopping GPU → switch → GPU → switch → GPU, the switch reduces in place.
It receives gradient chunks from every connected GPU, sums them on-chip, and returns the result in a single pass.
Compared with a ring AllReduce, that roughly halves the data each GPU has to push through the fabric for every AllReduce.
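One way to see where the factor of about two comes from is to count the bytes each GPU sends. The sketch below assumes the standard ring AllReduce cost of 2·(N−1)/N·S bytes per GPU and an idealised in-switch reduction in which each GPU sends its S bytes once; the function names are hypothetical and protocol overheads are ignored.

```python
def ring_bytes_sent_per_gpu(n_gpus, size_bytes):
    """Ring AllReduce: each GPU sends 2*(N-1)/N * S bytes in total."""
    return 2 * (n_gpus - 1) / n_gpus * size_bytes


def in_network_bytes_sent_per_gpu(size_bytes):
    """Idealised in-switch reduction: each GPU sends its S bytes once."""
    return size_bytes


size = 1e9  # a 1 GB gradient buffer, chosen only for illustration
for n in (2, 8, 64, 1024):
    ratio = ring_bytes_sent_per_gpu(n, size) / in_network_bytes_sent_per_gpu(size)
    print(f"N={n:5d}  ring / in-network bytes sent = {ratio:.3f}")
```

The ratio starts at 1 for two GPUs and approaches 2 as the GPU count grows, so the saving is largest on big jobs.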

📡 Analogy — moving compute to the line card

SHARP is like moving routing-table computation from the route processor to the line-card ASIC. Instead of punting every packet to the RP for a lookup, the line card decides at wire speed. SHARP moves the AllReduce sum from the GPU (the RP) to the IB switch ASIC (the line card) — latency drops, and the GPU is freed for actual model compute.

💡 What you should remember

#		Concept	Why it matters
1	🧭	NCCL auto-detects topology and ranks at init	NVLink > PCIe > IB > RoCE > TCP, chosen for you
2	🎛️	`NCCL_*` is your config surface	HCA, interface, algorithm, GDR level, buffer size
3	🧱	UCX picks the transport; SHARP offloads the sum	One abstracts the fabric; the other moves AllReduce onto the switch

Next: Topology Awareness → — how NCCL treats NVLink, PCIe, and InfiniBand like routes with different administrative distances, and why it builds rack-local rings on a fat-tree.
