The Library Landscape
"Communication library" isn't one thing. It's four layers of them, stacked.
- Tell the four categories apart — GPU collective libraries, MPI, transport-abstraction layers, and framework-native APIs.
- Place the big names — where NCCL, MPI, UCX, and PyTorch Distributed sit relative to each other.
- Map it onto OSI — layers 1–7 for AI comms, and why layer 2 has to be lossless.
1. GPU collective libraries — the workhorse
Purpose-built for GPU-to-GPU communication during training. They implement the collective operations (AllReduce, Broadcast, AllGather…) and are tightly coupled to GPU memory and the interconnect.
| Library | Vendor / target | Key characteristic |
|---|---|---|
| NCCL | NVIDIA / CUDA GPUs | Ring AllReduce, NVLink-aware, dominant in production |
| RCCL | AMD / ROCm GPUs | NCCL-compatible API, MI300X-optimised |
| oneCCL | Intel / CPUs and Intel GPUs | Built on Intel MPI and libfabric; Gaudi uses the separate HCCL library |
| MSCCL | Microsoft (NCCL fork) | Custom topology schedules, Azure HPC clusters |
NCCL is the Cisco IOS of GPU clusters: dominant, battle-tested, deeply hardware-integrated. RCCL is the Junos equivalent — compatible API (think NETCONF), different silicon underneath. oneCCL is the open-source VyOS: flexible, vendor-neutral, not always the fastest.
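To make "Ring AllReduce" from the table concrete, here is a minimal pure-Python simulation of the two phases (reduce-scatter, then all-gather) over a logical ring. It is a teaching sketch, not how NCCL is implemented: the function name `ring_allreduce` and the list-of-lists "ranks" are assumptions for illustration. Real libraries run these steps as GPU kernels and choose among several algorithms.

```python
def ring_allreduce(buffers):
    """Simulate a sum AllReduce over a logical ring of ranks.

    buffers[r] is rank r's local vector. All vectors must have the same
    length, and that length must be divisible by the number of ranks.
    Returns one fully reduced vector per rank.
    """
    n = len(buffers)
    length = len(buffers[0])
    if any(len(b) != length for b in buffers) or length % n != 0:
        raise ValueError("need equal-length vectors, divisible by rank count")
    chunk = length // n
    data = [list(b) for b in buffers]  # work on copies

    def span(c):
        # Index range covered by chunk c.
        return range(c * chunk, (c + 1) * chunk)

    def exchange(chunk_for_rank, combine):
        # Snapshot every rank's outgoing chunk first, so all sends in a
        # step happen "simultaneously", then deliver to the right neighbour.
        sends = []
        for r in range(n):
            c = chunk_for_rank(r)
            sends.append((c, [data[r][i] for i in span(c)]))
        for r, (c, payload) in enumerate(sends):
            dst = (r + 1) % n
            for i, value in zip(span(c), payload):
                data[dst][i] = combine(data[dst][i], value)

    # Phase 1: reduce-scatter. After n-1 steps, rank r holds the complete
    # sum for chunk (r + 1) % n.
    for step in range(n - 1):
        exchange(lambda r: (r - step) % n, lambda mine, theirs: mine + theirs)

    # Phase 2: all-gather. Each finished chunk travels once around the ring,
    # overwriting the partial sums it meets.
    for step in range(n - 1):
        exchange(lambda r: (r + 1 - step) % n, lambda mine, theirs: theirs)

    return data


result = ring_allreduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
# Every rank ends with [111, 222, 333].
```

Each rank only ever talks to its right-hand neighbour, which is why the pattern maps so cleanly onto a physical ring of links.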
2. MPI — the standard
The Message Passing Interface is the original HPC communication standard, dating to 1994. It defines a universal API for point-to-point and collective communication between processes — on CPUs, across nodes, over any fabric.
- Think of MPI as the IETF of HPC comms: it defines the standard; implementations (Open MPI, MPICH, Intel MPI) provide the code.
- MPI is process-centric — each process gets a rank (like a router ID), and you address messages to ranks.
- In modern AI it runs alongside NCCL: MPI handles process launch and CPU-side coordination; NCCL moves the GPU data.
MPI is your control plane (BGP): it establishes the topology, assigns ranks, coordinates barriers. NCCL is your data plane (MPLS): it moves the data at line rate. Many large clusters run both, although NCCL does not strictly require MPI: launchers such as PyTorch's torchrun can bootstrap ranks without it.
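The rank-addressed model is easier to see in code. The sketch below is not MPI: it uses threads and in-memory queues to stand in for processes and a fabric, and `run_ranks`, `ring_hello`, `send`, and `recv` are hypothetical names. In real MPI the corresponding calls are `MPI_Comm_rank`, `MPI_Send`, `MPI_Recv`, and `MPI_Barrier`.

```python
import queue
import threading


def run_ranks(world_size, target):
    """Hypothetical launcher: start one thread per rank and collect results."""
    mailboxes = [queue.Queue() for _ in range(world_size)]
    barrier = threading.Barrier(world_size)
    results = [None] * world_size

    def worker(rank):
        def send(dst, payload):
            # Messages are addressed to a rank, never to a host or port.
            mailboxes[dst].put((rank, payload))

        def recv():
            return mailboxes[rank].get(timeout=5)

        results[rank] = target(rank, world_size, send, recv, barrier)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def ring_hello(rank, world_size, send, recv, barrier):
    # Control-plane step: wait until every rank is up before data moves.
    barrier.wait()
    send((rank + 1) % world_size, f"hello from rank {rank}")
    source, payload = recv()
    return source, payload


results = run_ranks(4, ring_hello)
# results[r] is (sender_rank, message) as seen by rank r.
```

The barrier plays the control-plane role described above: nothing is sent until the whole group has checked in.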
3. Transport-abstraction libraries
These sit below the collective libraries and abstract over the hardware fabric. Write one piece of code; run it over InfiniBand, RoCE, TCP, or shared memory without rewriting the transport logic.
| Library | What it abstracts |
|---|---|
| UCX (Unified Communication X) | IB Verbs, RoCE, TCP, shared memory — one API for all |
| libfabric (OFI) | OpenFabrics Interface; Slingshot, EFA, PSM2, TCP |
| Gloo (Meta) | CPU collectives over TCP/Ethernet; PyTorch's fallback |
| SHARP (NVIDIA/Mellanox) | In-network compute — AllReduce on the IB switch itself |
UCX is to HPC networking what OpenConfig is to network management: a vendor-neutral abstraction layer. Just as OpenConfig configures a Juniper and a Cisco with one data model, UCX lets NCCL or MPI ride InfiniBand, RoCE, or plain TCP through the same API call.
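The "one API, many fabrics" idea can be sketched as an interface plus a selection step. Everything here is hypothetical and far simpler than what UCX or libfabric actually do: the `Transport` classes, the priority numbers, and the `fabrics` set are illustration only, and no real bytes are moved.

```python
from abc import ABC, abstractmethod


class Transport(ABC):
    """Hypothetical transport interface: one send() call, many fabrics."""

    name = "abstract"
    priority = 0  # higher wins when several transports are usable

    @abstractmethod
    def usable(self, src_host, dst_host, fabrics):
        """Return True if this transport can carry src_host -> dst_host."""

    def send(self, payload):
        # A real backend would post a work request or write to a socket here.
        return f"{len(payload)} bytes via {self.name}"


class SharedMemory(Transport):
    name = "shm"
    priority = 30

    def usable(self, src_host, dst_host, fabrics):
        return src_host == dst_host


class Rdma(Transport):
    name = "rdma"
    priority = 20

    def usable(self, src_host, dst_host, fabrics):
        return "rdma" in fabrics


class Tcp(Transport):
    name = "tcp"
    priority = 10

    def usable(self, src_host, dst_host, fabrics):
        return True  # the fallback that always works


TRANSPORTS = [SharedMemory(), Rdma(), Tcp()]


def send(src_host, dst_host, payload, fabrics=frozenset()):
    """Caller-facing API: the same call regardless of the fabric underneath."""
    candidates = [t for t in TRANSPORTS if t.usable(src_host, dst_host, fabrics)]
    best = max(candidates, key=lambda t: t.priority)
    return best.send(payload)
```

The caller writes `send(...)` once; whether the bytes ride shared memory, RDMA, or TCP is decided below the API, which is the property the collective libraries rely on.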
4. Framework-native communication
Modern AI frameworks ship their own communication abstractions that sit on top of NCCL or MPI.
| Framework / component | What it does |
|---|---|
| PyTorch Distributed (torch.distributed) | Python API for collectives; backend = NCCL, Gloo, or MPI |
| DeepSpeed ZeRO | Shards optimiser states + gradients; uses NCCL underneath |
| Megatron-LM | Tensor + pipeline parallelism; custom collective scheduling on NCCL |
| JAX pjit / shard_map | XLA compiler-level distribution; NCCL/RCCL via XLA collective ops |
This is the layer the ML engineer touches. Everything below it is the layer you touch.
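The backend choice in the PyTorch row usually comes down to a few lines of logic like the following. The helper `pick_backend` and its parameters are hypothetical; only the returned strings (`"nccl"`, `"gloo"`, `"mpi"`) are the real backend names that `torch.distributed` accepts. The sketch is kept free of PyTorch imports so it runs anywhere.

```python
def pick_backend(has_gpu, nccl_available, prefer_mpi=False):
    """Hypothetical helper: choose which torch.distributed backend to request.

    has_gpu        -- True if this process will hold tensors on a GPU
    nccl_available -- True if the PyTorch build includes NCCL support
    prefer_mpi     -- opt in to the MPI backend when NCCL is not an option
    """
    if has_gpu and nccl_available:
        return "nccl"  # GPU collectives over NVLink / InfiniBand / RoCE
    if prefer_mpi:
        return "mpi"   # requires a PyTorch build with MPI support
    return "gloo"      # CPU collectives over TCP, the usual fallback


backend = pick_backend(has_gpu=True, nccl_available=True)
```

In a real job the two flags would typically come from `torch.cuda.is_available()` and `torch.distributed.is_nccl_available()`, and the result would be passed as `backend=` to `torch.distributed.init_process_group`.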
The whole stack, mapped to OSI
If you think in layers, here's the exact OSI equivalent for AI/HPC communication.
InfiniBand and RoCEv2 demand a lossless fabric because RDMA handles loss poorly. Classic reliable-connection transport recovers with go-back-N retransmission, so a single dropped packet stalls the entire Queue Pair while everything after the gap is resent: TCP head-of-line blocking, but with a far more expensive recovery path. That's why you configure PFC on RoCE and why IB uses credit-based flow control at layer 2. The library assumes the fabric is lossless. Making that assumption hold is the network engineer's job.
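Credit-based flow control is easy to model: the sender may transmit only while it holds a credit, and each credit stands for one free slot in the receiver's buffer. The toy simulation below compares that against a sender that transmits blindly. It is a single-link sketch under simplified assumptions (one packet per tick, hypothetical parameter names), not the InfiniBand specification's mechanism in detail, and PFC reaches the same goal differently, with per-priority pause frames.

```python
def simulate_link(num_packets, rx_slots, drain_every, use_credits):
    """Toy model of one link feeding a receiver with `rx_slots` buffer slots.

    The receiver frees one slot every `drain_every` ticks. With
    use_credits=True the sender waits for a credit before sending;
    otherwise it sends every tick and overflow is dropped.
    """
    credits = rx_slots   # one credit per free receive slot
    occupancy = 0        # packets currently sitting in the receive buffer
    sent = delivered = dropped = stalled_ticks = 0
    tick = 0
    while sent < num_packets or occupancy > 0:
        tick += 1
        # Receiver: drain one packet on schedule and hand the credit back.
        if occupancy > 0 and tick % drain_every == 0:
            occupancy -= 1
            delivered += 1
            if use_credits:
                credits += 1
        # Sender: with credits enabled, never transmit without one.
        if sent < num_packets:
            if use_credits and credits == 0:
                stalled_ticks += 1  # back-pressure instead of loss
                continue
            sent += 1
            if use_credits:
                credits -= 1
            if occupancy < rx_slots:
                occupancy += 1
            else:
                dropped += 1        # buffer overflow: the packet is lost
    return {"delivered": delivered, "dropped": dropped, "stalled_ticks": stalled_ticks}


lossless = simulate_link(num_packets=20, rx_slots=4, drain_every=3, use_credits=True)
lossy = simulate_link(num_packets=20, rx_slots=4, drain_every=3, use_credits=False)
```

With credits the sender stalls but nothing is dropped; without them the sender never waits and the receiver overflows. That trade, back-pressure instead of loss, is what "lossless at layer 2" means in practice.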
💡 What you should remember
| # | | Concept | Why it matters |
|---|---|---|---|
| 1 | 🏭 | GPU collective libraries are the workhorse | NCCL / RCCL / oneCCL — same API family, different silicon |
| 2 | 🔀 | MPI + NCCL = control plane + data plane | MPI launches and coordinates; NCCL moves the bytes |
| 3 | 🧱 | UCX / libfabric abstract the fabric | One API over IB, RoCE, TCP, or shared memory |
| 4 | 📐 | The collective library lives at OSI 5–6 | Transport (4) is RDMA; the lossless job lives at layer 2 |
Next: Collective Operations, the Routing Protocols of AI → — AllReduce, Broadcast, AllGather, ReduceScatter, AllToAll and friends, each mapped to a protocol you already run.