The Library Landscape

"Communication library" isn't one thing. It's four layers of them, stacked.

After this page, you'll be able to
  1. Tell the four categories apart — GPU collective libraries, MPI, transport-abstraction layers, and framework-native APIs.
  2. Place the big names — where NCCL, MPI, UCX, and PyTorch Distributed sit relative to each other.
  3. Map it onto OSI — layers 1–7 for AI comms, and why layer 2 has to be lossless.

1. GPU collective libraries — the workhorse

Purpose-built for GPU-to-GPU communication during training. They implement the collective operations (AllReduce, Broadcast, AllGather…) and are tightly coupled to GPU memory and the interconnect.

Library | Vendor / target | Key characteristic
NCCL | NVIDIA / CUDA GPUs | Ring and tree AllReduce, NVLink-aware, dominant in production
RCCL | AMD / ROCm GPUs | NCCL-compatible API, MI300X-optimised
oneCCL | Intel / Xeon CPUs, Intel GPUs | Built on Intel MPI and libfabric (Gaudi ships its own library, HCCL)
MSCCL | Microsoft (NCCL fork) | Custom topology schedules, Azure HPC clusters

📡 Analogy — vendor routing stacks

NCCL is the Cisco IOS of GPU clusters: dominant, battle-tested, deeply hardware-integrated. RCCL is the Junos equivalent — compatible API (think NETCONF), different silicon underneath. oneCCL is the open-source VyOS: flexible, vendor-neutral, not always the fastest.
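
To make "Ring AllReduce" concrete, here is a minimal single-process simulation of the two phases of a ring-based sum (reduce-scatter, then all-gather). It is a teaching sketch, not NCCL's implementation: the function names are hypothetical, the "ranks" are plain Python lists, and real libraries also choose tree or other schedules depending on message size and topology.

```python
from typing import List


def ring_allreduce(buffers: List[List[float]]) -> List[List[float]]:
    """Simulate a sum-AllReduce across len(buffers) ranks arranged in a ring.

    Each rank's buffer is split into n chunks. Rank r only ever sends to
    rank (r + 1) % n, which is what makes this a ring.
    """
    n = len(buffers)
    size = len(buffers[0])
    if size % n != 0:
        raise ValueError("buffer length must be divisible by the number of ranks")
    chunk = size // n
    data = [list(b) for b in buffers]  # work on copies

    def span(c: int) -> range:
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1, reduce-scatter: after n-1 steps rank r holds the fully
    # summed chunk (r + 1) % n.
    for step in range(n - 1):
        # Snapshot every rank's outgoing chunk first, so all sends in a
        # step happen "simultaneously".
        sends = []
        for r in range(n):
            c = (r - step) % n
            sends.append((c, [data[r][i] for i in span(c)]))
        for r, (c, payload) in enumerate(sends):
            dst = (r + 1) % n
            for off, i in enumerate(span(c)):
                data[dst][i] += payload[off]

    # Phase 2, all-gather: circulate the finished chunks around the ring.
    for step in range(n - 1):
        sends = []
        for r in range(n):
            c = (r + 1 - step) % n
            sends.append((c, [data[r][i] for i in span(c)]))
        for r, (c, payload) in enumerate(sends):
            dst = (r + 1) % n
            for off, i in enumerate(span(c)):
                data[dst][i] = payload[off]

    return data


def ring_bytes_sent_per_rank(n_ranks: int, buffer_bytes: float) -> float:
    """Bytes each rank transmits: 2 phases x (n-1) steps x one chunk."""
    return 2 * (n_ranks - 1) * buffer_bytes / n_ranks


if __name__ == "__main__":
    ranks = [[1, 2, 3, 4], [10, 20, 30, 40],
             [100, 200, 300, 400], [1000, 2000, 3000, 4000]]
    for r, buf in enumerate(ring_allreduce(ranks)):
        print(f"rank {r}: {buf}")
```

The point for a network engineer is the traffic pattern: every rank talks only to its ring neighbour, and the bytes each rank sends approach twice the buffer size as the ring grows, no matter how many ranks join.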


2. MPI — the standard

The Message Passing Interface is the original HPC communication standard, dating to 1994. It defines a universal API for point-to-point and collective communication between processes — on CPUs, across nodes, over any fabric.

  • Think of MPI as an IETF-style standard for HPC comms: the MPI Forum defines the specification; implementations (Open MPI, MPICH, Intel MPI) provide the code.
  • MPI is process-centric — each process gets a rank (like a router ID), and you address messages to ranks.
  • In modern AI it runs alongside NCCL: MPI handles process launch and CPU-side coordination; NCCL moves the GPU data.
📡 Analogy — control plane vs data plane

MPI is your control plane (BGP): it establishes the topology, assigns ranks, coordinates barriers. NCCL is your data plane (MPLS): it actually moves the data at line rate. Many clusters run both, although a launcher such as torchrun or Slurm can take over the bootstrap role where MPI isn't used.
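
The "control plane" part is mostly visible to a process as environment variables: the launcher starts N copies of the program and tells each one its rank and the world size. The sketch below shows one way a process might discover its rank. The helper and type names are hypothetical, and the variable names are the ones these launchers commonly set as far as I know, so check them against your launcher's documentation.

```python
import os
from typing import Mapping, NamedTuple, Optional


class RankInfo(NamedTuple):
    launcher: str
    rank: int
    world_size: int


# (label, rank variable, world-size variable), checked in this order.
# The names are assumptions based on common launcher behaviour.
_LAUNCHER_VARS = [
    ("torchrun", "RANK", "WORLD_SIZE"),
    ("Open MPI", "OMPI_COMM_WORLD_RANK", "OMPI_COMM_WORLD_SIZE"),
    ("MPICH/PMI", "PMI_RANK", "PMI_SIZE"),
    ("Slurm", "SLURM_PROCID", "SLURM_NTASKS"),
]


def discover_rank(env: Optional[Mapping[str, str]] = None) -> RankInfo:
    """Return this process's rank and world size from launcher variables.

    Falls back to a single-process world (rank 0 of 1) when no known
    launcher variables are present.
    """
    env = os.environ if env is None else env
    for launcher, rank_var, size_var in _LAUNCHER_VARS:
        if rank_var in env and size_var in env:
            return RankInfo(launcher, int(env[rank_var]), int(env[size_var]))
    return RankInfo("single process", 0, 1)


if __name__ == "__main__":
    info = discover_rank()
    print(f"launched by {info.launcher}: rank {info.rank} of {info.world_size}")
```

Once every process knows its rank, the data-plane library can build its communicator: the rank plays the role of the router ID, and the world size tells each process how many peers to expect.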


3. Transport-abstraction libraries

These sit below the collective libraries and abstract over the hardware fabric. Write one piece of code; run it over InfiniBand, RoCE, TCP, or shared memory without rewriting the transport logic.

Library | What it abstracts
UCX (Unified Communication X) | IB Verbs, RoCE, TCP, shared memory: one API for all
libfabric (OFI) | OpenFabrics Interface; Slingshot, EFA, PSM2, TCP
Gloo (Meta) | CPU collectives over TCP/Ethernet; PyTorch's fallback
SHARP (NVIDIA/Mellanox) | In-network compute: AllReduce on the IB switch itself

📡 Analogy — UCX as OpenConfig

UCX is to HPC networking what OpenConfig is to network management: a vendor-neutral abstraction layer. Just as OpenConfig configures a Juniper and a Cisco with one data model, UCX lets MPI (or NCCL, through a network plugin) ride InfiniBand, RoCE, or plain TCP through the same API call.
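
The idea of "one API, many fabrics" can be sketched in a few lines. This is not UCX's or libfabric's API: every name below (Transport, REGISTRY, Endpoint) and the priority ordering are hypothetical, chosen only to show the shape of the abstraction. The caller asks for an endpoint and sends bytes, and the layer picks the best transport the node actually has.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Transport:
    name: str
    priority: int                 # higher wins; ordering is illustrative only
    send: Callable[[bytes], str]  # stand-in for the real wire operation


def _make_sender(name: str) -> Callable[[bytes], str]:
    def send(payload: bytes) -> str:
        return f"{len(payload)} bytes via {name}"
    return send


# Hypothetical registry of transports the abstraction layer knows about.
REGISTRY = [
    Transport("shared-memory", 40, _make_sender("shared-memory")),
    Transport("infiniband", 30, _make_sender("infiniband")),
    Transport("roce", 20, _make_sender("roce")),
    Transport("tcp", 10, _make_sender("tcp")),
]


class Endpoint:
    """Caller-facing handle: the same send() regardless of fabric."""

    def __init__(self, detected: Iterable[str]):
        present = set(detected)
        usable = [t for t in REGISTRY if t.name in present]
        if not usable:
            raise RuntimeError("no usable transport detected")
        self.transport = max(usable, key=lambda t: t.priority)

    def send(self, payload: bytes) -> str:
        return self.transport.send(payload)


if __name__ == "__main__":
    print(Endpoint(["tcp", "roce"]).send(b"gradients"))  # picks roce
    print(Endpoint(["tcp"]).send(b"gradients"))          # falls back to tcp
```

The calling code is identical on both nodes. Only the detected fabric changes, which is the property that lets one collective library binary run on very different clusters.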


4. Framework-native communication

Modern AI frameworks ship their own communication abstractions that sit on top of NCCL or MPI.

Framework / component | What it does
PyTorch Distributed (torch.distributed) | Python API for collectives; backend = NCCL, Gloo, or MPI
DeepSpeed ZeRO | Shards optimiser states + gradients; uses NCCL underneath
Megatron-LM | Tensor + pipeline parallelism; custom collective scheduling on NCCL
JAX pjit / shard_map | XLA compiler-level distribution; NCCL/RCCL via XLA collective ops

This is the layer the ML engineer touches. Everything below it is yours, the network engineer's.
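
To see why a framework bothers to shard optimiser states and gradients, and so why it generates the collective traffic your fabric carries, a little arithmetic helps. The sketch below assumes the commonly cited mixed-precision Adam accounting: 2 bytes per parameter for weights, 2 for gradients, and 12 for optimiser state. Those byte counts, the stage numbering, and the function name are assumptions for illustration, not figures from this page, and activation memory is ignored.

```python
def zero_bytes_per_gpu(params: float, gpus: int, stage: int) -> float:
    """Approximate model-state bytes per GPU under a ZeRO-style sharding stage.

    stage 0: nothing sharded (plain data parallelism)
    stage 1: optimiser states sharded across GPUs
    stage 2: optimiser states and gradients sharded
    stage 3: weights sharded as well
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2 or 3")
    weights = 2.0 * params   # fp16 weights (assumed)
    grads = 2.0 * params     # fp16 gradients (assumed)
    optim = 12.0 * params    # fp32 master weights + Adam moments (assumed)
    if stage >= 1:
        optim /= gpus
    if stage >= 2:
        grads /= gpus
    if stage >= 3:
        weights /= gpus
    return weights + grads + optim


if __name__ == "__main__":
    # Hypothetical example: a 7.5-billion-parameter model on 64 GPUs.
    for stage in range(4):
        gb = zero_bytes_per_gpu(7.5e9, 64, stage) / 1e9
        print(f"stage {stage}: {gb:.1f} GB per GPU")
```

The memory saved per GPU is paid for in communication: state that is no longer replicated has to be gathered and scattered over the network every step.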


The whole stack, mapped to OSI

If you think in layers, here's a rough OSI mapping for AI/HPC communication.

Layer | Role | Examples | Note
L7 | AI framework API | PyTorch · TensorFlow · JAX |
L6–5 | Collective library | NCCL · RCCL · MPI | you tune this
L4 | RDMA transport | RoCEv2 · IB RC/UD · UCX |
L3 | Routing / addressing | IP (RoCE) · GRH (IB) |
L2 | Lossless fabric | PFC · credit flow control | must be lossless
L1 | Physical interconnect | NVLink · 400G · PCIe Gen5 |

The library lives at layers 5–6. Everything the library assumes — a lossless layer 2 — is your job.
📡 Key insight — why layer 2 has to be lossless

InfiniBand and RoCEv2 demand a lossless fabric because RDMA's reliable transport recovers from loss badly. Retransmission is done by the NIC and is commonly go-back-N, so a single dropped packet stalls the Queue Pair until a NAK or timeout fires and then forces everything sent after the gap to be resent: TCP head-of-line blocking, but with a far more expensive recovery path. That's why you configure PFC on RoCE and why IB uses credit-based flow control at layer 2. The library assumes the fabric is lossless. Making that assumption hold is the network engineer's job.
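
Credit-based flow control is easy to model: the sender may transmit only as many packets as the receiver has advertised buffer slots for, and it gets a credit back each time the receiver drains a slot. The toy simulation below compares a link with and without credits. It is a sketch of the principle rather than the InfiniBand wire protocol, and the function name and per-tick numbers are hypothetical.

```python
from typing import NamedTuple


class LinkStats(NamedTuple):
    delivered: int       # packets the receiver consumed
    dropped: int         # packets lost to buffer overflow
    held_at_sender: int  # packets still waiting to be sent (backpressure)
    peak_occupancy: int  # highest receive-buffer fill level seen


def run_link(ticks: int, arrivals_per_tick: int, drain_per_tick: int,
             buffer_slots: int, use_credits: bool) -> LinkStats:
    occupancy = 0           # packets sitting in the receive buffer
    credits = buffer_slots  # sender's view of free receive slots
    backlog = 0             # packets queued at the sender
    delivered = dropped = peak = 0

    for _ in range(ticks):
        backlog += arrivals_per_tick

        # With credits the sender never sends more than the receiver can
        # hold; without them it sends everything it has.
        to_send = min(backlog, credits) if use_credits else backlog
        backlog -= to_send
        if use_credits:
            credits -= to_send

        # The receiver accepts what fits; the rest overflows and is dropped.
        accepted = min(to_send, buffer_slots - occupancy)
        dropped += to_send - accepted
        occupancy += accepted
        peak = max(peak, occupancy)

        # The receiver drains some packets and returns one credit for each.
        drained = min(occupancy, drain_per_tick)
        occupancy -= drained
        delivered += drained
        if use_credits:
            credits += drained

    return LinkStats(delivered, dropped, backlog, peak)


if __name__ == "__main__":
    # Offered load (4 per tick) exceeds the drain rate (3 per tick).
    print("credits   :", run_link(100, 4, 3, 16, use_credits=True))
    print("no credits:", run_link(100, 4, 3, 16, use_credits=False))
```

With credits, the overload shows up as a growing queue at the sender and zero drops. Without them, the same overload turns into packet loss at the receiver. PFC reaches a similar result on Ethernet by pausing the upstream port instead of counting credits.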


💡 What you should remember

# | Concept | Why it matters
1 | 🏭 GPU collective libraries are the workhorse | NCCL / RCCL / oneCCL: same API family, different silicon
2 | 🔀 MPI + NCCL = control plane + data plane | MPI launches and coordinates; NCCL moves the bytes
3 | 🧱 UCX / libfabric abstract the fabric | One API over IB, RoCE, TCP, or shared memory
4 | 📐 The collective library lives at OSI 5–6 | Transport (4) is RDMA; the lossless job lives at layer 2

Next: Collective Operations, the Routing Protocols of AI. AllReduce, Broadcast, AllGather, ReduceScatter, AllToAll and friends, each mapped to a protocol you already run.