The Library Landscape

"Communication library" isn't one thing. It's four layers of them, stacked.

After this page, you'll be able to

Tell the four categories apart — GPU collective libraries, MPI, transport-abstraction layers, and framework-native APIs.
Place the big names — where NCCL, MPI, UCX, and PyTorch Distributed sit relative to each other.
Map it onto OSI — layers 1–7 for AI comms, and why layer 2 has to be lossless.

1. GPU collective libraries — the workhorse

Purpose-built for GPU-to-GPU communication during training. They implement the collective operations (AllReduce, Broadcast, AllGather…) and are tightly coupled to GPU memory and the interconnect.

Library	Vendor / target	Key characteristic
NCCL	NVIDIA / CUDA GPUs	Ring AllReduce, NVLink-aware, dominant in production
RCCL	AMD / ROCm GPUs	NCCL-compatible API, MI300X-optimised
oneCCL	Intel / CPUs, Intel GPUs	Built on Intel MPI and libfabric; Gaudi accelerators ship their own HCCL
MSCCL	Microsoft (NCCL fork)	Custom topology schedules, Azure HPC clusters
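The "Ring AllReduce" entry in the table can be illustrated with a small in-process simulation. This is a toy sketch, not NCCL's implementation: real libraries choose between ring, tree, and other schedules and move data over NVLink or the NIC. The function names `ring_allreduce` and `ring_traffic_per_rank` are hypothetical, and the sketch assumes the buffer length divides evenly by the number of ranks.

```python
from typing import List


def ring_allreduce(buffers: List[List[float]]) -> List[List[float]]:
    """Simulate a sum AllReduce over a ring of len(buffers) ranks.

    Phase 1 (reduce-scatter) leaves each rank holding one fully summed chunk.
    Phase 2 (all-gather) passes those finished chunks around the ring.
    """
    n = len(buffers)
    size = len(buffers[0])
    assert all(len(b) == size for b in buffers), "buffers must match in length"
    assert size % n == 0, "toy version: length must divide evenly by rank count"
    chunk = size // n
    data = [list(b) for b in buffers]  # work on copies, leave inputs untouched

    def span(c: int) -> range:
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1: in step s, rank r sends chunk (r - s) mod n to rank r + 1,
    # which adds it to its own copy of that chunk.
    for s in range(n - 1):
        sends = [((r + 1) % n, (r - s) % n) for r in range(n)]
        payloads = [[data[r][i] for i in span(c)] for r, (_, c) in enumerate(sends)]
        for (dst, c), payload in zip(sends, payloads):
            for i, v in zip(span(c), payload):
                data[dst][i] += v

    # Phase 2: rank r now owns the finished chunk (r + 1) mod n. In step s it
    # forwards chunk (r + 1 - s) mod n, and the receiver overwrites its stale copy.
    for s in range(n - 1):
        sends = [((r + 1) % n, (r + 1 - s) % n) for r in range(n)]
        payloads = [[data[r][i] for i in span(c)] for r, (_, c) in enumerate(sends)]
        for (dst, c), payload in zip(sends, payloads):
            for i, v in zip(span(c), payload):
                data[dst][i] = v
    return data


def ring_traffic_per_rank(size: int, n: int) -> int:
    """Elements each rank sends: 2 * (n - 1) steps of one chunk (size // n) each."""
    return 2 * (n - 1) * (size // n)
```

Each rank only ever talks to its right-hand neighbour, and the per-rank traffic approaches twice the buffer size as the ring grows, which is why the ring schedule makes good use of link bandwidth on large buffers.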

📡 Analogy — vendor routing stacks

NCCL is the Cisco IOS of GPU clusters: dominant, battle-tested, deeply hardware-integrated. RCCL is the Junos equivalent — compatible API (think NETCONF), different silicon underneath. oneCCL is the open-source VyOS: flexible, vendor-neutral, not always the fastest.

2. MPI — the standard

The Message Passing Interface is the original HPC communication standard, dating to 1994. It defines a universal API for point-to-point and collective communication between processes — on CPUs, across nodes, over any fabric.

Think of MPI as the IETF of HPC comms: it defines the standard; implementations (Open MPI, MPICH, Intel MPI) provide the code.
MPI is process-centric — each process gets a rank (like a router ID), and you address messages to ranks.
In modern AI it runs alongside NCCL: MPI handles process launch and CPU-side coordination; NCCL moves the GPU data.
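The rank-addressed model described above can be sketched with threads standing in for processes. This is a toy using only the Python standard library, not MPI itself; `ToyComm` and `run_ranks` are hypothetical names. With mpi4py, the corresponding calls would be along the lines of `comm.Get_rank()`, `comm.send(obj, dest=...)`, `comm.recv(source=...)`, and `comm.Barrier()`.

```python
import queue
import threading
from typing import Dict, List


class ToyComm:
    """Hypothetical stand-in for a communicator: ranks, send, recv, barrier."""

    def __init__(self, size: int) -> None:
        self.size = size
        # One inbox per (destination, source) pair so recv can name its sender.
        self._inbox = {
            (d, s): queue.Queue() for d in range(size) for s in range(size)
        }
        self._barrier = threading.Barrier(size)

    def send(self, obj: str, src: int, dest: int) -> None:
        self._inbox[(dest, src)].put(obj)

    def recv(self, dest: int, source: int, timeout: float = 5.0) -> str:
        return self._inbox[(dest, source)].get(timeout=timeout)

    def barrier(self) -> None:
        # Nobody proceeds until every rank has arrived.
        self._barrier.wait(timeout=5.0)


def run_ranks(size: int) -> Dict[int, List[str]]:
    """Rank 0 messages every other rank by number, then all ranks synchronise."""
    comm = ToyComm(size)
    results: Dict[int, List[str]] = {r: [] for r in range(size)}

    def worker(rank: int) -> None:
        if rank == 0:
            for dest in range(1, size):
                comm.send(f"hello rank {dest}", src=0, dest=dest)
        else:
            results[rank].append(comm.recv(dest=rank, source=0))
        comm.barrier()
        results[rank].append("past barrier")

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Calling `run_ranks(4)` gives every non-zero rank one message addressed to its rank number, followed by the barrier marker. The addressing is by rank only; nothing in the caller's code names a host or an interface.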

📡 Analogy — control plane vs data plane

MPI is your control plane (BGP): it establishes the topology, assigns ranks, coordinates barriers. NCCL is your data plane (MPLS): it actually moves the data at line rate. Many clusters run both, although launchers such as `torchrun` can bootstrap NCCL without MPI.

3. Transport-abstraction libraries

These sit below the collective libraries and abstract over the hardware fabric. Write one piece of code; run it over InfiniBand, RoCE, TCP, or shared memory without rewriting the transport logic.

Library	What it abstracts
UCX (Unified Communication X)	IB Verbs, RoCE, TCP, shared memory — one API for all
libfabric (OFI)	OpenFabrics Interface; Slingshot, EFA, PSM2, TCP
Gloo (Meta)	CPU collectives over TCP/Ethernet; PyTorch's fallback
SHARP (NVIDIA/Mellanox)	In-network compute — AllReduce on the IB switch itself
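The "write one piece of code, run it over any fabric" idea can be sketched as one interface with interchangeable backends. This is a toy under stated assumptions, not the UCX or libfabric API: the class names, the `"shm"` and `"tcp"` labels, and `select_transport` are all hypothetical, an in-process queue stands in for shared memory, and a local socket pair stands in for TCP.

```python
import queue
import socket
from abc import ABC, abstractmethod
from typing import Iterable, Tuple


class Transport(ABC):
    """One interface the caller codes against, whatever carries the bytes."""

    @abstractmethod
    def send(self, payload: bytes) -> None: ...

    @abstractmethod
    def recv(self) -> bytes: ...

    def close(self) -> None:
        pass


class InProcessTransport(Transport):
    """Stand-in for a shared-memory path: a queue inside one process."""

    def __init__(self) -> None:
        self._q = queue.Queue()

    def send(self, payload: bytes) -> None:
        self._q.put(payload)

    def recv(self) -> bytes:
        return self._q.get(timeout=5.0)


class SocketTransport(Transport):
    """Stand-in for a TCP path: a socket pair with length-prefixed frames."""

    def __init__(self) -> None:
        self._tx, self._rx = socket.socketpair()
        self._rx.settimeout(5.0)

    def send(self, payload: bytes) -> None:
        # Small payloads only: a large send would block until someone reads.
        self._tx.sendall(len(payload).to_bytes(4, "big") + payload)

    def _read_exact(self, n: int) -> bytes:
        buf = bytearray()
        while len(buf) < n:
            part = self._rx.recv(n - len(buf))
            if not part:
                raise ConnectionError("peer closed the socket")
            buf.extend(part)
        return bytes(buf)

    def recv(self) -> bytes:
        length = int.from_bytes(self._read_exact(4), "big")
        return self._read_exact(length)

    def close(self) -> None:
        self._tx.close()
        self._rx.close()


_REGISTRY = {"shm": InProcessTransport, "tcp": SocketTransport}


def select_transport(
    available: Iterable[str], priority: Tuple[str, ...] = ("shm", "tcp")
) -> Transport:
    """Pick the most preferred transport that the host reports as available."""
    usable = set(available)
    for name in priority:
        if name in usable:
            return _REGISTRY[name]()
    raise RuntimeError("no usable transport")


def echo(transport: Transport, payload: bytes) -> bytes:
    """Caller code: identical regardless of which backend was selected."""
    transport.send(payload)
    return transport.recv()
```

The `echo` function never learns which backend it is using, which is the property the libraries in the table provide at production scale: the selection logic lives in one place and the calling code stays the same.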

📡 Analogy — UCX as OpenConfig

UCX is to HPC networking what OpenConfig is to network management: a vendor-neutral abstraction layer. Just as OpenConfig configures a Juniper and a Cisco with one data model, UCX lets MPI (and NCCL, through a UCX network plugin) ride InfiniBand, RoCE, or plain TCP through the same API call.

4. Framework-native communication

Modern AI frameworks ship their own communication abstractions that sit on top of NCCL or MPI.

Framework / component	What it does
PyTorch Distributed (`torch.distributed`)	Python API for collectives; backend = NCCL, Gloo, or MPI
DeepSpeed ZeRO	Shards optimiser states + gradients; uses NCCL underneath
Megatron-LM	Tensor + pipeline parallelism; custom collective scheduling on NCCL
JAX `pjit` / `shard_map`	XLA compiler-level distribution; NCCL/RCCL via XLA collective ops
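The "backend = NCCL, Gloo, or MPI" choice in the PyTorch row can be sketched as a small decision helper. The helper `choose_backend` and its ordering (NCCL for GPUs, then Gloo, then MPI) are assumptions for illustration, not PyTorch's own logic. The wrapper `init_from_env` uses `torch.distributed` calls that exist in current releases, and it assumes the launcher has already set `RANK`, `WORLD_SIZE`, `MASTER_ADDR`, and `MASTER_PORT`.

```python
def choose_backend(
    cuda_available: bool,
    nccl_available: bool,
    gloo_available: bool = True,
    mpi_available: bool = False,
) -> str:
    """Return a torch.distributed backend name (hypothetical preference order)."""
    if cuda_available and nccl_available:
        return "nccl"  # GPU tensors: the GPU collective library
    if gloo_available:
        return "gloo"  # CPU tensors over TCP/Ethernet
    if mpi_available:
        return "mpi"   # only present when PyTorch was built with MPI support
    raise RuntimeError("no torch.distributed backend is available")


def init_from_env() -> str:
    """Initialise the default process group using the chosen backend.

    torch is imported lazily so the helper above stays usable without it.
    """
    import torch
    import torch.distributed as dist

    backend = choose_backend(
        cuda_available=torch.cuda.is_available(),
        nccl_available=dist.is_nccl_available(),
        gloo_available=dist.is_gloo_available(),
        mpi_available=dist.is_mpi_available(),
    )
    # With no init_method given, rendezvous details come from the environment.
    dist.init_process_group(backend=backend)
    return backend
```

Under a launcher such as `torchrun`, each process would call `init_from_env()` once before issuing any collective. The backend name is the only place where the framework layer reveals which library sits underneath.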

This is the layer the ML engineer touches. Everything below it is the part you own.

The whole stack, mapped to OSI

If you think in layers, here's the closest OSI mapping for AI/HPC communication.

The library lives at layers 5–6. Everything the library assumes — a lossless layer 2 — is your job.

📡 Key insight — why layer 2 has to be lossless

InfiniBand and RoCEv2 demand a lossless fabric because RDMA's loss recovery is crude compared with TCP's. On a reliable-connection Queue Pair, a single dropped packet typically triggers a go-back-N retransmission in the NIC: the Queue Pair stalls while everything from the lost packet onward is resent. Think TCP head-of-line blocking, but with a far more expensive recovery path. That's why you configure PFC on RoCE and why IB uses credit-based flow control at layer 2. The library assumes the fabric is lossless. Making that assumption hold is the network engineer's job.
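The credit-based flow control mentioned above can be sketched as a tick-by-tick simulation of one link. This is a toy model with hypothetical names and parameters: real InfiniBand credits are managed per virtual lane in hardware, and PFC on RoCE reaches a similar result with pause frames rather than credits.

```python
from collections import deque
from typing import Tuple


def simulate_link(
    packets: int, rx_buffer_slots: int, drain_every: int, use_credits: bool
) -> Tuple[int, int]:
    """Return (delivered, dropped) for a fast sender and a slow receiver.

    The sender tries to transmit one packet per tick. The receiver frees one
    buffer slot every `drain_every` ticks. With credits, the sender may only
    transmit while it holds a credit, and one credit equals one free slot.
    """
    buffer = deque()
    credits = rx_buffer_slots  # sender starts with one credit per slot
    delivered = 0
    dropped = 0
    to_send = packets
    tick = 0

    while to_send > 0 or buffer:
        tick += 1

        if to_send > 0:
            if use_credits:
                if credits > 0:
                    credits -= 1
                    buffer.append(tick)
                    to_send -= 1
                # No credit: the sender waits. Back-pressure, nothing is lost.
            else:
                to_send -= 1
                if len(buffer) < rx_buffer_slots:
                    buffer.append(tick)
                else:
                    dropped += 1  # buffer overrun: the packet is gone

        if tick % drain_every == 0 and buffer:
            buffer.popleft()
            delivered += 1
            if use_credits:
                credits += 1  # freed slot is advertised back to the sender

    return delivered, dropped
```

With credits enabled the sender slows to the receiver's pace and every packet arrives. With credits disabled the same traffic overruns the buffer and some packets are dropped, which is the situation a lossless layer 2 is configured to prevent.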

💡 What you should remember

#		Concept	Why it matters
1	🏭	GPU collective libraries are the workhorse	NCCL / RCCL / oneCCL: same collective operations, different silicon
2	🔀	MPI + NCCL = control plane + data plane	MPI launches and coordinates; NCCL moves the bytes
3	🧱	UCX / libfabric abstract the fabric	One API over IB, RoCE, TCP, or shared memory
4	📐	The collective library lives at OSI 5–6	Transport (4) is RDMA; the lossless job lives at layer 2

Next: Collective Operations, the Routing Protocols of AI: AllReduce, Broadcast, AllGather, ReduceScatter, AllToAll and friends, each mapped to a protocol you already run.
