What a Communication Library Is

When a model outgrows a single GPU, the GPUs have to talk. The communication library is what they talk through.

After this page, you'll be able to:

Place the library in the stack — it's the transport + session layer for tensors, sitting between the AI framework above and the fabric below.
Explain why TCP sockets don't scale — one AllReduce on a 70B model moves ~280 GB per step: ~22 s on 100 GbE, ~2 s with RDMA.
Name the four problems it solves — gradient sync, bandwidth waste, latency jitter, and topology blindness.

What it actually is

A large model — GPT-4, Llama-3 — doesn't fit on one GPU, so you spread it across hundreds or thousands of them. The moment you do, those GPUs must constantly exchange data: gradients, synchronised weights, activations.

A communication library is the software layer that makes that conversation happen reliably and at the speed of the hardware. It hides the physical fabric (NVLink, PCIe, InfiniBand, RoCE) behind one clean API that it exposes to the framework above.

In one sentence: a communication library is the transport + session layer for tensors in a distributed AI or HPC system.

What it replaces

Without one, every AI framework would hand-write its own InfiniBand Verbs calls, manage RDMA queue pairs, handle retransmission, detect topology, and optimise data paths.

That is exactly what happened in the early days of deep learning — and it was painful.

Libraries like NCCL exist to solve that once, for everyone.

Why a socket won't do

To see the problem the library solves, look at a single training step.

Step 1 — Data parallelism. The dataset is split across N GPU workers. Each worker processes a different mini-batch and computes its own gradients — the error signals used to update the weights.

Step 2 — Gradient synchronisation. Before updating weights, every worker must agree on the same averaged gradient. If GPU 0 computed ∂L/∂W = 1.2 and GPU 1 computed 0.8, the averaged gradient is (1.2 + 0.8) / 2 = 1.0. Every GPU must hold that 1.0 before it takes the next step.

Step 3 — Weight update. Only after synchronisation does each GPU update its local copy of the model. Skip the sync and the copies diverge — like routers with split-brain routing tables.
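The three steps can be sketched as a toy simulation in which each "GPU" is just a Python float. The starting weight, the learning rate, and the helper names `allreduce_mean` and `sgd_step` are hypothetical, chosen only for illustration; a real library performs the averaging across devices rather than in one process.

```python
# Toy model of one data-parallel training step with two workers.
# All names and values here are illustrative assumptions, not a library API.

def allreduce_mean(local_grads):
    """Return the list of gradients after sync: every worker holds the mean."""
    mean = sum(local_grads) / len(local_grads)
    return [mean] * len(local_grads)


def sgd_step(weights, grads, lr):
    """Each worker updates its own weight copy with the gradient it holds."""
    return [w - lr * g for w, g in zip(weights, grads)]


local_grads = [1.2, 0.8]   # GPU 0 and GPU 1, the values from Step 2
weights = [0.5, 0.5]       # hypothetical starting weight, identical on both
lr = 0.1                   # hypothetical learning rate

# With synchronisation: both workers apply the same averaged gradient.
synced = sgd_step(weights, allreduce_mean(local_grads), lr)

# Without synchronisation: each worker applies only its local gradient.
unsynced = sgd_step(weights, local_grads, lr)

print("synced:  ", synced)     # the two copies stay identical
print("unsynced:", unsynced)   # the two copies have diverged
```

After a single step the unsynchronised copies already differ, and the gap compounds with every further step.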

📡 Analogy — BGP convergence

Gradient sync is BGP route convergence. Each AS (GPU) has its own local view (its gradient). The protocol (AllReduce) has to drive every AS to the same globally-consistent table (the averaged gradient) before any traffic (the next training step) is forwarded.

The numbers are brutal

You could write gradient sync over TCP. It falls apart on scale and latency.

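A single AllReduce on a 70B-parameter model moves ~280 GB of gradient data per step (70 billion gradients × 4 bytes each, assuming FP32).
At 100 Gbps Ethernet (12.5 GB/s) that's ~22 seconds per step at line rate. With kernel-bypass RDMA on InfiniBand HDR the figure drops to ~2 seconds, but only when traffic is striped across several links: a single 200 Gbps HDR link would still need ~11 seconds for 280 GB.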
Libraries exploit RDMA: data moves from GPU memory to GPU memory across the fabric, with the CPU and the OS kernel off the data path.
They also run topology-aware algorithms: traffic takes NVLink between GPUs in the same node where it is available, PCIe otherwise within the host, and InfiniBand between hosts, to squeeze out effective bandwidth.
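The arithmetic behind these figures can be checked with a short back-of-the-envelope sketch. It assumes FP32 gradients (4 bytes each), decimal units (1 GB = 10^9 bytes), and line-rate transfer with no protocol overhead; the function names are made up for this example.

```python
# Back-of-the-envelope wire-time estimate. Assumptions: FP32 gradients,
# decimal gigabytes, and links running at full line rate with no overhead.

def gradient_payload_gb(params, bytes_per_value=4):
    """Size of one full set of gradients in gigabytes."""
    return params * bytes_per_value / 1e9


def wire_time_s(payload_gb, link_gbps):
    """Seconds to push payload_gb over a link of link_gbps (gigabits/s)."""
    return payload_gb * 8 / link_gbps


def required_gbps(payload_gb, target_s):
    """Aggregate bandwidth needed to move payload_gb in target_s seconds."""
    return payload_gb * 8 / target_s


payload = gradient_payload_gb(70e9)   # 70B parameters
t_100 = wire_time_s(payload, 100)     # one 100 Gbps Ethernet link
t_200 = wire_time_s(payload, 200)     # one 200 Gbps InfiniBand HDR link
need = required_gbps(payload, 2.0)    # bandwidth implied by a 2 s step

print(f"payload: {payload:.0f} GB")
print(f"100 Gbps: {t_100:.1f} s, 200 Gbps: {t_200:.1f} s")
print(f"2 s per step needs about {need:.0f} Gbps in aggregate")
```

Under these assumptions a single 200 Gbps link lands at roughly 11 seconds, so a figure near 2 seconds implies about 1.1 Tbps of aggregate bandwidth, which is more than one link can supply.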

📡 Analogy — SR-IOV and kernel bypass

RDMA is to TCP/IP what SR-IOV is to a virtualised NIC. SR-IOV lets a VM hit the physical NIC directly, bypassing the hypervisor. RDMA lets a GPU cross the fabric directly, bypassing the OS kernel. Both delete a software bottleneck to get closer to wire speed.

The four problems it solves

Problem	Without a library	What the library does
Gradient sync	Workers diverge; the model never learns	AllReduce collective
Bandwidth waste	Naive point-to-point leaves links idle	Ring / tree algorithms that saturate every link
Latency jitter	The OS kernel network stack can add milliseconds of jitter per hop	RDMA kernel bypass, microsecond-scale latency
Topology blindness	Every path treated equally; NVLink wasted	Auto-detection: NVLink > PCIe > InfiniBand > Ethernet

Everything else in this section is a detail of one of these four.
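To make the "ring" entry in the table concrete, here is a minimal single-process sketch of a ring AllReduce (a reduce-scatter phase followed by an all-gather phase). It simulates workers as Python lists and is not how any particular library implements it; the function name `ring_allreduce` and the requirement that the buffer length divide evenly by the worker count are simplifying assumptions.

```python
# Single-process simulation of ring AllReduce (sum). Illustrative only.

def ring_allreduce(buffers):
    """Sum equal-length lists across workers arranged in a ring.

    buffers[r] is worker r's local vector. Returns one list per worker,
    each holding the element-wise sum of all inputs.
    """
    n = len(buffers)
    size = len(buffers[0])
    assert size % n == 0, "sketch assumes length divisible by worker count"
    chunk = size // n
    data = [list(b) for b in buffers]   # do not mutate the caller's lists

    def span(c):
        return range(c * chunk, (c + 1) * chunk)

    def exchange(chunk_to_send, combine):
        # Snapshot every outgoing chunk first so that all n transfers in
        # this step behave as if they happened at the same time.
        sends = []
        for r in range(n):
            c = chunk_to_send(r)
            sends.append((c, [data[r][i] for i in span(c)]))
        for r in range(n):
            c, payload = sends[(r - 1) % n]   # receive from left neighbour
            for i, v in zip(span(c), payload):
                data[r][i] = combine(data[r][i], v)

    # Phase 1, reduce-scatter: after n-1 steps worker r holds the complete
    # sum for chunk (r + 1) % n.
    for step in range(n - 1):
        exchange(lambda r: (r - step) % n, lambda mine, theirs: mine + theirs)

    # Phase 2, all-gather: circulate the finished chunks around the ring.
    for step in range(n - 1):
        exchange(lambda r: (r + 1 - step) % n, lambda mine, theirs: theirs)

    return data


grads = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
result = ring_allreduce(grads)
print(result)   # every worker ends with the same summed vector
```

In every step each worker sends one chunk to its right neighbour and receives one from its left, so all ring links are busy at once. Over the whole operation each worker sends 2(N−1)/N of the buffer, regardless of how many workers there are.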

💡 What you should remember

#		Concept	Why it matters
1	🧩	The library is the transport + session layer for tensors	It sits between your AI framework and the fabric you run on
2	🚫	TCP can't carry ~280 GB every step	22 s vs 2 s — RDMA kernel-bypass isn't optional at scale
3	🗺️	Topology awareness is built in	NVLink > PCIe > IB > Ethernet, chosen for you

Next: The Library Landscape → — the four kinds of library (GPU collective, MPI, transport abstraction, framework-native), and the whole thing mapped onto the OSI layers you already think in.
