What a Communication Library Is

When a model outgrows a single GPU, the GPUs have to talk. The communication library is what they talk through.

After this page, you'll be able to
  1. Place the library in the stack — it's the transport + session layer for tensors, sitting between the AI framework above and the fabric below.
  2. Explain why TCP sockets don't scale — one AllReduce on a 70B model moves ~280 GB per step: ~22 s on 100 GbE, ~2 s with RDMA.
  3. Name the four problems it solves — gradient sync, bandwidth waste, latency jitter, and topology blindness.

What it actually is

A large model — GPT-4, Llama-3 — doesn't fit on one GPU, so you spread it across hundreds or thousands of them. The moment you do, those GPUs must constantly exchange data: gradients, synchronised weights, activations.

A communication library is the software layer that makes that conversation happen — reliably, and at the speed of the hardware. It abstracts the physical fabric (NVLink, PCIe, InfiniBand, RoCE) behind one clean API to the framework above.

In one sentence: a communication library is the transport + session layer for tensors in a distributed AI or HPC system.

What it replaces

Without one, every AI framework would hand-write its own InfiniBand Verbs calls, manage RDMA queue pairs, handle retransmission, detect topology, and optimise data paths.

That is exactly what happened in the early days of deep learning — and it was painful.

Libraries like NCCL exist to solve that once, for everyone.


Why a socket won't do

To see the problem the library solves, look at a single training step.

Step 1 — Data parallelism. The dataset is split across N GPU workers. Each worker processes a different mini-batch and computes its own gradients — the error signals used to update the weights.

Step 2 — Gradient synchronisation. Before updating weights, every worker must agree on the same averaged gradient. If GPU 0 computed ∂L/∂W = 1.2 and GPU 1 computed 0.8, the averaged gradient is (1.2 + 0.8) / 2 = 1.0. Every GPU must hold that 1.0 before it takes the next step.

Step 3 — Weight update. Only after synchronisation does each GPU update its local copy of the model. Skip the sync and the copies diverge — like routers with split-brain routing tables.
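
The three steps can be sketched in plain Python, with no GPUs or network involved. This is only an illustration: `all_reduce_mean` and `sgd_step` are hypothetical helper names, the learning rate of 0.1 and the starting weight of 0.5 are made-up values, and the "workers" are just lists in one process. It reuses the 1.2 and 0.8 gradients from Step 2 and shows that the weight copies stay identical with the sync and drift apart without it.

```python
def all_reduce_mean(local_grads):
    """Simulate an averaging AllReduce: every worker gets the same mean gradient."""
    n = len(local_grads)
    length = len(local_grads[0])
    # Element-wise sum across workers, divided by the number of workers.
    averaged = [sum(g[i] for g in local_grads) / n for i in range(length)]
    # Each worker receives its own copy of the identical result.
    return [list(averaged) for _ in range(n)]


def sgd_step(weights, grad, lr):
    """Plain gradient-descent update on one worker's local weight copy."""
    return [w - lr * g for w, g in zip(weights, grad)]


# Two workers start from identical weights (hypothetical value 0.5) ...
weights = [[0.5], [0.5]]
# ... but each saw a different mini-batch, so their gradients differ.
local_grads = [[1.2], [0.8]]

# With synchronisation: both workers apply the same averaged gradient.
synced_grads = all_reduce_mean(local_grads)
weights_synced = [sgd_step(w, g, lr=0.1) for w, g in zip(weights, synced_grads)]

# Without synchronisation: each worker applies only its own gradient.
weights_unsynced = [sgd_step(w, g, lr=0.1) for w, g in zip(weights, local_grads)]

print(synced_grads)      # both workers hold the same averaged gradient
print(weights_synced)    # copies remain identical
print(weights_unsynced)  # copies have diverged after a single step
```

A real library performs the same reduction across devices and hosts; the point here is only the invariant that every worker leaves Step 2 with an identical gradient.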

📡 Analogy — BGP convergence

Gradient sync is BGP route convergence. Each AS (GPU) has its own local view (its gradient). The protocol (AllReduce) has to drive every AS to the same globally-consistent table (the averaged gradient) before any traffic (the next training step) is forwarded.

The numbers are brutal

You could write gradient sync over TCP. It falls apart on scale and latency.

  • A single AllReduce on a 70B-parameter model moves ~280 GB of gradient data per step (70B gradients × 4 bytes each, assuming FP32).
  • Over one 100 Gbps Ethernet link that's ~22 seconds per step. On InfiniBand HDR (200 Gbps per link) with kernel-bypass RDMA and several links per node working in parallel: ~2 seconds.
  • Libraries exploit RDMA: data moves GPU memory → GPU memory across the fabric, bypassing the CPU and the OS kernel entirely.
  • They also run topology-aware algorithms: traffic takes different paths over NVLink (GPU to GPU inside a node), PCIe (inside a node where no NVLink path exists), or InfiniBand (between nodes) to squeeze out effective bandwidth.

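
The arithmetic behind these figures can be checked with a few lines of Python. This is a back-of-the-envelope sketch under stated assumptions: gradients are taken to be FP32 (4 bytes each), the payload crosses a single link once, and protocol overhead and the AllReduce algorithm's own traffic pattern are ignored. `transfer_seconds` and `required_gbps` are hypothetical helper names.

```python
def transfer_seconds(payload_bytes, link_gbps):
    """Seconds to push payload_bytes through one link of link_gbps gigabits/s."""
    bytes_per_second = link_gbps * 1e9 / 8
    return payload_bytes / bytes_per_second


def required_gbps(payload_bytes, target_seconds):
    """Aggregate bandwidth (gigabits/s) needed to move the payload in target_seconds."""
    return payload_bytes * 8 / target_seconds / 1e9


PARAMS = 70e9        # 70B parameters, from the text
BYTES_PER_GRAD = 4   # assumption: FP32 gradients
payload = PARAMS * BYTES_PER_GRAD   # 2.8e11 bytes, i.e. ~280 GB

t_100g = transfer_seconds(payload, 100)   # one 100 Gbps link
t_200g = transfer_seconds(payload, 200)   # one 200 Gbps (HDR) link
needed = required_gbps(payload, 2.0)      # bandwidth implied by a 2 s step

print(f"payload: {payload / 1e9:.0f} GB")
print(f"100 Gbps link: {t_100g:.1f} s")
print(f"200 Gbps link: {t_200g:.1f} s")
print(f"bandwidth for a 2 s step: {needed:.0f} Gbps")
```

Under these assumptions one 100 Gbps link gives about 22.4 s and one 200 Gbps link about 11.2 s, so a step of roughly 2 s implies on the order of 1.1 Tbps of aggregate bandwidth, for example several links per node used in parallel. That reading of the ~2 s figure is an inference from the arithmetic, not something the text states.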
📡 Analogy — SR-IOV and kernel bypass

RDMA is to TCP/IP what SR-IOV is to a virtualised NIC. SR-IOV lets a VM hit the physical NIC directly, bypassing the hypervisor. RDMA lets a GPU cross the fabric directly, bypassing the OS kernel. Both delete a software bottleneck to get closer to wire speed.


The four problems it solves

| Problem | Without a library | What the library does |
| --- | --- | --- |
| Gradient sync | Workers diverge; the model never learns | AllReduce collective |
| Bandwidth waste | Naive point-to-point leaves links idle | Ring / tree algorithms that saturate every link |
| Latency spikes | The OS kernel adds milliseconds of jitter per hop | RDMA kernel bypass: microsecond latency |
| Topology blindness | Every path treated equally; NVLink wasted | Auto-detection: NVLink > PCIe > InfiniBand > Ethernet |

Everything else in this section is a detail of one of these four.
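
To make the "ring algorithm" entry concrete, here is a minimal single-process simulation of a ring AllReduce (sum) in plain Python. It is a sketch of the general technique, not the implementation of NCCL or any other library: the workers are list entries, "sending" is a list copy, and the function name `ring_all_reduce` plus the requirement that the vector splits evenly into one chunk per worker are simplifying assumptions.

```python
def ring_all_reduce(buffers):
    """Sum-reduce equal-length vectors across n simulated workers in a ring.

    buffers[r] is worker r's local vector. Returns new per-worker vectors,
    each equal to the element-wise sum of all inputs.
    """
    n = len(buffers)
    size = len(buffers[0])
    assert size % n == 0, "sketch assumes the vector splits into n equal chunks"
    chunk = size // n
    data = [list(b) for b in buffers]  # work on copies, leave inputs untouched

    def span(c):
        """Index range covered by chunk c."""
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1, reduce-scatter: in step s, worker r sends chunk (r - s) % n to
    # its right neighbour, which adds it into its own copy of that chunk.
    for s in range(n - 1):
        # Snapshot all outgoing messages first: every worker sends at once.
        sends = [(r, (r - s) % n) for r in range(n)]
        msgs = [(r, c, [data[r][i] for i in span(c)]) for r, c in sends]
        for r, c, payload in msgs:
            dst = (r + 1) % n
            for i, v in zip(span(c), payload):
                data[dst][i] += v
    # Now worker r holds the fully reduced chunk (r + 1) % n.

    # Phase 2, all-gather: in step s, worker r forwards chunk (r + 1 - s) % n,
    # and the receiver overwrites its stale copy with the finished values.
    for s in range(n - 1):
        sends = [(r, (r + 1 - s) % n) for r in range(n)]
        msgs = [(r, c, [data[r][i] for i in span(c)]) for r, c in sends]
        for r, c, payload in msgs:
            dst = (r + 1) % n
            for i, v in zip(span(c), payload):
                data[dst][i] = v
    return data


# Three simulated workers, six gradient values each (made-up numbers).
grads = [
    [1, 2, 3, 4, 5, 6],
    [10, 20, 30, 40, 50, 60],
    [100, 200, 300, 400, 500, 600],
]
reduced = ring_all_reduce(grads)
for rank, buf in enumerate(reduced):
    print(rank, buf)
```

Each worker sends 2 × (n − 1) chunks of size/n elements, one per step, and in every step all n links of the ring carry data at the same time. That is the sense in which a ring keeps every link busy, where a naive scheme that funnels everything through one worker leaves most links idle.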


💡 What you should remember

| # | Concept | Why it matters |
| --- | --- | --- |
| 1 | 🧩 The library is the transport + session layer for tensors | It sits between your AI framework and the fabric you run on |
| 2 | 🚫 TCP can't carry ~280 GB every step | 22 s vs 2 s: RDMA kernel bypass isn't optional at scale |
| 3 | 🗺️ Topology awareness is built in | NVLink > PCIe > IB > Ethernet, chosen for you |

Next: The Library Landscape → — the four kinds of library (GPU collective, MPI, transport abstraction, framework-native), and the whole thing mapped onto the OSI layers you already think in.