
Collective Operations, the Routing Protocols of AI

A collective is an operation that every GPU in the job runs together — and nobody moves on until the group converges.

After this page, you'll be able to
  1. Say what a collective is — a group operation every rank joins, like a protocol that needs full convergence before the network is usable.
  2. Sketch AllReduce — two phases, bandwidth-optimal, and why it's ~90% of training traffic.
  3. Map all eight — AllReduce, Broadcast, AllGather, ReduceScatter, Scatter, Gather, Barrier, ReduceBroadcast — each to a protocol you already run.

AllReduce — the one that runs every step

AllReduce synchronises gradients at every single training step. It's the operation that makes data-parallel training possible.
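
To make the semantics concrete, here is a minimal sketch in plain Python, with ranks modelled as entries in a list rather than real GPUs. The function name `allreduce_mean` and the toy gradient values are illustrative assumptions, not part of NCCL or any framework API.

```python
def allreduce_mean(per_rank_grads):
    """Return what each rank holds after an averaging AllReduce.

    per_rank_grads: one equal-length list of floats per rank.
    """
    n_ranks = len(per_rank_grads)
    # Elementwise sum across all ranks.
    summed = [sum(vals) for vals in zip(*per_rank_grads)]
    averaged = [s / n_ranks for s in summed]
    # Every rank ends up with its own copy of the same result.
    return [list(averaged) for _ in range(n_ranks)]


# Four data-parallel ranks, each with a different local gradient (toy values).
local_grads = [
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
    [7.0, 8.0],
]
synced = allreduce_mean(local_grads)
print(synced[0])  # every rank holds the same averaged gradient
```

Because every rank applies the same averaged gradient, the model replicas stay identical from step to step, which is what data-parallel training relies on.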

📡 Analogy — OSPF flooding

AllReduce works like OSPF flooding. Every router (GPU) starts with its own local link-state advertisement (its gradient). The protocol floods until every router holds a complete, globally-consistent link-state database (the averaged gradient). The difference: OSPF converges in seconds; AllReduce has to converge in milliseconds, every few seconds, for weeks.

Ring AllReduce, in brief

NCCL can run AllReduce as a ring (it also has tree-based algorithms), and the ring gets near-optimal bandwidth utilisation:

  • Each GPU's gradient vector is split into N equal chunks (N = number of GPUs).
  • Phase 1, ReduceScatter: at each step, every GPU sends one chunk to the next GPU in the ring, and the receiver adds it to its own copy of that chunk. After N−1 steps, each GPU holds one fully-summed chunk.
  • Phase 2, AllGather: each GPU passes its summed chunk around the ring. After N−1 more steps, every GPU holds the complete summed gradient; dividing by N gives the average.

Data moved per GPU = 2 × (N−1)/N × tensor_size ≈ 2 × tensor_size for large N

That's nearly optimal — the per-GPU cost barely grows with N.
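
The two phases can be traced with a small simulation. This is a sketch under stated assumptions, not NCCL's implementation: ranks are list entries, "sending" is a list copy, the vector length is assumed divisible by N, and the name `ring_allreduce` is hypothetical. It produces the sum (divide by N afterwards for the average) and counts how many elements each rank sends.

```python
def ring_allreduce(per_rank_data):
    """Simulate a summing ring AllReduce.

    Returns (result_per_rank, elements_sent_per_rank).
    Assumes every vector has the same length, divisible by the rank count.
    """
    n = len(per_rank_data)
    size = len(per_rank_data[0])
    assert size % n == 0, "sketch assumes size divisible by N"
    chunk = size // n
    # buf[r][c] is rank r's current copy of chunk c.
    buf = [[list(vec[c * chunk:(c + 1) * chunk]) for c in range(n)]
           for vec in per_rank_data]
    sent = [0] * n

    # Phase 1: ReduceScatter. At step s, rank r sends chunk (r - s) % n
    # to rank r + 1, which adds it to its own copy.
    for s in range(n - 1):
        # Snapshot payloads first so all sends in a step are simultaneous.
        msgs = [(r, (r - s) % n, list(buf[r][(r - s) % n])) for r in range(n)]
        for r, c, payload in msgs:
            dst = (r + 1) % n
            buf[dst][c] = [a + b for a, b in zip(buf[dst][c], payload)]
            sent[r] += len(payload)
    # Now rank r holds the fully summed chunk (r + 1) % n.

    # Phase 2: AllGather. At step s, rank r forwards chunk (r + 1 - s) % n;
    # the receiver overwrites its stale copy.
    for s in range(n - 1):
        msgs = [(r, (r + 1 - s) % n, list(buf[r][(r + 1 - s) % n]))
                for r in range(n)]
        for r, c, payload in msgs:
            dst = (r + 1) % n
            buf[dst][c] = payload
            sent[r] += len(payload)

    result = [[x for c in range(n) for x in buf[r][c]] for r in range(n)]
    return result, sent


# Four ranks, eight elements each (toy values).
data = [[(r + 1) * 10 + i for i in range(8)] for r in range(4)]
result, sent = ring_allreduce(data)
print(result[0])
print(sent)
```

With N = 4 and 8 elements, each rank sends 2 × (4−1)/4 × 8 = 12 elements, matching the formula above.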

📡 Analogy — token ring

Ring AllReduce behaves much like a token ring. Each GPU is a station; the gradient chunk is the token; each station receives, adds its data, and forwards. Unlike classic token ring, where only one station transmits at a time, every GPU sends one chunk and receives another at every step, so every link is always carrying data and utilisation approaches 100%.

The full on-the-wire picture — 512 simultaneous flows, parallel channels, the microburst shape your fabric has to absorb — is the next page.


All eight collectives, mapped to networking

Every other collective is a variation on the same idea. Here's the whole set against protocols you already run:

[Figure: five movement patterns. AllReduce (sum → everyone), Broadcast (one → all), AllGather (each → all hold all), ReduceScatter (sum, then split), AllToAll (all → all, MoE).]
Every collective is one of these movement patterns. AllReduce (top-left) is ~90% of training traffic; AllToAll (bottom-right) is the one that hurts.
Collective | What it does | Network analogy
AllReduce | All nodes contribute → sum/avg sent to all | OSPF flooding: everyone shares, everyone gets the full picture
Broadcast | One node → same data → all nodes | PIM-SM multicast: one source, tree distribution
AllGather | Each contributes a unique chunk → all get the full set | BGP full mesh: every AS shares routes, all build the full table
ReduceScatter | Reduce first, then hand each node one unique chunk | Route summarisation + splitting address space across regions
Scatter | Root splits data → a unique chunk to each node | Unicast fan-out from a route reflector to its clients
Gather | All nodes send to root → root aggregates only | sFlow / NetFlow: endpoints stream to a central collector
Barrier | No node proceeds until all have checked in | Two-phase commit / IS-IS adjacency hold-down
ReduceBroadcast | Reduce to root, then broadcast the result to all | Centralised NTP: collect offsets, distribute corrected time
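
The data-moving rows of the table can be written down as reference semantics. The sketch below uses plain Python lists, one per rank, and shows only what each rank holds afterwards, not how the data travels. All function names are illustrative assumptions rather than a real library's API, and Barrier is omitted because it moves no payload.

```python
def _total(data):
    # Elementwise sum across ranks.
    return [sum(vals) for vals in zip(*data)]

def allreduce(data):
    total = _total(data)
    return [list(total) for _ in data]

def broadcast(data, root=0):
    return [list(data[root]) for _ in data]

def allgather(data):
    full = [x for vec in data for x in vec]
    return [list(full) for _ in data]

def reduce_scatter(data):
    n, total = len(data), _total(data)
    chunk = len(total) // n
    return [total[r * chunk:(r + 1) * chunk] for r in range(n)]

def scatter(data, root=0):
    n = len(data)
    chunk = len(data[root]) // n
    return [data[root][r * chunk:(r + 1) * chunk] for r in range(n)]

def gather(data, root=0):
    # Only the root receives anything; other ranks get None.
    full = [x for vec in data for x in vec]
    return [full if r == root else None for r in range(len(data))]

def reduce_broadcast(data, root=0):
    # Reduce to root, then broadcast: same end state as allreduce.
    total = _total(data)
    staged = [total if r == root else list(v) for r, v in enumerate(data)]
    return broadcast(staged, root)


# Four ranks, four elements each (toy values).
ranks = [
    [1, 2, 3, 4],
    [10, 20, 30, 40],
    [100, 200, 300, 400],
    [1000, 2000, 3000, 4000],
]
print(allreduce(ranks)[0])
print(reduce_scatter(ranks))
print(gather(ranks))
```

Comparing `reduce_scatter` followed by `allgather` against `allreduce` on the same input is a quick way to see why the ring algorithm splits into those two phases.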

You don't need to memorise all eight. You need to recognise them in a trace, and know which parallelism strategy produces which — that mapping is in Parallelism Strategies.


💡 What you should remember

# | Concept | Why it matters
1 | 🤝 A collective needs every rank to converge | One slow rank stalls the whole job, like a protocol that won't converge
2 | 🔄 AllReduce = ReduceScatter + AllGather | Two phases, ~2× the tensor size per GPU, near-optimal at any N
3 | 🗺️ Every collective maps to a protocol you know | Broadcast = multicast, AllGather = BGP mesh, Barrier = 2-phase commit
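
The first row can be illustrated with threads standing in for ranks. This is a toy sketch using Python's standard `threading.Barrier`, not a GPU collective; the rank count, the straggler index, and the 50 ms delay are arbitrary assumptions.

```python
import threading
import time

N_RANKS = 4
STRAGGLER = 2  # hypothetical slow rank
barrier = threading.Barrier(N_RANKS)
arrived = [0.0] * N_RANKS   # when each rank reached the barrier
released = [0.0] * N_RANKS  # when each rank was allowed to continue


def rank_step(rank):
    if rank == STRAGGLER:
        time.sleep(0.05)  # simulate slow compute on one rank
    arrived[rank] = time.monotonic()
    barrier.wait()  # nobody passes until all N_RANKS have arrived
    released[rank] = time.monotonic()


threads = [threading.Thread(target=rank_step, args=(r,))
           for r in range(N_RANKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Time each rank spent waiting on the group.
stall = [released[r] - arrived[r] for r in range(N_RANKS)]
for r, s in enumerate(stall):
    print(f"rank {r} waited {s * 1000:.1f} ms")
```

The fast ranks typically report a wait close to the straggler's delay, while the straggler itself barely waits: the group moves at the pace of its slowest member.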

Next: The Collective That Runs Every Step → — AllReduce on the wire: 512 simultaneous flows, parallel NCCL channels, the microburst shape, and the theory-vs-measured gap that's your job to close.