Collective Operations, the Routing Protocols of AI
A collective is an operation that every GPU in the job runs together — and nobody moves on until the group converges.
- Say what a collective is — a group operation every rank joins, like a protocol that needs full convergence before the network is usable.
- Sketch AllReduce — two phases, bandwidth-optimal, and why it's ~90% of training traffic.
- Map all eight — AllReduce, Broadcast, AllGather, ReduceScatter, Scatter, Gather, Barrier, ReduceBroadcast — each to a protocol you already run.
AllReduce — the one that runs every step
AllReduce synchronises gradients at every single training step. It's the operation that makes data-parallel training possible.
AllReduce works like OSPF flooding. Every router (GPU) starts with its own local link-state advertisement (its gradient). The protocol floods until every router holds a complete, globally-consistent link-state database (the averaged gradient). The difference: OSPF converges in seconds; AllReduce has to converge in milliseconds, every few seconds, for weeks.
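Setting the analogy aside, the end state is simple to write down. The following is a minimal single-process sketch of what AllReduce computes, not of how NCCL moves the data. The function name `all_reduce` and the list-of-lists layout (one inner list per rank) are assumptions made for illustration.

```python
def all_reduce(per_rank_grads, average=True):
    """Return what each rank holds after AllReduce (illustrative model only)."""
    n = len(per_rank_grads)
    # Element-wise sum across ranks: position i of every rank is added together.
    summed = [sum(values) for values in zip(*per_rank_grads)]
    result = [s / n for s in summed] if average else summed
    # Every rank ends up with its own copy of the same result.
    return [list(result) for _ in range(n)]


# Four ranks, each with a different local gradient.
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
synced = all_reduce(grads)
print(synced[0])  # every rank holds the same averaged gradient
```

The point to take from the sketch is the contract: different inputs per rank, one identical output on all of them.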
Ring AllReduce, in brief
NCCL's ring algorithm (one of several it can select, alongside tree-based variants) implements AllReduce with near-optimal bandwidth utilisation:
- Each GPU's gradient vector is split into N equal chunks (N = number of GPUs).
- Phase 1 (ReduceScatter): at each step, every GPU sends one chunk to the next GPU in the ring, and the receiver adds it to its own copy of that chunk. After N−1 steps, each GPU holds one fully summed chunk.
- Phase 2 (AllGather): each GPU passes its summed chunk around the ring. After N−1 more steps, every GPU holds the complete summed gradient; dividing by N gives the average.
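The two phases can be sketched as a single-process simulation. This is an illustrative model under stated assumptions, not NCCL's implementation: `ring_all_reduce` is a hypothetical name, ranks are plain Python lists, the vector length is assumed to be divisible by N, and the chunk-indexing convention (rank r starts by sending chunk r) is one common choice among several.

```python
def ring_all_reduce(per_rank, average=False):
    """Simulate ring AllReduce over a list of per-rank vectors (sketch)."""
    n = len(per_rank)
    length = len(per_rank[0])
    assert length % n == 0, "sketch assumes the vector splits into N equal chunks"
    size = length // n
    # chunks[r][c] is rank r's local copy of chunk c.
    chunks = [[list(v[c * size:(c + 1) * size]) for c in range(n)] for v in per_rank]

    # Phase 1, ReduceScatter: receiver ADDS the incoming chunk to its own copy.
    for step in range(n - 1):
        # Snapshot all sends first so the step behaves as if simultaneous.
        msgs = [((r - step) % n, list(chunks[r][(r - step) % n])) for r in range(n)]
        for r, (c, payload) in enumerate(msgs):
            dst = (r + 1) % n
            chunks[dst][c] = [a + b for a, b in zip(chunks[dst][c], payload)]

    # Phase 2, AllGather: receiver OVERWRITES its copy with the finished chunk.
    for step in range(n - 1):
        msgs = [((r + 1 - step) % n, list(chunks[r][(r + 1 - step) % n])) for r in range(n)]
        for r, (c, payload) in enumerate(msgs):
            chunks[(r + 1) % n][c] = payload

    out = [[x for chunk in rank_chunks for x in chunk] for rank_chunks in chunks]
    if average:
        out = [[x / n for x in v] for v in out]
    return out


# Four ranks, eight elements each: rank r holds 10*r + i at position i.
vectors = [[float(10 * r + i) for i in range(8)] for r in range(4)]
result = ring_all_reduce(vectors)
print(result[0])
```

After 2 × (N−1) steps every rank holds the same element-wise sum; passing `average=True` divides by N to give the averaged gradient.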
Data moved per GPU = 2 × (N−1)/N × tensor_size ≈ 2 × tensor_size for large N
That's nearly optimal — the per-GPU cost barely grows with N.
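To see how flat that curve is, the formula can be evaluated for a few ring sizes. This is plain arithmetic on the expression above; the helper name `ring_bytes_per_gpu` is hypothetical, and it counts only the data each GPU sends, ignoring latency and protocol overhead.

```python
def ring_bytes_per_gpu(tensor_bytes, n_gpus):
    """Data each GPU sends in ring AllReduce: 2 * (N-1)/N * tensor size."""
    return 2 * (n_gpus - 1) / n_gpus * tensor_bytes


# Express the cost as a multiple of the tensor size (tensor_bytes = 1.0).
for n in (2, 8, 64, 512):
    print(n, ring_bytes_per_gpu(1.0, n))
# 2   -> 1.0
# 8   -> 1.75
# 64  -> 1.96875
# 512 -> 1.99609375
```

Going from 8 GPUs to 512 raises the per-GPU volume from 1.75× to just under 2× the tensor size, which is why the cost is described as barely growing with N.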
Ring AllReduce behaves much like a token ring. Each GPU is a station; a gradient chunk is the token; each station receives, adds its data, and forwards. Unlike classic token ring, where only the station holding the token may transmit, every station sends at every step, each on a different chunk, so every link carries data throughout both phases and utilisation approaches 100%.
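One way to check the "every link is busy" claim is to write out which chunk each rank sends at each step. The sketch below assumes one common chunk-indexing convention; `ring_send_schedule` is a hypothetical helper, and real libraries may number chunks differently.

```python
def ring_send_schedule(n):
    """Return schedule[step][rank] = index of the chunk that rank sends."""
    # ReduceScatter: rank r sends chunk (r - step) mod n.
    reduce_scatter = [[(r - s) % n for r in range(n)] for s in range(n - 1)]
    # AllGather: rank r sends chunk (r + 1 - step) mod n.
    all_gather = [[(r + 1 - s) % n for r in range(n)] for s in range(n - 1)]
    return reduce_scatter + all_gather


schedule = ring_send_schedule(4)
for step, row in enumerate(schedule):
    print(step, row)
```

Each row lists one chunk per rank and no chunk appears twice in a row, so at every step all N links carry a different chunk at the same time.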
The full on-the-wire picture — 512 simultaneous flows, parallel channels, the microburst shape your fabric has to absorb — is the next page.
All eight collectives, mapped to networking
Every other collective is a variation on the same idea. Here's the whole set against protocols you already run:
| Collective | What it does | Network analogy |
|---|---|---|
| AllReduce | All nodes contribute → sum/avg sent to all | OSPF flooding — everyone shares, everyone gets the full picture |
| Broadcast | One node → same data → all nodes | PIM-SM multicast — one source, tree distribution |
| AllGather | Each contributes a unique chunk → all get the full set | BGP full mesh — every AS shares routes, all build the full table |
| ReduceScatter | Reduce first, then hand each node one unique chunk | Route summarisation + splitting address space across regions |
| Scatter | Root splits data → a unique chunk to each node | Unicast fan-out from a route reflector to its clients |
| Gather | All nodes send to root → root aggregates only | sFlow / NetFlow — endpoints stream to a central collector |
| Barrier | No node proceeds until all have checked in | Two-phase commit / IS-IS adjacency hold-down |
| ReduceBroadcast | Reduce to root, then broadcast the result to all | Centralised NTP — collect offsets, distribute corrected time |
You don't need to memorise all eight. You need to recognise them in a trace, and know which parallelism strategy produces which — that mapping is in Parallelism Strategies.
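For recognising them in a trace, it can help to see the data-moving collectives as input and output shapes. The following single-process sketch models each one over a list with one entry per rank. All function names are illustrative rather than any library's API, chunk sizes are assumed to divide evenly, and Barrier is omitted because it moves no data.

```python
def broadcast(data, root=0):
    # Every rank ends with a copy of the root's data.
    return [list(data[root]) for _ in data]

def scatter(data, root=0):
    # Root's data is split; rank r receives chunk r.
    n, src = len(data), data[root]
    size = len(src) // n
    return [src[r * size:(r + 1) * size] for r in range(n)]

def gather(data, root=0):
    # Only the root receives the concatenation; other ranks receive nothing.
    full = [x for chunk in data for x in chunk]
    return [full if r == root else None for r in range(len(data))]

def all_gather(data):
    # Every rank receives the concatenation of all ranks' chunks.
    full = [x for chunk in data for x in chunk]
    return [list(full) for _ in data]

def reduce_scatter(data):
    # Element-wise sum across ranks, then rank r keeps chunk r of the sum.
    n = len(data)
    summed = [sum(values) for values in zip(*data)]
    size = len(summed) // n
    return [summed[r * size:(r + 1) * size] for r in range(n)]

def all_reduce(data):
    # AllReduce expressed as its two phases.
    return all_gather(reduce_scatter(data))


data = [[1, 2, 3, 4], [10, 20, 30, 40]]  # two ranks, four elements each
print(reduce_scatter(data))  # [[11, 22], [33, 44]]
print(all_reduce(data))      # [[11, 22, 33, 44], [11, 22, 33, 44]]
```

Writing `all_reduce` as `all_gather(reduce_scatter(...))` mirrors the two-phase structure described earlier: the result is the same, and the intermediate state is exactly one summed chunk per rank.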
💡 What you should remember
| # | | Concept | Why it matters |
|---|---|---|---|
| 1 | 🤝 | A collective needs every rank to converge | One slow rank stalls the whole job — like a protocol that won't converge |
| 2 | 🔄 | AllReduce = ReduceScatter + AllGather | Two phases, ~2× the tensor size per GPU, near-optimal at any N |
| 3 | 🗺️ | Every collective maps to a protocol you know | Broadcast = multicast, AllGather = BGP mesh, Barrier = 2-phase commit |
Next: The Collective That Runs Every Step → — AllReduce on the wire: 512 simultaneous flows, parallel NCCL channels, the microburst shape, and the theory-vs-measured gap that's your job to close.