The Collective That Runs Every Step
90% of what runs on an AI fabric is AllReduce. Understand this one and the rest are variations.
What AllReduce actually does
Every GPU starts with its own gradient tensor (different values, computed from its own training data slice).
After AllReduce, every GPU holds the same tensor: the sum (or average) across all GPUs.
No GPU is "the server." No coordinator. The operation is fully distributed — by design, because anything else creates a bottleneck.
The naive version is unworkable
Try "send everything to GPU 0 and broadcast back" at 512 GPUs with 28 GB gradients:
- GPU 0 receives 28 GB from every other GPU → 14 TB of incast on one NIC.
- GPU 0 sends 28 GB back to every other GPU → another 14 TB outbound.
No NIC in existence can do that. NCCL doesn't try.
Ring AllReduce — what NCCL actually does
Arrange N GPUs in a logical ring: GPU0 → GPU1 → GPU2 → ... → GPU(N-1) → GPU0. Run two phases:
Phase 1 — Scatter-Reduce (N-1 steps): Each GPU sends one chunk to the next in the ring, receives one chunk from the previous. The received chunk gets summed with the local one. After N-1 steps, every GPU holds one complete reduced chunk — the sum of that chunk across all GPUs.
Phase 2 — AllGather (N-1 steps): Each GPU broadcasts its complete chunk around the ring. After N-1 steps, every GPU holds the full reduced tensor.
Why this is nearly bandwidth-optimal:
Data each GPU sends = 2 × (N-1)/N × tensor_size
≈ 2 × tensor_size (at large N)
At N=512, 28 GB gradient:
Data per GPU = ~56 GB
Every GPU sends at NIC line rate, simultaneously.
No bottleneck. Bandwidth-optimal.
This is what your fabric has to handle. N GPUs all sending at NIC line rate, simultaneously, in a stable adjacent-pair pattern. Every step. For weeks.
What it looks like on the wire
The moment AllReduce launches:
- 512 simultaneous RDMA WRITE flows (one per ring position)
- Each flow ~55 MB per ring step
- All flows start within microseconds of each other
- Every NIC is both transmitting and receiving at line rate
- Traffic shape is a microburst — peak fabric utilization, then quiet, repeat every few seconds
This is why your ECMP hash distribution matters. This is why hash polarization is catastrophic here. This is why DCQCN tuning has a dollar value.
NCCL builds many rings in parallel
NCCL doesn't build one ring — it builds 8 to 16, mapping each to a different NIC + GPU pair:
NCCL_DEBUG=INFO python train.py 2>&1 | grep "Channel"
# NCCL INFO Channel 00/08 : 0 1 2 3 4 5 6 7 ← ring 0
# NCCL INFO Channel 01/08 : 0 2 4 6 1 3 5 7 ← ring 1 (interleaved)
# NCCL INFO Channel 02/08 : 0 4 1 5 2 6 3 7
# ...
Each ring carries a different chunk of the gradient. Effect: each of the 8 NICs on a node runs an independent AllReduce in parallel — that's how 8 × 400 Gbps of host bandwidth actually gets used.
For 512 GPUs across 64 nodes, NCCL creates roughly 2,000 QP connections per node to make this work — which is why initialization can take 20–120 seconds before training even starts.
The other collectives, briefly
Four more you'll meet in the wild. All variations on the same theme.
AllGather — every GPU starts with a shard, ends with the full tensor. Used by ZeRO (memory-efficient training) and tensor parallelism. Same ring mechanism as AllReduce phase 2.
ReduceScatter — every GPU starts with the full tensor, ends with one reduced shard. ZeRO-2 uses this for gradient distribution. Same as AllReduce phase 1.
Broadcast — rank 0 has a tensor, every other rank ends with rank 0's copy. Used for parameter initialization at job start. Tiny traffic, once.
AllToAll — every GPU has N chunks (one per peer) and sends each chunk to its target peer. This is the hardest one for your fabric. No ring structure. Every GPU sends to every other, all at once, with traffic volumes that depend on token routing — i.e., unpredictable.
AllToAll = the MoE problem
If the customer is training a Mixture-of-Experts model — Mixtral, GPT-4 (rumored), DeepSeek — AllToAll is the dominant collective. And it's painful:
- Traffic per link depends on token routing — some experts get hot, others stay cold
- ECMP can't predict which spine carries which expert's traffic
- DCQCN rate reduction on hot paths leaves other paths underutilized
- Adaptive routing helps but cannot eliminate the asymmetry
When one spine port is hammered at 90% while three others sit at 30%, and the training is MoE, it's almost certainly hot-expert imbalance. That's a model-side problem — but you'll be the first to spot it.
Theory vs reality
The bandwidth-optimal AllReduce time:
time ≈ 2 × tensor_size / bandwidth_per_NIC
7B model, 8 × 400G NICs, 512 GPUs:
time = 2 × 28 GB / (8 × 50 GB/s)
= 140 ms
But measured: 180–250 ms
That 40–110 ms gap between theoretical and measured is your job. ECMP imbalance, PFC pause events, DCQCN rate adjustments, QP setup overhead — every one of them widens the gap.
Closing that gap pays for itself in days.
What you should remember
- AllReduce is 90% of training traffic. Two phases. Ring mechanics. Bandwidth-optimal at large N.
- N simultaneous flows, all at line rate, every few seconds. That's the shape that defines your fabric.
- Multiple rings in parallel (one per NIC) is how 8 × 400 Gbps actually gets used.
- AllToAll = MoE = your hardest problem. Unpredictable per-link load, hot experts, ECMP can't help.
- The 40–110 ms gap between theory and measured AllReduce is what good fabric engineering recovers.
Next: When Training Slows → — MFU as the readout, the diagnosis ladder, and the network engineer's reflex for "the fabric is slow" tickets that aren't actually about the fabric.