Collectives — the network's job
:::info Bridge — what you already know A collective operation is the same idea as a routing protocol: multiple nodes each have a partial view of the truth, and they coordinate to reach a shared state. BGP converges on a shared RIB. AllReduce converges on a shared gradient. The difference: BGP does it slowly and rarely. AllReduce does it at line rate, every step, across thousands of nodes simultaneously. :::
The four collectives you'll encounter
Every AI framework (PyTorch DDP, DeepSpeed, Megatron-LM, JAX) orchestrates training through one library: NCCL (NVIDIA) or RCCL (AMD) or oneCCL (Intel). These libraries implement a small set of collective operations. Each one makes a different demand on the fabric.
| Collective | Who has what before | Who has what after | Main user |
|---|---|---|---|
| AllReduce | Each rank has a full tensor (different values) | Every rank has the summed/averaged tensor | Data parallelism — gradient sync |
| AllGather | Each rank has a shard of a tensor | Every rank has the full tensor | Tensor parallelism — ZeRO stage 3 |
| ReduceScatter | Each rank has a full tensor | Each rank has one reduced shard | Tensor parallelism — ZeRO stage 2/3 |
| Broadcast | Rank 0 has a tensor | Every rank has rank 0's tensor | Model parameter initialization |
| AllToAll | Each rank has N chunks (one per peer) | Each rank has received one chunk from every peer | Expert parallelism (MoE) |
| Reduce | Each rank has a full tensor | Only rank 0 has the sum | Gradient aggregation to master |
AllReduce — the one that matters most
AllReduce is the dominant collective in training. Understand this one deeply and the others are variations.
What it does: Every rank starts with its own gradient tensor. Every rank ends with the same averaged tensor. No rank is "the server" — the operation is fully distributed.
Naive implementation (don't do this): Every rank sends its full tensor to rank 0. Rank 0 sums them all and sends the result back.
Problem: rank 0 is the bottleneck. Its NIC receives N × tensor_size simultaneously (incast), then sends N × tensor_size back. At 512 GPUs with 28GB gradients, rank 0 needs to move 14TB per step. No NIC can do that.
Ring-AllReduce (what NCCL uses):
Arrange N ranks in a ring. Run in two phases:
Phase 1 — Scatter-Reduce (N-1 steps): Each rank sends one chunk to the next rank and receives one chunk from the previous rank. After each step, the received chunk is summed with the local chunk. After N-1 steps, each rank holds one fully-reduced chunk (the complete sum of that chunk from all ranks).
Phase 2 — AllGather (N-1 steps): Each rank sends its complete chunk to the next rank. After N-1 steps, every rank has received all chunks and holds the full averaged tensor.
Why this is nearly optimal:
Data sent per rank = 2 × (N-1)/N × tensor_size
At N=512:
Data sent per rank = 2 × (511/512) × 28GB ≈ 55.9GB ≈ 2 × tensor_size
Each rank sends at exactly its NIC's full bandwidth.
No bottleneck. Near-perfect utilization.
What it looks like on the wire:
Ring: GPU0 → GPU1 → GPU2 → ... → GPU511 → GPU0
Each GPU simultaneously:
sends one RDMA WRITE to the next GPU in the ring
receives one RDMA WRITE from the previous GPU
Network sees:
512 simultaneous RDMA WRITE flows
Each flow: ~55MB per ring step (28GB ÷ 512 ranks ÷ 2 phases × ...)
All flows launch within microseconds of each other
Perfect all-to-all-adjacent pattern: every NIC is both TX and RX simultaneously
NCCL chooses between Ring and Tree:
NCCL_DEBUG=INFO python train.py 2>&1 | grep "AllReduce"
# NCCL INFO AllReduce: algos[0]=RING, algos[1]=TREE
# NCCL INFO Channel 00/08 : 0 1 2 3 4 5 6 7 0
Ring is preferred for large tensors (bandwidth-bound). Tree is preferred for small tensors (latency-bound). NCCL switches automatically based on message size.
AllGather — ZeRO and tensor parallelism
AllGather: every rank starts with a shard, every rank ends with the full tensor.
Where you see it:
- ZeRO-3 (DeepSpeed): model parameters are sharded across ranks. Before each layer's forward pass, ranks AllGather the weights for that layer. After the layer, the weights are discarded. Net effect: you can train a model that doesn't fit on one GPU.
- Tensor parallelism: within a transformer layer, the weight matrix is sharded across GPUs within a node. After computing the partial result, AllGather combines them.
Traffic pattern:
N ranks, each holding 1/N of a tensor of size S
Phase 1: rank 0 sends shard_0 to all others
Phase 2: rank 1 sends shard_1 to all others
...
Each rank receives (N-1) shards, each of size S/N
Total data received per rank = (N-1)/N × S ≈ S
Unlike AllReduce, there's no reduction operation — just movement. The bottleneck is raw bandwidth.
On the fabric: Each rank's NIC receives data from N-1 others simultaneously. At large N, this is the same incast problem as AllReduce. NCCL uses the same ring algorithm to pipeline the transfers.
ReduceScatter — the other half of AllReduce
ReduceScatter: every rank starts with a full tensor, every rank ends with a different reduced shard.
If AllGather is the second phase of ring AllReduce, ReduceScatter is the first phase. They're designed to pair together: ReduceScatter + AllGather = AllReduce.
Where you see it standalone:
- ZeRO-2: gradients are ReduceScattered to their owners. Each rank only keeps the gradient for the parameters it owns. Memory use per rank: 1/N of total gradient size.
- Tensor parallel backward pass: partial gradients from each GPU are ReduceScattered to compute the final per-shard gradient.
AllToAll — the MoE collective
AllToAll: every rank has N chunks (one per peer) and sends each chunk to its target peer. Every rank ends up with N chunks (one from every peer).
This is the hardest collective for the network.
Where you see it: Mixture of Experts (MoE) models route tokens to specialist sub-networks ("experts"). In a distributed MoE:
- Token routing decides which expert handles which token
- Tokens must move from the GPU that holds them to the GPU that holds the right expert
- After the expert processes the token, the result must come back
Every layer of a MoE model has two AllToAll operations (send to expert, receive back).
Why this is hard:
AllReduce: each rank sends to the next in a ring — predictable, adjacent
AllToAll: each rank sends to every other rank — unpredictable, non-adjacent
Traffic per link depends on token routing:
If tokens cluster to expert 0: GPU 0 receives 10× normal traffic
Other GPUs send to GPU 0 simultaneously → incast
ECMP cannot predict which spine carries which expert's traffic
Adaptive routing helps but cannot eliminate the asymmetry
What you'll see on the fabric:
- Highly variable link utilization — some paths carry 5× more than others within a single AllToAll
- DCQCN rate reduction on hot paths, leaving other paths underutilized
- ECMP imbalance is much worse for AllToAll than for ring AllReduce
How NCCL maps collectives to the fabric
NCCL doesn't just call one collective at a time. It pipelines operations and uses multiple communication "channels" in parallel:
# See NCCL's channel configuration:
NCCL_DEBUG=INFO python train.py 2>&1 | grep "Channel"
# Example output:
# NCCL INFO Channel 00/08 : 0 1 2 3 4 5 6 7 ← GPU ordering in ring 0
# NCCL INFO Channel 01/08 : 0 2 4 6 1 3 5 7 ← GPU ordering in ring 1 (interleaved)
# NCCL INFO Channel 02/08 : 0 4 1 5 2 6 3 7 ← ring 2
# ...
# NCCL INFO Channel 07/08 : ... ← ring 7 (8 total)
8 rings × 8 NICs = NCCL assigns one NIC per ring per direction. Each NIC carries independent AllReduce traffic for a different chunk of the gradient. This is what fully utilizes all 8 NICs on each node simultaneously.
The QP count implication:
Each NCCL channel establishes Queue Pairs with every peer:
QPs per node = channels × 2 (send + recv) × (N_nodes - 1) × QPs_per_connection
8 channels × 2 × 63 nodes × 2 QPs_per_connection = 2,016 QPs per node
At 512 GPUs (64 nodes): 2,016 × 64 = 129,024 QPs across the cluster
Each QP is an RDMA connection with its own:
- Send Queue (SQ) and Receive Queue (RQ) in the RNIC
- Memory registration table entry
- PSN (sequence number) tracking
At 512 GPUs, initialization takes 20–120 seconds because creating and transitioning 129,024 QPs through INIT → RTR → RTS is sequential per-rank.
Collective timing at scale
How long does an AllReduce take? The simple bandwidth formula:
time = 2 × (N-1)/N × tensor_size / bandwidth_per_NIC
≈ 2 × tensor_size / bandwidth_per_NIC (at large N)
Example: 7B model, 8×400G NICs, 512 GPUs
= 2 × 28GB / (8 × 50 GB/s)
= 56GB / 400 GB/s
= 140ms
But actual measured: 180–250ms
Gap = ECMP imbalance + PFC pause events + DCQCN rate reduction
+ QP setup overhead + NCCL synchronization overhead
That gap — the 40–110ms between theory and practice — is the network engineer's job.
What you should be able to do now
- Explain what AllReduce does and why it's nearly optimal bandwidth-wise
- Describe why AllToAll is harder for the fabric than AllReduce
- Interpret NCCL channel configuration output
- Calculate expected AllReduce time and compare it to measured time
- Explain why QP count at scale causes slow NCCL initialization
Where it breaks
AllReduce stall — one rank never arriving
One GPU process crashes or hangs during the backward pass. It never calls into the AllReduce. The other N-1 GPUs enter the AllReduce and wait. Forever — or until NCCL_TIMEOUT fires.
# Tune NCCL timeout (seconds):
export NCCL_TIMEOUT=1800 # 30 minutes (default varies by framework)
# Watch for:
# [W ProcessGroupNCCL.cpp] Timeout at NCCL work ...
# This means one rank died. Check the node for OOM, GPU error, hardware failure.
AllToAll imbalance tanking utilization
MoE training at 256+ GPUs. Some experts are much more popular than others ("hot experts"). The GPUs holding those experts receive 5–10× normal AllToAll traffic. Their ToR ports hit 90% utilization. DCQCN reduces all senders' rates. The other GPUs, sending to non-hot experts, now also send slowly — even though their paths aren't congested.
# Check per-spine utilization:
# (on your switch CLI — Arista example)
show interfaces ethernet 1/1 counters rates
# If one spine is at 90% and others at 30%, you have hot-expert imbalance.
# Mitigation: expert load balancing (auxiliary load balancing loss term)
# or capacity-based token routing in the model code — this is an ML problem,
# but you need to recognize it from the network side.
What's next
You now know what the collectives are and what they produce on the wire. The next question is: how does parallelism strategy change which collectives run, how often, and on which paths?