Skip to main content

Collectives — the network's job

:::info Bridge — what you already know A collective operation is the same idea as a routing protocol: multiple nodes each have a partial view of the truth, and they coordinate to reach a shared state. BGP converges on a shared RIB. AllReduce converges on a shared gradient. The difference: BGP does it slowly and rarely. AllReduce does it at line rate, every step, across thousands of nodes simultaneously. :::


The four collectives you'll encounter

Every AI framework (PyTorch DDP, DeepSpeed, Megatron-LM, JAX) orchestrates training through one library: NCCL (NVIDIA) or RCCL (AMD) or oneCCL (Intel). These libraries implement a small set of collective operations. Each one makes a different demand on the fabric.

CollectiveWho has what beforeWho has what afterMain user
AllReduceEach rank has a full tensor (different values)Every rank has the summed/averaged tensorData parallelism — gradient sync
AllGatherEach rank has a shard of a tensorEvery rank has the full tensorTensor parallelism — ZeRO stage 3
ReduceScatterEach rank has a full tensorEach rank has one reduced shardTensor parallelism — ZeRO stage 2/3
BroadcastRank 0 has a tensorEvery rank has rank 0's tensorModel parameter initialization
AllToAllEach rank has N chunks (one per peer)Each rank has received one chunk from every peerExpert parallelism (MoE)
ReduceEach rank has a full tensorOnly rank 0 has the sumGradient aggregation to master

AllReduce — the one that matters most

AllReduce is the dominant collective in training. Understand this one deeply and the others are variations.

What it does: Every rank starts with its own gradient tensor. Every rank ends with the same averaged tensor. No rank is "the server" — the operation is fully distributed.

Naive implementation (don't do this): Every rank sends its full tensor to rank 0. Rank 0 sums them all and sends the result back.

Problem: rank 0 is the bottleneck. Its NIC receives N × tensor_size simultaneously (incast), then sends N × tensor_size back. At 512 GPUs with 28GB gradients, rank 0 needs to move 14TB per step. No NIC can do that.

Ring-AllReduce (what NCCL uses):

Arrange N ranks in a ring. Run in two phases:

Phase 1 — Scatter-Reduce (N-1 steps): Each rank sends one chunk to the next rank and receives one chunk from the previous rank. After each step, the received chunk is summed with the local chunk. After N-1 steps, each rank holds one fully-reduced chunk (the complete sum of that chunk from all ranks).

Phase 2 — AllGather (N-1 steps): Each rank sends its complete chunk to the next rank. After N-1 steps, every rank has received all chunks and holds the full averaged tensor.

Why this is nearly optimal:

Data sent per rank = 2 × (N-1)/N × tensor_size

At N=512:
Data sent per rank = 2 × (511/512) × 28GB ≈ 55.9GB ≈ 2 × tensor_size

Each rank sends at exactly its NIC's full bandwidth.
No bottleneck. Near-perfect utilization.

What it looks like on the wire:

Ring: GPU0 → GPU1 → GPU2 → ... → GPU511 → GPU0

Each GPU simultaneously:
sends one RDMA WRITE to the next GPU in the ring
receives one RDMA WRITE from the previous GPU

Network sees:
512 simultaneous RDMA WRITE flows
Each flow: ~55MB per ring step (28GB ÷ 512 ranks ÷ 2 phases × ...)
All flows launch within microseconds of each other
Perfect all-to-all-adjacent pattern: every NIC is both TX and RX simultaneously

NCCL chooses between Ring and Tree:

NCCL_DEBUG=INFO python train.py 2>&1 | grep "AllReduce"
# NCCL INFO AllReduce: algos[0]=RING, algos[1]=TREE
# NCCL INFO Channel 00/08 : 0 1 2 3 4 5 6 7 0

Ring is preferred for large tensors (bandwidth-bound). Tree is preferred for small tensors (latency-bound). NCCL switches automatically based on message size.


AllGather — ZeRO and tensor parallelism

AllGather: every rank starts with a shard, every rank ends with the full tensor.

Where you see it:

  • ZeRO-3 (DeepSpeed): model parameters are sharded across ranks. Before each layer's forward pass, ranks AllGather the weights for that layer. After the layer, the weights are discarded. Net effect: you can train a model that doesn't fit on one GPU.
  • Tensor parallelism: within a transformer layer, the weight matrix is sharded across GPUs within a node. After computing the partial result, AllGather combines them.

Traffic pattern:

N ranks, each holding 1/N of a tensor of size S

Phase 1: rank 0 sends shard_0 to all others
Phase 2: rank 1 sends shard_1 to all others
...
Each rank receives (N-1) shards, each of size S/N
Total data received per rank = (N-1)/N × S ≈ S

Unlike AllReduce, there's no reduction operation — just movement. The bottleneck is raw bandwidth.

On the fabric: Each rank's NIC receives data from N-1 others simultaneously. At large N, this is the same incast problem as AllReduce. NCCL uses the same ring algorithm to pipeline the transfers.


ReduceScatter — the other half of AllReduce

ReduceScatter: every rank starts with a full tensor, every rank ends with a different reduced shard.

If AllGather is the second phase of ring AllReduce, ReduceScatter is the first phase. They're designed to pair together: ReduceScatter + AllGather = AllReduce.

Where you see it standalone:

  • ZeRO-2: gradients are ReduceScattered to their owners. Each rank only keeps the gradient for the parameters it owns. Memory use per rank: 1/N of total gradient size.
  • Tensor parallel backward pass: partial gradients from each GPU are ReduceScattered to compute the final per-shard gradient.

AllToAll — the MoE collective

AllToAll: every rank has N chunks (one per peer) and sends each chunk to its target peer. Every rank ends up with N chunks (one from every peer).

This is the hardest collective for the network.

Where you see it: Mixture of Experts (MoE) models route tokens to specialist sub-networks ("experts"). In a distributed MoE:

  • Token routing decides which expert handles which token
  • Tokens must move from the GPU that holds them to the GPU that holds the right expert
  • After the expert processes the token, the result must come back

Every layer of a MoE model has two AllToAll operations (send to expert, receive back).

Why this is hard:

AllReduce: each rank sends to the next in a ring — predictable, adjacent
AllToAll: each rank sends to every other rank — unpredictable, non-adjacent

Traffic per link depends on token routing:
If tokens cluster to expert 0: GPU 0 receives 10× normal traffic
Other GPUs send to GPU 0 simultaneously → incast
ECMP cannot predict which spine carries which expert's traffic
Adaptive routing helps but cannot eliminate the asymmetry

What you'll see on the fabric:

  • Highly variable link utilization — some paths carry 5× more than others within a single AllToAll
  • DCQCN rate reduction on hot paths, leaving other paths underutilized
  • ECMP imbalance is much worse for AllToAll than for ring AllReduce

How NCCL maps collectives to the fabric

NCCL doesn't just call one collective at a time. It pipelines operations and uses multiple communication "channels" in parallel:

# See NCCL's channel configuration:
NCCL_DEBUG=INFO python train.py 2>&1 | grep "Channel"

# Example output:
# NCCL INFO Channel 00/08 : 0 1 2 3 4 5 6 7 ← GPU ordering in ring 0
# NCCL INFO Channel 01/08 : 0 2 4 6 1 3 5 7 ← GPU ordering in ring 1 (interleaved)
# NCCL INFO Channel 02/08 : 0 4 1 5 2 6 3 7 ← ring 2
# ...
# NCCL INFO Channel 07/08 : ... ← ring 7 (8 total)

8 rings × 8 NICs = NCCL assigns one NIC per ring per direction. Each NIC carries independent AllReduce traffic for a different chunk of the gradient. This is what fully utilizes all 8 NICs on each node simultaneously.

The QP count implication:

Each NCCL channel establishes Queue Pairs with every peer:

QPs per node = channels × 2 (send + recv) × (N_nodes - 1) × QPs_per_connection

8 channels × 2 × 63 nodes × 2 QPs_per_connection = 2,016 QPs per node

At 512 GPUs (64 nodes): 2,016 × 64 = 129,024 QPs across the cluster

Each QP is an RDMA connection with its own:

  • Send Queue (SQ) and Receive Queue (RQ) in the RNIC
  • Memory registration table entry
  • PSN (sequence number) tracking

At 512 GPUs, initialization takes 20–120 seconds because creating and transitioning 129,024 QPs through INIT → RTR → RTS is sequential per-rank.


Collective timing at scale

How long does an AllReduce take? The simple bandwidth formula:

time = 2 × (N-1)/N × tensor_size / bandwidth_per_NIC
≈ 2 × tensor_size / bandwidth_per_NIC (at large N)

Example: 7B model, 8×400G NICs, 512 GPUs
= 2 × 28GB / (8 × 50 GB/s)
= 56GB / 400 GB/s
= 140ms

But actual measured: 180–250ms

Gap = ECMP imbalance + PFC pause events + DCQCN rate reduction
+ QP setup overhead + NCCL synchronization overhead

That gap — the 40–110ms between theory and practice — is the network engineer's job.


What you should be able to do now

  • Explain what AllReduce does and why it's nearly optimal bandwidth-wise
  • Describe why AllToAll is harder for the fabric than AllReduce
  • Interpret NCCL channel configuration output
  • Calculate expected AllReduce time and compare it to measured time
  • Explain why QP count at scale causes slow NCCL initialization

Where it breaks

AllReduce stall — one rank never arriving

One GPU process crashes or hangs during the backward pass. It never calls into the AllReduce. The other N-1 GPUs enter the AllReduce and wait. Forever — or until NCCL_TIMEOUT fires.

# Tune NCCL timeout (seconds):
export NCCL_TIMEOUT=1800 # 30 minutes (default varies by framework)

# Watch for:
# [W ProcessGroupNCCL.cpp] Timeout at NCCL work ...
# This means one rank died. Check the node for OOM, GPU error, hardware failure.

AllToAll imbalance tanking utilization

MoE training at 256+ GPUs. Some experts are much more popular than others ("hot experts"). The GPUs holding those experts receive 5–10× normal AllToAll traffic. Their ToR ports hit 90% utilization. DCQCN reduces all senders' rates. The other GPUs, sending to non-hot experts, now also send slowly — even though their paths aren't congested.

# Check per-spine utilization:
# (on your switch CLI — Arista example)
show interfaces ethernet 1/1 counters rates
# If one spine is at 90% and others at 30%, you have hot-expert imbalance.

# Mitigation: expert load balancing (auxiliary load balancing loss term)
# or capacity-based token routing in the model code — this is an ML problem,
# but you need to recognize it from the network side.

What's next

You now know what the collectives are and what they produce on the wire. The next question is: how does parallelism strategy change which collectives run, how often, and on which paths?

Next → Parallelism Strategies