Collectives — the network's job

:::info Bridge: what you already know
A collective operation is the same idea as a routing protocol: multiple nodes each have a partial view of the truth, and they coordinate to reach a shared state. BGP converges on a shared RIB. AllReduce converges on a shared gradient. The difference: BGP does it slowly and rarely. AllReduce does it at line rate, every step, across thousands of nodes simultaneously.
:::

The four collectives you'll encounter

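Every AI framework (PyTorch DDP, DeepSpeed, Megatron-LM, JAX) orchestrates multi-GPU training through a collective communication library: NCCL (NVIDIA), RCCL (AMD), or oneCCL (Intel). These libraries implement a small set of collective operations, and each one makes a different demand on the fabric. The table below lists six; the four that dominate fabric traffic (AllReduce, AllGather, ReduceScatter, AllToAll) get their own sections.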

Collective	Who has what before	Who has what after	Main user
AllReduce	Each rank has a full tensor (different values)	Every rank has the summed/averaged tensor	Data parallelism — gradient sync
AllGather	Each rank has a shard of a tensor	Every rank has the full tensor	Tensor parallelism — ZeRO stage 3
ReduceScatter	Each rank has a full tensor	Each rank has one reduced shard	Tensor parallelism — ZeRO stage 2/3
Broadcast	Rank 0 has a tensor	Every rank has rank 0's tensor	Model parameter initialization
AllToAll	Each rank has N chunks (one per peer)	Each rank has received one chunk from every peer	Expert parallelism (MoE)
Reduce	Each rank has a full tensor	Only rank 0 has the sum	Gradient aggregation to master

AllReduce — the one that matters most

AllReduce is the dominant collective in training. Understand this one deeply and the others are variations.

What it does: Every rank starts with its own gradient tensor. Every rank ends with the same averaged tensor. No rank is "the server" — the operation is fully distributed.

Naive implementation (don't do this): Every rank sends its full tensor to rank 0. Rank 0 sums them all and sends the result back.

Problem: rank 0 is the bottleneck. Its NIC receives (N-1) × tensor_size at once (incast), then sends the same amount back. At 512 GPUs with 28GB gradients, that is roughly 14TB in each direction, every step. No NIC can do that.

Ring-AllReduce (what NCCL uses):

Arrange N ranks in a ring. Run in two phases:

Phase 1 — Scatter-Reduce (N-1 steps): Each rank sends one chunk to the next rank and receives one chunk from the previous rank. After each step, the received chunk is summed with the local chunk. After N-1 steps, each rank holds one fully-reduced chunk (the complete sum of that chunk from all ranks).

Phase 2 — AllGather (N-1 steps): Each rank sends its complete chunk to the next rank. After N-1 steps, every rank has received all chunks and holds the full averaged tensor.
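The two phases above can be sketched as a small single-process simulation. This is only an illustration of the chunk schedule, not NCCL's implementation: `ring_all_reduce` is a hypothetical helper, ranks are plain Python lists, and it assumes the tensor length divides evenly by N. It produces the sum; dividing each element by N afterwards gives the average.

```python
def ring_all_reduce(tensors):
    """Simulate ring AllReduce (sum). `tensors` holds one equal-length list per rank."""
    n = len(tensors)
    length = len(tensors[0])
    assert length % n == 0, "sketch assumes the tensor splits evenly into N chunks"
    size = length // n
    # data[r][c] is rank r's local copy of chunk c.
    data = [[list(t[c * size:(c + 1) * size]) for c in range(n)] for t in tensors]

    # Phase 1: scatter-reduce. At step s, rank r sends chunk (r - s) mod N.
    for step in range(n - 1):
        # Snapshot the payloads first so all sends in a step happen "at once".
        sends = [(r, (r - step) % n, list(data[r][(r - step) % n])) for r in range(n)]
        for src, chunk, payload in sends:
            dst = (src + 1) % n
            data[dst][chunk] = [a + b for a, b in zip(data[dst][chunk], payload)]
    # Rank r now holds the fully reduced chunk (r + 1) mod N.

    # Phase 2: all-gather. At step s, rank r forwards chunk (r + 1 - s) mod N.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n, list(data[r][(r + 1 - step) % n])) for r in range(n)]
        for src, chunk, payload in sends:
            data[(src + 1) % n][chunk] = payload

    # Flatten each rank's chunks back into one tensor.
    return [[x for chunk in rank for x in chunk] for rank in data]


grads = [
    [1, 2, 3, 4],
    [10, 20, 30, 40],
    [100, 200, 300, 400],
    [1000, 2000, 3000, 4000],
]
reduced = ring_all_reduce(grads)
```

Every entry of `reduced` is the same summed tensor, and each rank only ever talked to its ring neighbor.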

Why this is nearly optimal:

Data sent per rank = 2 × (N-1)/N × tensor_size

At N=512:
  Data sent per rank = 2 × (511/512) × 28GB ≈ 55.9GB ≈ 2 × tensor_size

Each rank sends and receives on its ring links at the same time, so every NIC can run close to line rate.
No single-rank bottleneck. Near-perfect utilization.
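To make the comparison with the naive scheme concrete, here is a minimal sketch of both traffic formulas. The function names are illustrative, and the naive count uses N-1 because the root's own tensor never crosses the network.

```python
def naive_root_bytes(n_ranks, tensor_bytes):
    """Bytes rank 0 must receive (and then send again) in the gather-and-broadcast scheme."""
    return (n_ranks - 1) * tensor_bytes


def ring_bytes_per_rank(n_ranks, tensor_bytes):
    """Bytes each rank sends in ring AllReduce: 2 x (N-1)/N x tensor_size."""
    return 2 * (n_ranks - 1) / n_ranks * tensor_bytes


GB = 1e9
naive_tb = naive_root_bytes(512, 28 * GB) / 1e12   # per direction, at rank 0 only
ring_gb = ring_bytes_per_rank(512, 28 * GB) / GB   # per rank, every rank

print(f"naive: rank 0 moves {naive_tb:.1f} TB each way")
print(f"ring:  every rank sends {ring_gb:.1f} GB")
```

The ring figure barely changes as N grows, which is why the per-rank cost is described as roughly 2 × tensor_size regardless of cluster size.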

What it looks like on the wire:

Ring: GPU0 → GPU1 → GPU2 → ... → GPU511 → GPU0

Each GPU simultaneously:
  sends one RDMA WRITE to the next GPU in the ring
  receives one RDMA WRITE from the previous GPU

Network sees:
  512 simultaneous RDMA WRITE flows
  Each flow: ~55MB per ring step (28GB ÷ 512 ranks = one chunk)
  All flows launch within microseconds of each other
  Neighbor-to-neighbor pattern: every NIC is both TX and RX simultaneously

NCCL chooses between Ring and Tree:

NCCL_DEBUG=INFO python train.py 2>&1 | grep "AllReduce"
# Illustrative output (exact wording varies by NCCL version):
# NCCL INFO AllReduce: algos[0]=RING, algos[1]=TREE
# NCCL INFO Channel 00/08 : 0 1 2 3 4 5 6 7 0

Ring is preferred for large tensors (bandwidth-bound). Tree is preferred for small tensors (latency-bound). NCCL switches automatically based on message size.
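One way to see why message size decides the winner is a textbook latency-bandwidth cost model. The sketch below is an assumption-heavy illustration, not NCCL's tuner: `alpha_s` (per-hop latency) and the idealized pipelined tree cost are modeling choices introduced here, and the 5 µs and 50 GB/s figures are placeholders.

```python
import math


def ring_time_s(n, size_bytes, alpha_s, bw_bytes_per_s):
    """Ring: 2(N-1) steps, each paying one latency and moving 1/N of the tensor."""
    steps = 2 * (n - 1)
    return steps * alpha_s + steps / n * size_bytes / bw_bytes_per_s


def tree_time_s(n, size_bytes, alpha_s, bw_bytes_per_s):
    """Idealized pipelined tree: 2*log2(N) latency hops, tensor crosses a link twice."""
    hops = 2 * math.ceil(math.log2(n))
    return hops * alpha_s + 2 * size_bytes / bw_bytes_per_s


def faster_algo(n, size_bytes, alpha_s=5e-6, bw_bytes_per_s=50e9):
    ring = ring_time_s(n, size_bytes, alpha_s, bw_bytes_per_s)
    tree = tree_time_s(n, size_bytes, alpha_s, bw_bytes_per_s)
    return "RING" if ring < tree else "TREE"


small = faster_algo(8, 1e3)   # 1 KB message: latency term dominates
large = faster_algo(8, 1e9)   # 1 GB message: bandwidth term dominates
```

In this model the ring's latency term grows linearly with N while the tree's grows logarithmically, so the crossover size moves up as the job gets larger.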

AllGather — ZeRO and tensor parallelism

AllGather: every rank starts with a shard, every rank ends with the full tensor.

Where you see it:

ZeRO-3 (DeepSpeed): model parameters are sharded across ranks. Before each layer's forward pass, ranks AllGather the weights for that layer. After the layer, the weights are discarded. Net effect: you can train a model that doesn't fit on one GPU.
Tensor parallelism: within a transformer layer, the weight matrix is sharded across GPUs within a node. After computing the partial result, AllGather combines them.

Traffic pattern:

N ranks, each holding 1/N of a tensor of size S

Logical data movement (what must arrive where, not the wire schedule):
  shard_0 from rank 0 reaches all other ranks
  shard_1 from rank 1 reaches all other ranks
  ...
Each rank receives (N-1) shards, each of size S/N
Total data received per rank = (N-1)/N × S ≈ S

Unlike AllReduce, there's no reduction operation — just movement. The bottleneck is raw bandwidth.

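On the fabric: If every rank sent its shard directly to every peer, each NIC would receive from N-1 senders at once. At large N, that is the same incast problem as naive AllReduce. NCCL avoids it by running AllGather over the same ring algorithm, pipelining the transfers so each NIC receives from one neighbor at a time.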
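A minimal sketch of AllGather over a ring, assuming one equal-sized shard per rank. `ring_all_gather` is a hypothetical helper that also counts how many elements each rank receives, so the (N-1)/N × S figure can be checked directly.

```python
def ring_all_gather(shards):
    """Simulate ring AllGather. `shards[r]` is the list owned by rank r."""
    n = len(shards)
    have = [{r: list(shards[r])} for r in range(n)]  # shard index -> data, per rank
    received = [0] * n                               # elements received per rank

    for step in range(n - 1):
        # At step s, rank r forwards shard (r - s) mod N: its own shard first,
        # then whatever it received in the previous step.
        for src in range(n):
            idx = (src - step) % n
            dst = (src + 1) % n
            have[dst][idx] = list(have[src][idx])
            received[dst] += len(have[src][idx])

    gathered = [[x for i in range(n) for x in have[r][i]] for r in range(n)]
    return gathered, received


gathered, received = ring_all_gather([[0, 1], [2, 3], [4, 5]])
```

Each rank ends with the full six-element tensor after receiving four elements, which is (N-1)/N of the total, and in every step it receives from exactly one neighbor.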

ReduceScatter — the other half of AllReduce

ReduceScatter: every rank starts with a full tensor, every rank ends with a different reduced shard.

If AllGather is the second phase of ring AllReduce, ReduceScatter is the first phase. They're designed to pair together: ReduceScatter + AllGather = AllReduce.

Where you see it standalone:

ZeRO-2: gradients are ReduceScattered to their owners. Each rank only keeps the gradient for the parameters it owns. Memory use per rank: 1/N of total gradient size.
Tensor parallel backward pass: partial gradients from each GPU are ReduceScattered to compute the final per-shard gradient.
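The identity ReduceScatter + AllGather = AllReduce can be checked with reference semantics. The three functions below are illustrative definitions of what each collective produces (no ring, no network) and assume the tensor length divides evenly by the number of ranks.

```python
def reduce_scatter(tensors):
    """Every rank contributes a full tensor; rank r keeps only reduced shard r."""
    n = len(tensors)
    size = len(tensors[0]) // n
    summed = [sum(column) for column in zip(*tensors)]
    return [summed[r * size:(r + 1) * size] for r in range(n)]


def all_gather(shards):
    """Every rank contributes one shard; every rank ends with the concatenation."""
    full = [x for shard in shards for x in shard]
    return [list(full) for _ in shards]


def all_reduce(tensors):
    """Every rank contributes a full tensor; every rank ends with the sum."""
    summed = [sum(column) for column in zip(*tensors)]
    return [list(summed) for _ in tensors]


grads = [[1, 2, 3, 4], [10, 20, 30, 40]]
shards = reduce_scatter(grads)      # what ZeRO-2 keeps: one shard per rank
rebuilt = all_gather(shards)        # gathering the shards completes the AllReduce
```

After `reduce_scatter`, each rank holds 1/N of the gradient, which is the memory saving described for ZeRO-2.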

AllToAll — the MoE collective

AllToAll: every rank has N chunks (one per peer) and sends each chunk to its target peer. Every rank ends up with N chunks (one from every peer).
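Seen as data, AllToAll is a transpose of the chunk matrix. A minimal sketch of those semantics, with `all_to_all` as an illustrative reference function rather than any library's API:

```python
def all_to_all(send):
    """send[src][dst] is the chunk rank `src` holds for rank `dst`.
    Returns recv, where recv[dst][src] is the chunk `dst` got from `src`."""
    n = len(send)
    return [[send[src][dst] for src in range(n)] for dst in range(n)]


send = [
    ["a0", "a1", "a2"],   # rank 0's chunks for ranks 0, 1, 2
    ["b0", "b1", "b2"],   # rank 1's chunks
    ["c0", "c1", "c2"],   # rank 2's chunks
]
recv = all_to_all(send)
```

Applying `all_to_all` twice returns the original layout, which mirrors the send-to-expert and receive-back pair in MoE layers.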

This is the hardest collective for the network.

Where you see it: Mixture of Experts (MoE) models route tokens to specialist sub-networks ("experts"). In a distributed MoE:

Token routing decides which expert handles which token
Tokens must move from the GPU that holds them to the GPU that holds the right expert
After the expert processes the token, the result must come back

Every MoE layer has two AllToAll operations per forward pass: one to dispatch tokens to their experts, one to bring the results back.

Why this is hard:

AllReduce: each rank sends to the next in a ring — predictable, adjacent
AllToAll: each rank sends to every other rank — unpredictable, non-adjacent

Traffic per link depends on token routing:
  If tokens cluster to expert 0: GPU 0 receives 10× normal traffic
  Other GPUs send to GPU 0 simultaneously → incast
  ECMP cannot predict which spine carries which expert's traffic
  Adaptive routing helps but cannot eliminate the asymmetry
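The incast effect of a hot expert can be sketched with a traffic matrix. The numbers are made up for illustration: 8 GPUs with one expert each, and every GPU routing 1000 tokens to expert 0 and 100 tokens to each of the others.

```python
def receive_load(traffic):
    """traffic[src][dst] = tokens GPU `src` sends to the expert on GPU `dst`.
    Returns tokens arriving over the network at each GPU (local tokens excluded)."""
    n = len(traffic)
    return [
        sum(traffic[src][dst] for src in range(n) if src != dst)
        for dst in range(n)
    ]


n_gpus = 8
per_source = [1000] + [100] * (n_gpus - 1)        # expert 0 is the hot expert
traffic = [list(per_source) for _ in range(n_gpus)]

load = receive_load(traffic)
hot_ratio = load[0] / load[1]
```

GPU 0's NIC has to absorb ten times the inbound traffic of its peers, and all seven senders target it during the same AllToAll.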

What you'll see on the fabric:

Highly variable link utilization — some paths carry 5× more than others within a single AllToAll
DCQCN rate reduction on hot paths, leaving other paths underutilized
ECMP imbalance is much worse for AllToAll than for ring AllReduce

How NCCL maps collectives to the fabric

NCCL doesn't just call one collective at a time. It pipelines operations and uses multiple communication "channels" in parallel:

# See NCCL's channel configuration:
NCCL_DEBUG=INFO python train.py 2>&1 | grep "Channel"

# Example output:
# NCCL INFO Channel 00/08 : 0 1 2 3 4 5 6 7    ← GPU ordering in ring 0
# NCCL INFO Channel 01/08 : 0 2 4 6 1 3 5 7    ← GPU ordering in ring 1 (interleaved)
# NCCL INFO Channel 02/08 : 0 4 1 5 2 6 3 7    ← ring 2
# ...
# NCCL INFO Channel 07/08 : ...                 ← ring 7 (8 total)

With 8 rings and 8 NICs per node, NCCL can map one ring to each NIC. Each NIC then carries independent AllReduce traffic for a different chunk of the gradient, which is what keeps all 8 NICs on a node busy at the same time.
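To read ring orderings out of a log, something like the following sketch may help. It assumes the `Channel NN/MM : r0 r1 ...` line format shown in the example output above; real logs carry host and process prefixes and can differ between NCCL versions, so the regular expression is an assumption to adapt.

```python
import re

# Matches "Channel 01/08 : 0 2 4 6 1 3 5 7" anywhere in a line.
CHANNEL_RE = re.compile(r"Channel\s+(\d+)/(\d+)\s*:\s*((?:\d+\s*)+)")


def parse_channels(log_text):
    """Return {channel_id: [ranks in ring order]} for every Channel line found."""
    rings = {}
    for line in log_text.splitlines():
        match = CHANNEL_RE.search(line)
        if match:
            rings[int(match.group(1))] = [int(x) for x in match.group(3).split()]
    return rings


def next_hop(ring, rank):
    """The rank that `rank` sends to in this ring (wraps around at the end)."""
    return ring[(ring.index(rank) + 1) % len(ring)]


SAMPLE = """\
NCCL INFO Channel 00/08 : 0 1 2 3 4 5 6 7
NCCL INFO Channel 01/08 : 0 2 4 6 1 3 5 7
NCCL INFO Channel 02/08 : 0 4 1 5 2 6 3 7
"""
rings = parse_channels(SAMPLE)
```

Comparing `next_hop` across channels shows which peer each rank sends to in each ring, which is the starting point for mapping flows onto physical links.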

The QP count implication:

Each NCCL channel establishes Queue Pairs with every peer:

QPs per node = channels × 2 (send + recv) × (N_nodes - 1) × QPs_per_connection

8 channels × 2 × 63 nodes × 2 QPs_per_connection = 2,016 QPs per node

At 512 GPUs (64 nodes): 2,016 × 64 = 129,024 QPs across the cluster

Each QP is an RDMA connection with its own:

Send Queue (SQ) and Receive Queue (RQ) in the RNIC
Memory registration table entry
PSN (sequence number) tracking

At 512 GPUs, initialization takes 20–120 seconds because creating and transitioning 129,024 QPs through INIT → RTR → RTS is sequential per-rank.
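The QP arithmetic above is easy to rerun for other cluster sizes. A minimal sketch using the same formula; the function names are illustrative and `qps_per_connection=2` simply mirrors the example.

```python
def qps_per_node(channels, n_nodes, qps_per_connection=2):
    """channels x 2 (send + recv) x (N_nodes - 1) x QPs_per_connection."""
    return channels * 2 * (n_nodes - 1) * qps_per_connection


def qps_in_cluster(channels, n_nodes, qps_per_connection=2):
    return qps_per_node(channels, n_nodes, qps_per_connection) * n_nodes


per_node = qps_per_node(channels=8, n_nodes=64)     # 512 GPUs at 8 per node
cluster = qps_in_cluster(channels=8, n_nodes=64)
```

Because the per-node count scales with the number of peers, doubling the node count roughly quadruples the cluster-wide total.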

Collective timing at scale

How long does an AllReduce take? The simple bandwidth formula:

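time = 2 × (N-1)/N × tensor_size / bandwidth_per_node
     ≈ 2 × tensor_size / bandwidth_per_node  (at large N)

Example: 7B model (28GB of gradients at 4 bytes per parameter), 8×400G NICs per node, 512 GPUs
  bandwidth_per_node = 8 × 400 Gb/s = 8 × 50 GB/s = 400 GB/s
  time = 2 × 28GB / 400 GB/s
       = 56GB / 400 GB/s
       = 140ms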

But actual measured: 180–250ms

Gap = ECMP imbalance + PFC pause events + DCQCN rate reduction
    + QP setup overhead + NCCL synchronization overhead

That gap — the 40–110ms between theory and practice — is the network engineer's job.
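The same estimate as a small calculator, so the ideal time can be compared against a measured one. This is a sketch of the bandwidth-only formula above; `allreduce_time_s` is an illustrative name and the 180 ms and 250 ms inputs are the measured range quoted in the text.

```python
def allreduce_time_s(n_ranks, tensor_bytes, node_bw_bytes_per_s):
    """Bandwidth-only ring AllReduce estimate: 2 x (N-1)/N x size / bandwidth."""
    return 2 * (n_ranks - 1) / n_ranks * tensor_bytes / node_bw_bytes_per_s


node_bw = 8 * 50e9                       # 8 NICs x 400 Gb/s = 8 x 50 GB/s
ideal_s = allreduce_time_s(512, 28e9, node_bw)

for measured_s in (0.180, 0.250):
    gap_ms = (measured_s - ideal_s) * 1e3
    efficiency = ideal_s / measured_s
    print(f"measured {measured_s * 1e3:.0f} ms: gap {gap_ms:.0f} ms, "
          f"{efficiency:.0%} of ideal bandwidth")
```

Tracking the efficiency figure over time is one simple way to notice when fabric changes make the gap better or worse.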

What you should be able to do now

Explain what AllReduce does and why it's nearly optimal bandwidth-wise
Describe why AllToAll is harder for the fabric than AllReduce
Interpret NCCL channel configuration output
Calculate expected AllReduce time and compare it to measured time
Explain why QP count at scale causes slow NCCL initialization

Where it breaks

AllReduce stall — one rank never arriving

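One GPU process crashes or hangs during the backward pass. It never calls into the AllReduce. The other N-1 GPUs enter the AllReduce and wait. They wait forever, or until the collective timeout fires.

# Tune the collective timeout (seconds):
export NCCL_TIMEOUT=1800   # 30 minutes (default varies by framework)
# Note: confirm that your framework honors this variable. PyTorch takes the
# timeout as the `timeout` argument of torch.distributed.init_process_group.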

# Watch for:
# [W ProcessGroupNCCL.cpp] Timeout at NCCL work ...
# This means one rank died. Check the node for OOM, GPU error, hardware failure.

AllToAll imbalance tanking utilization

MoE training at 256+ GPUs. Some experts are much more popular than others ("hot experts"). The GPUs holding those experts receive 5–10× normal AllToAll traffic. Their ToR ports hit 90% utilization. DCQCN reduces all senders' rates. The other GPUs, sending to non-hot experts, now also send slowly — even though their paths aren't congested.

# Check per-spine utilization:
# (on your switch CLI — Arista example)
show interfaces ethernet 1/1 counters rates
# If one spine is at 90% and others at 30%, you have hot-expert imbalance.

# Mitigation: expert load balancing (auxiliary load balancing loss term)
# or capacity-based token routing in the model code — this is an ML problem,
# but you need to recognize it from the network side.
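Once per-spine utilization has been collected, a quick imbalance check can flag the pattern. The sketch below is illustrative only: the spine names, the utilization values (taken from the 90% versus 30% example above), and the 1.5 threshold are all assumptions.

```python
def imbalance_ratio(utilization):
    """Hottest link's utilization divided by the mean across all links."""
    values = list(utilization.values())
    return max(values) / (sum(values) / len(values))


def hottest(utilization):
    """Name of the most heavily loaded link."""
    return max(utilization, key=utilization.get)


# Fractions of line rate, e.g. read from switch counters (hypothetical values).
spines = {"spine1": 0.90, "spine2": 0.30, "spine3": 0.30, "spine4": 0.30}

ratio = imbalance_ratio(spines)
THRESHOLD = 1.5  # assumed cutoff; tune for your fabric
if ratio > THRESHOLD:
    print(f"{hottest(spines)} is {ratio:.1f}x the mean: possible hot-expert imbalance")
```

A ratio near 1.0 means traffic is evenly spread; a persistent high ratio during MoE layers points back at token routing rather than at the fabric.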

What's next

You now know what the collectives are and what they produce on the wire. The next question is: how does parallelism strategy change which collectives run, how often, and on which paths?

Next → Parallelism Strategies
