The Collective That Runs Every Step

Roughly 90% of the training traffic on an AI fabric is AllReduce. Understand this one and the rest are variations.

After this page, you'll be able to

Explain ring AllReduce — the two phases (Scatter-Reduce + AllGather, each N-1 steps), why each GPU sends only 2 × (N-1)/N × tensor_size (~56 GB at N=512), and why "send all to GPU 0" creates 14 TB of incast.
Describe what it looks like on the wire — ~512 simultaneous RDMA WRITE flows, ~55 MB per ring step, microburst shape, and why ECMP hash distribution and DCQCN tuning carry a dollar value.
Read NCCL's parallel rings — 8–16 channels from NCCL_DEBUG=INFO ... grep Channel, ~2,000 QPs per node, and why init takes 20–120 s.
Rank the other collectives — AllGather / ReduceScatter / Broadcast, and why AllToAll (MoE: Mixtral, DeepSeek) is your hardest problem and the 40–110 ms theory-vs-measured gap is your job to close.

What AllReduce actually does

Every GPU starts with its own gradient tensor (different values, computed from its own training data slice).

After AllReduce, every GPU holds the same tensor: the sum (or average) across all GPUs.

No GPU is "the server." No coordinator. The operation is fully distributed — by design, because anything else creates a bottleneck.

The naive version is unworkable

Try "send everything to GPU 0 and broadcast back" at 512 GPUs with 28 GB gradients:

GPU 0 receives 28 GB from every other GPU → 14 TB of incast on one NIC.
GPU 0 sends 28 GB back to every other GPU → another 14 TB outbound.

No NIC in existence can do that. NCCL doesn't try.
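The incast figure above is plain arithmetic, and a short sketch makes it easy to rerun for other cluster sizes. The function name and the 50 GB/s line rate (one 400 Gbps NIC, matching the NICs mentioned later on this page) are assumptions for illustration.

```python
GB = 1e9
TB = 1e12


def naive_incast_bytes(num_gpus: int, tensor_bytes: float) -> float:
    """Bytes GPU 0 must receive when every other GPU sends its full tensor."""
    return (num_gpus - 1) * tensor_bytes


# Numbers from the text: 512 GPUs, 28 GB of gradients per GPU.
incast = naive_incast_bytes(512, 28 * GB)

# Assumption: GPU 0 sits behind a single 400 Gbps NIC, i.e. 50 GB/s.
nic_bytes_per_s = 50 * GB
receive_seconds = incast / nic_bytes_per_s

print(f"incast on GPU 0: {incast / TB:.1f} TB")
print(f"receive time at line rate: {receive_seconds:.0f} s")
```

Even with a perfect fabric, the receive half alone would take minutes per training step under these assumptions, which is why the centralized pattern is ruled out.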

Ring AllReduce — what NCCL actually does

Arrange N GPUs in a logical ring: GPU0 → GPU1 → GPU2 → ... → GPU(N-1) → GPU0. Run two phases:

Phase 1: Scatter-Reduce (N-1 steps). The tensor is split into N chunks. In each step, every GPU sends one chunk to the next GPU in the ring and receives one chunk from the previous GPU, then sums the received chunk into its local copy. After N-1 steps, every GPU holds one complete reduced chunk: the sum of that chunk across all GPUs.

Phase 2: AllGather (N-1 steps). Each GPU forwards its complete chunk to the next GPU in the ring, which keeps a copy and passes it on in the following step. After N-1 steps, every GPU holds the full reduced tensor.

Watch it run. Each cell is one gradient chunk; its shade is how many GPUs' data is summed in. When every cell reads Σ, the AllReduce is done:

[Interactive animation, frame 1 of 7: every GPU starts with only its own data. Each of the 4 GPUs computed gradients from a different data slice, so every chunk holds just 1 contribution and nothing is shared yet. Legend, contributors summed in: 1, 2, 3, all (Σ).]
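The two phases can also be traced in a small pure-Python simulation. This is a sketch of the ring schedule only, not NCCL's implementation: the function name, the chunk indexing convention, and the use of Python lists in place of GPU buffers are all assumptions made for illustration.

```python
def ring_allreduce(tensors):
    """Simulate a sum AllReduce over a logical ring.

    tensors: one equal-length list of numbers per simulated GPU.
    Returns the per-GPU buffers after both phases.
    """
    n = len(tensors)
    size = len(tensors[0])
    if size % n != 0:
        raise ValueError("sketch assumes the tensor splits evenly into N chunks")
    chunk = size // n
    data = [list(t) for t in tensors]  # work on copies

    def span(c):
        return range(c * chunk, (c + 1) * chunk)

    def snapshot(chunk_of):
        # Capture every GPU's outgoing chunk first, so all sends in a
        # step behave as if they happened simultaneously.
        return [(chunk_of(g), [data[g][i] for i in span(chunk_of(g))])
                for g in range(n)]

    # Phase 1: Scatter-Reduce, N-1 steps. Received chunks are summed in.
    for step in range(n - 1):
        sends = snapshot(lambda g: (g - step) % n)
        for g, (c, payload) in enumerate(sends):
            dst = (g + 1) % n
            for off, i in enumerate(span(c)):
                data[dst][i] += payload[off]
    # Now GPU g holds the fully reduced chunk (g + 1) % n.

    # Phase 2: AllGather, N-1 steps. Received chunks overwrite local ones.
    for step in range(n - 1):
        sends = snapshot(lambda g: (g + 1 - step) % n)
        for g, (c, payload) in enumerate(sends):
            dst = (g + 1) % n
            for off, i in enumerate(span(c)):
                data[dst][i] = payload[off]
    return data


# Four simulated GPUs, one element per chunk.
gpus = [
    [1, 2, 3, 4],
    [10, 20, 30, 40],
    [100, 200, 300, 400],
    [1000, 2000, 3000, 4000],
]
result = ring_allreduce(gpus)
print(result[0])
```

Every simulated GPU only ever talks to its ring neighbor, which is the stable adjacent-pair pattern the fabric sees.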

Why this is nearly bandwidth-optimal:

Data each GPU sends = 2 × (N-1)/N × tensor_size
                    ≈ 2 × tensor_size   (at large N)

At N=512, 28 GB gradient:
  Data per GPU = ~56 GB
  Every GPU sends at NIC line rate, simultaneously.
  No bottleneck. Bandwidth-optimal.
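To see how quickly the per-GPU volume approaches 2 × tensor_size, the formula above can be evaluated for a few ring sizes. The function name is hypothetical; the 28 GB tensor is the figure used in the text.

```python
def ring_bytes_sent_per_gpu(num_gpus: int, tensor_bytes: float) -> float:
    """Bytes each GPU sends in a ring AllReduce: 2 * (N-1)/N * tensor_size."""
    return 2 * (num_gpus - 1) / num_gpus * tensor_bytes


TENSOR_BYTES = 28e9  # 28 GB gradient, as in the text

for n in (2, 8, 64, 512):
    sent_gb = ring_bytes_sent_per_gpu(n, TENSOR_BYTES) / 1e9
    print(f"N={n:>3}: {sent_gb:.2f} GB sent per GPU")
```

The per-GPU volume is already within a few percent of its limit at N=64, so growing the ring adds steps without meaningfully adding bytes per GPU.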

This is what your fabric has to handle. N GPUs all sending at NIC line rate, simultaneously, in a stable adjacent-pair pattern. Every step. For weeks.

What it looks like on the wire

The moment AllReduce launches:

512 simultaneous RDMA WRITE flows (one per ring position)
Each flow ~55 MB per ring step
All flows start within microseconds of each other
Every NIC is both transmitting and receiving at line rate
Traffic shape is a microburst — peak fabric utilization, then quiet, repeat every few seconds
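The per-step flow size in the list above follows from splitting the tensor into N chunks. The sketch below reproduces that number and estimates how long one chunk occupies the wire. It assumes a single ring and one 400 Gbps (50 GB/s) NIC per flow, and it ignores latency, protocol overhead, and NCCL's multi-ring split; the function name is hypothetical.

```python
def ring_step_profile(num_gpus: int, tensor_bytes: float, nic_bytes_per_s: float):
    """Return (chunk_bytes, seconds_per_step, total_steps) for one ring."""
    chunk_bytes = tensor_bytes / num_gpus          # one chunk moves per step
    seconds_per_step = chunk_bytes / nic_bytes_per_s
    total_steps = 2 * (num_gpus - 1)               # Scatter-Reduce + AllGather
    return chunk_bytes, seconds_per_step, total_steps


chunk, step_s, steps = ring_step_profile(512, 28e9, 50e9)
print(f"chunk per step: {chunk / 1e6:.1f} MB")
print(f"wire time per step: {step_s * 1e3:.2f} ms")
print(f"steps per AllReduce: {steps}")
```

Under these assumptions each burst lasts about a millisecond, which is short enough that switch buffers and congestion control see it as a microburst rather than a steady flow.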

This is why your ECMP hash distribution matters. This is why hash polarization is catastrophic here. This is why DCQCN tuning has a dollar value.

NCCL builds many rings in parallel

NCCL doesn't build one ring — it builds 8 to 16, mapping each to a different NIC + GPU pair:

NCCL_DEBUG=INFO python train.py 2>&1 | grep "Channel"
# NCCL INFO Channel 00/08 : 0 1 2 3 4 5 6 7    ← ring 0
# NCCL INFO Channel 01/08 : 0 2 4 6 1 3 5 7    ← ring 1 (interleaved)
# NCCL INFO Channel 02/08 : 0 4 1 5 2 6 3 7
# ...

On AMD, this is identical — RCCL reuses every NCCL_* name, so NCCL_DEBUG=INFO ... | grep "Channel" prints the same ring layout. On Intel, the oneCCL equivalent is CCL_LOG_LEVEL=info (different namespace — see the callout below).

Each ring carries a different chunk of the gradient. Effect: each of the 8 NICs on a node runs an independent AllReduce in parallel — that's how 8 × 400 Gbps of host bandwidth actually gets used.
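When the debug output gets long, it can help to pull the ring orders into a data structure and sanity-check them. The sketch below parses only the "Channel NN/MM : rank rank ..." line shape shown above; the regex, function names, and sample text are assumptions for illustration, and real logs carry extra prefixes and other Channel lines that this pattern deliberately skips.

```python
import re

# Matches e.g. "NCCL INFO Channel 01/08 : 0 2 4 6 1 3 5 7" and nothing after it.
CHANNEL_RE = re.compile(r"Channel (\d+)/(\d+)\s*:\s*(\d+(?:\s+\d+)*)\s*$")


def parse_channels(log_text: str) -> dict:
    """Map channel index -> list of ranks in ring order."""
    rings = {}
    for line in log_text.splitlines():
        match = CHANNEL_RE.search(line)
        if match:
            rings[int(match.group(1))] = [int(r) for r in match.group(3).split()]
    return rings


def is_valid_ring(order: list) -> bool:
    """A ring must visit every rank exactly once."""
    return sorted(order) == list(range(len(order)))


# Sample lines copied from the example output in the text.
sample = """\
NCCL INFO Channel 00/08 : 0 1 2 3 4 5 6 7
NCCL INFO Channel 01/08 : 0 2 4 6 1 3 5 7
NCCL INFO Channel 02/08 : 0 4 1 5 2 6 3 7
"""

rings = parse_channels(sample)
for channel, order in sorted(rings.items()):
    print(channel, order, "ok" if is_valid_ring(order) else "BAD")
```

Comparing ring orders across channels is a quick way to confirm that the rings really are interleaved rather than all following the same path.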

For 512 GPUs across 64 nodes, NCCL creates roughly 2,000 QP connections per node to make this work — which is why initialization can take 20–120 seconds before training even starts.

The algorithm is library-independent

Ring AllReduce is a wire pattern, not an NVIDIA feature. Every major vendor ships a collective library that implements the same Scatter-Reduce + AllGather mechanics over the same RDMA WRITE flows. From the fabric's point of view they are indistinguishable — same N simultaneous flows, same microburst shape.

Library	Vendor	Targets	Env-var namespace
NCCL	NVIDIA	CUDA GPUs	`NCCL_*`
RCCL	AMD (ROCm Communication Collectives Library)	ROCm GPUs (MI250/MI300)	`NCCL_*` — reused verbatim
oneCCL	Intel (oneAPI Collective Communications Library)	Intel CPUs and Xe GPUs	`CCL_*`

The one fact worth memorizing:

RCCL is a drop-in NCCL replacement

RCCL is API-compatible with NCCL and reuses the NCCL_* environment variable names unchanged. NCCL_DEBUG=INFO, NCCL_IB_HCA, and NCCL_IB_GID_INDEX all work as-is on AMD hosts, so your RoCE tuning carries over without renaming anything. oneCCL is the exception: it uses the CCL_* namespace instead (CCL_LOG_LEVEL=info, CCL_WORKER_COUNT), so don't expect NCCL_* vars to take effect on an Intel stack.

So when you see a worked example below using NCCL_DEBUG=INFO, read it as "the collective library's debug flag" — it's the same command on AMD, and the CCL_* equivalent on Intel.
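If you script job launches across mixed hardware, the namespace difference can be captured in a small lookup. This is a sketch: the helper name and dictionary are hypothetical, and only the variables named in the text (NCCL_DEBUG and CCL_LOG_LEVEL) are used.

```python
import os

# Debug settings per library, using only the variables named in the text.
DEBUG_ENV = {
    "nccl": {"NCCL_DEBUG": "INFO"},
    "rccl": {"NCCL_DEBUG": "INFO"},    # RCCL reuses the NCCL_* names
    "oneccl": {"CCL_LOG_LEVEL": "info"},
}


def debug_env(library: str, base_env=None) -> dict:
    """Return a copy of the environment with the library's debug flag set."""
    env = dict(os.environ if base_env is None else base_env)
    env.update(DEBUG_ENV[library.lower()])
    return env


# Example use (not executed here):
#   subprocess.run(["python", "train.py"], env=debug_env("rccl"))
print(debug_env("oneccl", base_env={}))
```

Keeping the mapping in one place means the rest of a launch script does not need to know which vendor stack it is running on.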

The other collectives, briefly

Four more you'll meet in the wild. All variations on the same theme.

AllGather — every GPU starts with a shard, ends with the full tensor. Used by ZeRO (memory-efficient training) and tensor parallelism. Same ring mechanism as AllReduce phase 2.

ReduceScatter — every GPU starts with the full tensor, ends with one reduced shard. ZeRO-2 uses this for gradient distribution. Same as AllReduce phase 1.

Broadcast — rank 0 has a tensor, every other rank ends with rank 0's copy. Used for parameter initialization at job start. Tiny traffic, once.

AllToAll — every GPU has N chunks (one per peer) and sends each chunk to its target peer. This is the hardest one for your fabric. No ring structure. Every GPU sends to every other, all at once, with traffic volumes that depend on token routing — i.e., unpredictable.

AllToAll = the MoE problem

If the customer is training a Mixture-of-Experts model — Mixtral, GPT-4 (rumored), DeepSeek — AllToAll is the dominant collective. And it's painful:

Traffic per link depends on token routing — some experts get hot, others stay cold
ECMP can't predict which spine carries which expert's traffic
DCQCN rate reduction on hot paths leaves other paths underutilized
Adaptive routing helps but cannot eliminate the asymmetry

When one spine port is hammered at 90% while three others sit at 30%, and the training is MoE, it's almost certainly hot-expert imbalance. That's a model-side problem — but you'll be the first to spot it.
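A toy model shows how routing skew turns into uneven receive volume. Everything here is a simplifying assumption rather than a description of any real model: one expert per GPU, top-1 routing, identical routing fractions on every source GPU, and made-up token counts and sizes.

```python
def alltoall_receive_bytes(num_gpus, tokens_per_gpu, bytes_per_token, expert_share):
    """Bytes each expert-hosting GPU receives over the fabric in one dispatch.

    expert_share[i] is the fraction of tokens routed to the expert on GPU i.
    Tokens that stay on their own GPU never cross the fabric, so each GPU
    receives from the other num_gpus - 1 sources.
    """
    if len(expert_share) != num_gpus:
        raise ValueError("sketch assumes exactly one expert per GPU")
    if abs(sum(expert_share) - 1.0) > 1e-9:
        raise ValueError("expert_share must sum to 1")
    remote_sources = num_gpus - 1
    return [remote_sources * tokens_per_gpu * share * bytes_per_token
            for share in expert_share]


def imbalance(received):
    """Ratio of the hottest receiver to the mean receiver."""
    return max(received) / (sum(received) / len(received))


# Hypothetical skew: one hot expert takes 30% of tokens, the rest 10% each.
shares = [0.30] + [0.10] * 7
received = alltoall_receive_bytes(8, 4096, 8192, shares)
print(f"hottest / mean receive volume: {imbalance(received):.2f}x")
```

In this toy case the hot expert's links carry more than twice the average load, and no hashing scheme can spread bytes that all share a single destination.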

Theory vs reality

The bandwidth-optimal AllReduce time:

time ≈ 2 × tensor_size / aggregate_NIC_bandwidth_per_node

7B model, 8 × 400G NICs, 512 GPUs:
  time = 2 × 28 GB / (8 × 50 GB/s)
       = 140 ms

But measured: 180–250 ms

That 40–110 ms gap between theoretical and measured is your job. ECMP imbalance, PFC pause events, DCQCN rate adjustments, QP setup overhead — every one of them widens the gap.
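The same arithmetic can be scripted so a measured AllReduce time turns directly into a gap and an efficiency figure. The function names are hypothetical; the inputs are the numbers from the worked example above (28 GB, 8 NICs per node at 50 GB/s, 180 to 250 ms measured).

```python
def ideal_allreduce_seconds(tensor_bytes, nics_per_node, nic_bytes_per_s):
    """Bandwidth-only lower bound: 2 * tensor_size / aggregate node bandwidth."""
    return 2 * tensor_bytes / (nics_per_node * nic_bytes_per_s)


def fabric_gap(measured_ms, ideal_ms):
    """Return (gap in ms, efficiency as a fraction of the ideal)."""
    return measured_ms - ideal_ms, ideal_ms / measured_ms


ideal_ms = ideal_allreduce_seconds(28e9, 8, 50e9) * 1e3

for measured_ms in (180, 250):
    gap_ms, efficiency = fabric_gap(measured_ms, ideal_ms)
    print(f"measured {measured_ms} ms: gap {gap_ms:.0f} ms, "
          f"efficiency {efficiency:.0%}")
```

Tracking the efficiency figure over time gives a single number that shows whether a fabric change helped or hurt.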

Closing that gap pays for itself in days.

💡 What you should remember

#		Concept	Why it matters
1	🔄	AllReduce is 90% of training traffic	Two phases. Ring mechanics. Bandwidth-optimal at large N.
2	⚡	N simultaneous flows, all at line rate, every few seconds	That's the shape that defines your fabric.
3	💍	Multiple rings in parallel (one per NIC)	How 8 × 400 Gbps actually gets used.
4	⚠️	AllToAll = MoE = your hardest problem	Unpredictable per-link load, hot experts, ECMP can't help.
5	📉	The 40–110 ms gap between theory and measured AllReduce	What good fabric engineering recovers.

Next: Inside the Libraries: NCCL, UCX & SHARP, covering the tuning knobs on the workhorse library and the in-network-compute trick (SHARP) that runs AllReduce on the switch ASIC itself.
