Parallelism Strategies and Their Collectives

How a model is split across GPUs decides which collective runs — and therefore what your fabric has to carry.

After this page, you'll be able to:

Match parallelism to collective — data → AllReduce, pipeline → Send/Recv, MoE → AllToAll, and the rest.
Predict the traffic shape a job will produce from how it's parallelised.
Spot the hard case — why expert parallelism (MoE) is the worst thing your fabric can be handed.

Large-model training rarely uses one strategy — it stacks several at once. Each has its own communication pattern and its own set of collectives.
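A minimal sketch of how stacked strategies share one cluster, assuming the common layout in which the total GPU count factors into tensor-parallel, pipeline-parallel and data-parallel degrees. The function name and the example job are hypothetical, not taken from any particular framework.

```python
def data_parallel_degree(world_size: int, tensor_parallel: int, pipeline_parallel: int) -> int:
    """Return how many data-parallel replicas remain once TP and PP take their GPUs.

    Assumes world_size = data_parallel * tensor_parallel * pipeline_parallel.
    """
    model_parallel = tensor_parallel * pipeline_parallel
    if world_size % model_parallel != 0:
        raise ValueError("world_size must be divisible by tensor_parallel * pipeline_parallel")
    return world_size // model_parallel


# Hypothetical job: 64 GPUs, tensor parallelism of 8, pipeline parallelism of 2.
print(data_parallel_degree(64, 8, 2))  # 4
```

In this hypothetical job the same 64 GPUs run three communication patterns at once: AllReduce inside each tensor-parallel group of 8, Send/Recv between the 2 pipeline stages, and a gradient AllReduce across the 4 replicas.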

The strategies, and what they put on the wire

Parallelism type	What's distributed	Communication pattern
Data parallelism	Training data — each GPU holds the full model	AllReduce gradients after each step
Tensor parallelism	Weight matrices split across GPUs	AllReduce + AllGather within each layer
Pipeline parallelism	Model layers split across GPUs	Point-to-point Send/Recv at stage boundaries
Expert parallelism (MoE)	Sparse expert routing	AllToAll — each token routed to its expert GPU
Sequence parallelism	Sequence length split across GPUs	AllGather for attention, ReduceScatter for output
ZeRO (DeepSpeed)	Optimiser states + gradients sharded (stage 3 also shards parameters)	AllGather of parameters (in the forward pass at stage 3), ReduceScatter of gradients in backward

Read this table the other way and it becomes a diagnostic tool: if AllToAll dominates a trace, the job is most likely MoE; if nearly all the traffic is AllReduce, it is most likely plain data parallelism.
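Reading the table in reverse can be sketched as a simple lookup. This is an illustrative heuristic only: the input format (share of communication time per collective) and the names `guess_parallelism` and `HINTS` are assumptions, and real traces usually show several collectives at once because strategies are stacked.

```python
# Hypothetical mapping from the dominant collective back to the likely strategy,
# following the table above.
HINTS = {
    "AllToAll": "expert parallelism (MoE) likely",
    "AllReduce": "data parallelism likely (tensor parallelism also uses AllReduce)",
    "SendRecv": "pipeline parallelism likely",
    "AllGather": "tensor/sequence parallelism or ZeRO likely",
    "ReduceScatter": "sequence parallelism or ZeRO likely",
}


def guess_parallelism(collective_share: dict[str, float]) -> str:
    """Guess the dominant strategy from each collective's share of communication time."""
    if not collective_share:
        return "unknown"
    # Pick the collective with the largest share.
    dominant = max(collective_share, key=collective_share.get)
    return HINTS.get(dominant, "unknown")


# Hypothetical trace summary: AllToAll takes 70% of communication time.
print(guess_parallelism({"AllToAll": 0.7, "AllReduce": 0.3}))
```

The result is a first guess to confirm with the job owner, not a verdict.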

Why MoE is the one to fear

Expert parallelism sends each token to whichever GPU hosts the expert the router picks for it, so it leans on AllToAll: the collective with no ring structure, where every GPU sends to every other at once.

Traffic per link depends on token routing — some experts run hot, others stay cold.
ECMP hashes flows onto spines without regard to load, so it can't predict or steer which spine carries which expert's traffic.
Load balancing and adaptive routing help, but can't erase the asymmetry.
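The effect of hot experts on traffic can be illustrated with a toy simulation. It is a sketch under stated assumptions: one expert per GPU, each token routed to a single expert, and made-up routing weights. The function names are hypothetical and no real router works from a fixed weight list like this.

```python
import random
from collections import Counter


def expert_token_counts(num_tokens: int, weights: list[float], seed: int = 0) -> list[int]:
    """Sample which expert each token goes to and count tokens per expert."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    experts = range(len(weights))
    picks = rng.choices(experts, weights=weights, k=num_tokens)
    counts = Counter(picks)
    return [counts.get(e, 0) for e in experts]


def imbalance(counts: list[int]) -> float:
    """Ratio of the busiest expert's load to the mean load (1.0 means perfectly even)."""
    mean = sum(counts) / len(counts)
    return max(counts) / mean


# Hypothetical routing: 8 experts, evenly weighted versus one hot expert.
balanced = expert_token_counts(8000, [1] * 8)
skewed = expert_token_counts(8000, [8, 1, 1, 1, 1, 1, 1, 1])
print("balanced imbalance:", round(imbalance(balanced), 2))
print("skewed imbalance:", round(imbalance(skewed), 2))
```

Under the one-expert-per-GPU assumption, the token count per expert is proportional to the bytes arriving at that GPU during AllToAll, so a high imbalance ratio means one receiver's links run far hotter than the rest.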

When one spine port sits at 90% while three others idle at 30% on an MoE job, that's almost certainly hot-expert imbalance — a model-side problem you'll be first to see. The on-the-wire detail of AllToAll lives on The Collective That Runs Every Step.
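The port pattern described above (one spine port far busier than its peers) can be flagged with a small skew check. This is a sketch only: the input is assumed to be a list of per-port utilisation fractions you have already collected, and the 1.5 threshold is an arbitrary illustrative value, not a recommended setting.

```python
def utilisation_skew(port_utilisation: list[float]) -> float:
    """Ratio of the busiest port's utilisation to the mean across ports."""
    mean = sum(port_utilisation) / len(port_utilisation)
    return max(port_utilisation) / mean if mean else 0.0


def looks_imbalanced(port_utilisation: list[float], threshold: float = 1.5) -> bool:
    """Flag the port group when the busiest port exceeds the mean by the threshold factor.

    The default threshold is a hypothetical value chosen for illustration.
    """
    return utilisation_skew(port_utilisation) > threshold


# The example from the text: one port at 90%, three at 30%.
print(looks_imbalanced([0.9, 0.3, 0.3, 0.3]))
# An evenly loaded group for comparison.
print(looks_imbalanced([0.5, 0.5, 0.5, 0.5]))
```

A flag here only says the load is uneven. Tying it to hot experts still means checking that the job is MoE and that the busy port serves the GPUs hosting the overloaded experts.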

💡 What you should remember

#		Concept	Why it matters
1	🧮	The split determines the collective	Predict the traffic before the job runs
2	🔁	Data parallelism = AllReduce every step	The steady, symmetric baseline your fabric is tuned for
3	🔥	MoE = AllToAll = asymmetric, unpredictable load	Hot experts, ECMP blind spots — the fabric's hardest case

Next: When Training Slows → — MFU as the readout, the diagnosis ladder, and the reflex for "the fabric is slow" tickets. (For the lossless mechanics these collectives assume — PFC, ECN, DCQCN — see RoCE v2 and Switch QoS.)
