Parallelism Strategies and Their Collectives
How a model is split across GPUs decides which collective runs — and therefore what your fabric has to carry.
- Match parallelism to collective — data → AllReduce, pipeline → Send/Recv, MoE → AllToAll, and the rest.
- Predict the traffic shape a job will produce from how it's parallelised.
- Spot the hard case — why expert parallelism (MoE) is the worst thing your fabric can be handed.
Large-model training rarely uses one strategy — it stacks several at once. Each has its own communication pattern and its own set of collectives.
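To make "stacks several at once" concrete, here is a minimal sketch of how one flat GPU rank can belong to three communication groups at the same time. The rank ordering (tensor index varying fastest, then pipeline, then data) and the names `Coords`, `rank_to_coords` and `groups_for` are assumptions made for illustration, not any particular framework's layout or API.

```python
from typing import NamedTuple


class Coords(NamedTuple):
    dp: int  # data-parallel replica index
    pp: int  # pipeline stage index
    tp: int  # tensor-parallel shard index


def rank_to_coords(rank: int, dp_size: int, pp_size: int, tp_size: int) -> Coords:
    """Decompose a flat rank. ASSUMPTION: tp varies fastest, then pp, then dp."""
    world = dp_size * pp_size * tp_size
    if not 0 <= rank < world:
        raise ValueError(f"rank {rank} outside world of size {world}")
    return Coords(
        dp=rank // (tp_size * pp_size),
        pp=(rank // tp_size) % pp_size,
        tp=rank % tp_size,
    )


def groups_for(rank: int, dp_size: int, pp_size: int, tp_size: int) -> dict[str, list[int]]:
    """List the ranks that share each communication group with `rank`."""
    me = rank_to_coords(rank, dp_size, pp_size, tp_size)
    world = dp_size * pp_size * tp_size
    coords = {r: rank_to_coords(r, dp_size, pp_size, tp_size) for r in range(world)}
    return {
        # Same replica and stage, different weight shard: AllReduce / AllGather.
        "tensor": [r for r, c in coords.items() if (c.dp, c.pp) == (me.dp, me.pp)],
        # Same replica and shard, different stage: Send/Recv between neighbours.
        "pipeline": [r for r, c in coords.items() if (c.dp, c.tp) == (me.dp, me.tp)],
        # Same stage and shard, different replica: gradient AllReduce.
        "data": [r for r, c in coords.items() if (c.pp, c.tp) == (me.pp, me.tp)],
    }
```

Under this assumed layout, with 8 GPUs split 2 x 2 x 2, `groups_for(5, 2, 2, 2)` puts rank 5 in tensor group `[4, 5]`, pipeline group `[5, 7]` and data group `[1, 5]`. One GPU, three overlapping traffic patterns.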
The strategies, and what they put on the wire
| Parallelism type | What's distributed | Communication pattern |
|---|---|---|
| Data parallelism | Training data — each GPU holds the full model | AllReduce gradients after each step |
| Tensor parallelism | Weight matrices split across GPUs | AllReduce + AllGather within each layer |
| Pipeline parallelism | Model layers split across GPUs | Point-to-point Send/Recv at stage boundaries |
| Expert parallelism (MoE) | Experts placed on different GPUs, with tokens routed sparsely to them | AllToAll: each token is sent to the GPU hosting its expert |
| Sequence parallelism | Sequence length split across GPUs | AllGather for attention, ReduceScatter for output |
| ZeRO (DeepSpeed) | Optimiser states and gradients sharded, plus parameters at stage 3 | AllGather of parameters in the forward pass, ReduceScatter of gradients in the backward pass |
Read this table the other way and it becomes a diagnostic tool: see AllToAll dominating a trace, and you know the job is MoE; see pure AllReduce, and it's plain data parallelism.
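Reading the table in reverse can be sketched as a small lookup. This is a rough heuristic under stated assumptions: the collective names used as dictionary keys, the function name `guess_parallelism`, and both thresholds are hypothetical choices for illustration, not the output format of any real profiler.

```python
def guess_parallelism(
    bytes_by_collective: dict[str, float],
    dominant: float = 0.5,  # ASSUMPTION: "dominating" means at least half the bytes
    pure: float = 0.99,     # ASSUMPTION: "pure" means essentially all the bytes
) -> str:
    """Guess the parallelism strategy from bytes moved per collective type."""
    total = sum(bytes_by_collective.values())
    if total <= 0:
        return "no collective traffic"

    def share(name: str) -> float:
        return bytes_by_collective.get(name, 0.0) / total

    if share("AllToAll") >= dominant:
        return "expert parallelism (MoE)"
    if share("AllReduce") >= pure:
        return "data parallelism"
    if share("SendRecv") >= dominant:
        return "pipeline parallelism"
    if share("AllGather") + share("ReduceScatter") >= dominant:
        return "ZeRO or sequence parallelism"
    return "mixed: several strategies stacked"
```

For example, `guess_parallelism({"AllToAll": 70, "AllReduce": 30})` returns `"expert parallelism (MoE)"`, while a trace containing only AllReduce returns `"data parallelism"`. Real jobs stack strategies, so treat the answer as a first hint rather than a verdict.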
Why MoE is the one to fear
Expert parallelism routes each token to whichever GPU hosts the expert selected for it, so it leans on AllToAll: the collective with no ring structure, where every GPU sends to every other GPU at once.
- Traffic per link depends on token routing — some experts run hot, others stay cold.
- ECMP hashes flows onto spines without regard to load, so you can't predict which spine carries which expert's traffic.
- Load balancing and adaptive routing help, but can't erase the asymmetry.
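The asymmetry in the points above can be illustrated with a toy model of the dispatch AllToAll. This is a simplified sketch, not how any MoE framework accounts for traffic: it assumes top-1 routing (one expert per token), a fixed payload per token, and it counts tokens that stay on their own GPU as received bytes too. The function names are hypothetical.

```python
from collections import Counter


def alltoall_receive_bytes(
    token_expert_ids: list[int],
    experts_per_gpu: int,
    bytes_per_token: int,
) -> dict[int, int]:
    """Bytes each GPU receives, given the expert chosen for every token."""
    received: Counter[int] = Counter()
    for expert in token_expert_ids:
        # ASSUMPTION: experts are placed contiguously, experts_per_gpu per GPU.
        received[expert // experts_per_gpu] += bytes_per_token
    return dict(received)


def imbalance(received: dict[int, int], num_gpus: int) -> float:
    """Hottest GPU's load divided by the mean load; 1.0 is perfectly balanced."""
    total = sum(received.values())
    if total == 0:
        return 1.0
    return max(received.values()) / (total / num_gpus)


# Eight tokens, four GPUs, one expert per GPU: expert 0 runs hot.
skewed = alltoall_receive_bytes([0, 0, 0, 0, 0, 1, 2, 3], 1, 1024)
balanced = alltoall_receive_bytes([0, 1, 2, 3, 0, 1, 2, 3], 1, 1024)
print(imbalance(skewed, 4))    # 2.5
print(imbalance(balanced, 4))  # 1.0
```

In the skewed case the GPU hosting expert 0 receives 2.5 times the mean load, and the links towards it carry that excess no matter how the fabric spreads the flows.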
When one spine port sits at 90% while three others sit at 30% on an MoE job, that's almost certainly hot-expert imbalance, a model-side problem you'll be the first to see. The on-the-wire detail of AllToAll is covered in The Collective That Runs Every Step.
💡 What you should remember
| # | | Concept | Why it matters |
|---|---|---|---|
| 1 | 🧮 | The split determines the collective | Predict the traffic before the job runs |
| 2 | 🔁 | Data parallelism = AllReduce every step | The steady, symmetric baseline your fabric is tuned for |
| 3 | 🔥 | MoE = AllToAll = asymmetric, unpredictable load | Hot experts, ECMP blind spots — the fabric's hardest case |
Next: When Training Slows → — MFU as the readout, the diagnosis ladder, and the reflex for "the fabric is slow" tickets. (For the lossless mechanics these collectives assume — PFC, ECN, DCQCN — see RoCE v2 and Switch QoS.)