Parallelism Strategies and Their Collectives

How a model is split across GPUs decides which collective runs — and therefore what your fabric has to carry.

After this page, you'll be able to
  1. Match parallelism to collective — data → AllReduce, pipeline → Send/Recv, MoE → AllToAll, and the rest.
  2. Predict the traffic shape a job will produce from how it's parallelised.
  3. Spot the hard case — why expert parallelism (MoE) is the worst thing your fabric can be handed.

Large-model training rarely uses one strategy — it stacks several at once. Each has its own communication pattern and its own set of collectives.
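To make "stacks several at once" concrete, here is a minimal sketch of how one set of ranks can be carved into data-, tensor-, and pipeline-parallel groups at the same time. The function name `rank_groups` and the rank ordering (tensor-parallel index varies fastest, then data, then pipeline) are assumptions for illustration; real frameworks choose their own layouts.

```python
from itertools import product

def rank_groups(dp: int, tp: int, pp: int) -> dict[str, list[list[int]]]:
    """Enumerate the communication groups of a dp x tp x pp job.

    Assumed (hypothetical) layout: tensor index varies fastest,
    then data, then pipeline.
    """
    def rank(p: int, d: int, t: int) -> int:
        return (p * dp + d) * tp + t

    # Ranks that share a pipeline stage and data replica split weight matrices.
    tensor = [[rank(p, d, t) for t in range(tp)]
              for p, d in product(range(pp), range(dp))]
    # Ranks that hold the same model shard on different data exchange gradients.
    data = [[rank(p, d, t) for d in range(dp)]
            for p, t in product(range(pp), range(tp))]
    # Ranks that hold successive stages of the same replica pass activations.
    pipeline = [[rank(p, d, t) for p in range(pp)]
                for d, t in product(range(dp), range(tp))]
    return {"tensor": tensor, "data": data, "pipeline": pipeline}

groups = rank_groups(dp=2, tp=2, pp=2)
for kind, members in groups.items():
    print(kind, members)
```

Every rank sits in exactly one group of each kind, so a single GPU takes part in three different communication patterns within one training step.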


The strategies, and what they put on the wire

| Parallelism type | What's distributed | Communication pattern |
| --- | --- | --- |
| Data parallelism | Training data (each GPU holds the full model) | AllReduce gradients after each step |
| Tensor parallelism | Weight matrices split across GPUs | AllReduce + AllGather within each layer |
| Pipeline parallelism | Model layers split across GPUs | Point-to-point Send/Recv at stage boundaries |
| Expert parallelism (MoE) | Experts placed on different GPUs, with sparse per-token routing | AllToAll: each token is sent to the GPU hosting its expert |
| Sequence parallelism | Sequence length split across GPUs | AllGather for attention, ReduceScatter for output |
| ZeRO (DeepSpeed) | Optimiser states and gradients sharded (plus parameters at stage 3) | AllGather parameters (in the forward pass at stage 3), ReduceScatter gradients in backward |

Read this table the other way and it becomes a diagnostic tool: if AllToAll dominates a trace, the job is almost certainly MoE; if the trace is nearly pure AllReduce, it is most likely plain data parallelism.
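Reading the table in reverse can be sketched as a small lookup. The trace format (a list of `(collective_name, bytes)` tuples), the `HINTS` labels, and the 50% threshold are all assumptions made for this example; they are not the output of any particular profiler.

```python
from collections import Counter

# Hints derived from the table on this page; labels are illustrative only.
HINTS = {
    "AllToAll": "expert parallelism (MoE)",
    "AllReduce": "data parallelism (possibly with tensor parallelism)",
    "SendRecv": "pipeline parallelism",
    "AllGather": "ZeRO or sequence parallelism",
    "ReduceScatter": "ZeRO or sequence parallelism",
}

def guess_strategy(trace, threshold=0.5):
    """Guess the parallelism strategy from the dominant collective by bytes.

    trace: iterable of (collective_name, bytes) tuples (hypothetical format).
    threshold: minimum byte share before one collective counts as dominant.
    """
    totals = Counter()
    for name, nbytes in trace:
        totals[name] += nbytes
    total = sum(totals.values())
    if total == 0:
        return "no collectives in trace"
    name, nbytes = totals.most_common(1)[0]
    if nbytes / total < threshold:
        return "mixed (several strategies stacked)"
    return HINTS.get(name, "unknown")

print(guess_strategy([("AllToAll", 800), ("AllReduce", 200)]))
```

Treat the result as a first guess rather than a verdict: stacked strategies produce mixed traces, which is why the sketch falls back to "mixed" when no single collective dominates.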


Why MoE is the one to fear

Expert parallelism routes each token to the GPU hosting the expert chosen for it, so it leans on AllToAll: the collective with no ring structure, where every GPU sends to every other GPU at once.

  • Traffic per link depends on token routing — some experts run hot, others stay cold.
  • ECMP hashes flows onto spines without regard to load, so it can't predict or steer which spine carries which expert's traffic.
  • Load balancing and adaptive routing help, but can't erase the asymmetry.
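The points above can be illustrated with a toy model: build the AllToAll traffic matrix from a skewed routing decision, then hash each GPU pair onto a spine the way a load-unaware ECMP would. Everything here is an assumption for illustration: four GPUs with one expert each, 8192 bytes per token (a hidden size of 4096 in a 2-byte dtype), one flow per GPU pair, and a CRC32 standing in for the switch's real hash.

```python
import zlib

def alltoall_matrix(routing, num_gpus, bytes_per_token):
    """routing[src] lists the destination expert GPU for each token held by src."""
    matrix = [[0] * num_gpus for _ in range(num_gpus)]
    for src, dests in enumerate(routing):
        for dst in dests:
            if dst != src:  # tokens that stay local never touch the fabric
                matrix[src][dst] += bytes_per_token
    return matrix

def received_bytes(matrix):
    """Bytes arriving at each GPU, i.e. the column sums of the matrix."""
    return [sum(row[dst] for row in matrix) for dst in range(len(matrix))]

def ecmp_spine(src, dst, num_spines):
    # Stand-in for a switch hash: static, and blind to the bytes a flow carries.
    return zlib.crc32(f"{src}->{dst}".encode()) % num_spines

def spine_load(matrix, num_spines):
    """Total bytes each spine carries when every GPU pair is one hashed flow."""
    load = [0] * num_spines
    for src, row in enumerate(matrix):
        for dst, nbytes in enumerate(row):
            if nbytes:
                load[ecmp_spine(src, dst, num_spines)] += nbytes
    return load

# Hypothetical routing: every GPU sends 5 of its 8 tokens to the hot expert on GPU 0.
routing = [[0] * 5 + [1, 2, 3] for _ in range(4)]
matrix = alltoall_matrix(routing, num_gpus=4, bytes_per_token=8192)

print(received_bytes(matrix))  # [122880, 24576, 24576, 24576]
print(spine_load(matrix, num_spines=2))
```

The hot expert's GPU receives five times the bytes of its peers, and the hash places those heavy flows on spines with no knowledge of their size. Which spine ends up overloaded depends on the hash, which is exactly why it can't be predicted from the topology alone.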

When one spine port sits at 90% while three others idle at 30% on an MoE job, that's almost certainly hot-expert imbalance — a model-side problem you'll be first to see. The on-the-wire detail of AllToAll lives on The Collective That Runs Every Step.
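A quick way to turn that observation into a check is to compare the busiest port against the mean. This is a minimal sketch: the function name `imbalance_ratio` and the 1.5 alert threshold are assumptions, not values taken from any monitoring tool.

```python
def imbalance_ratio(utilisation):
    """Ratio of the busiest port's utilisation to the mean across ports."""
    mean = sum(utilisation) / len(utilisation)
    return max(utilisation) / mean

# The example from the text: one spine port at 90%, three at 30%.
ports = [0.90, 0.30, 0.30, 0.30]
ratio = imbalance_ratio(ports)

ALERT_THRESHOLD = 1.5  # hypothetical; tune against your own baseline
if ratio > ALERT_THRESHOLD:
    print(f"imbalance ratio {ratio:.2f}: check for hot experts on MoE jobs")
```

For the figures above the busiest port carries roughly twice the mean load, whereas a balanced AllReduce job would sit close to 1.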


💡 What you should remember

| # | Concept | Why it matters |
| --- | --- | --- |
| 1 | 🧮 The split determines the collective | Predict the traffic before the job runs |
| 2 | 🔁 Data parallelism = AllReduce every step | The steady, symmetric baseline your fabric is tuned for |
| 3 | 🔥 MoE = AllToAll = asymmetric, unpredictable load | Hot experts, ECMP blind spots: the fabric's hardest case |

Next: When Training Slows → MFU as the readout, the diagnosis ladder, and the reflex for "the fabric is slow" tickets. (For the lossless mechanics these collectives assume, namely PFC, ECN, and DCQCN, see RoCE v2 and Switch QoS.)