Skip to main content

Parallelism Strategies

:::info Bridge — what you already know When you design a fabric, you need to know the traffic matrix. What's talking to what? How often? How much? Parallelism strategy is the ML team's way of answering that question — and they often don't realize they're answering a network question. This page translates each parallelism type into a traffic matrix. :::


Why parallelism strategy matters to the network

A 70B model has 140 billion bytes of parameters in bf16. A single H100 SXM has 80GB of HBM. The model doesn't fit on one GPU. So it must be distributed across multiple GPUs — and there are multiple ways to do that.

Each way makes a different demand on the network:

StrategyWhat's distributedCollectiveFrequencyScopeNetwork domain
Data parallelThe data batchesAllReduce (gradients)Once per stepAll ranksCross-node Ethernet
Tensor parallelThe weight matricesAllGather + ReduceScatterOnce per transformer layerTP groupIntra-node NVLink (usually)
Pipeline parallelThe model layersPoint-to-point activationsOnce per microbatch per layerPP groupCross-node Ethernet
Expert parallelThe MoE expertsAllToAllTwice per MoE layerEP groupCross-node Ethernet

The real world combines all four — this is called 4D parallelism or 3D parallelism (without expert). When they're combined, the fabric has to carry all of these traffic patterns simultaneously, with different frequency and scope.


Data parallelism (DP)

What it is: Run the same full model on every GPU. Each GPU gets a different slice of the training data. After the backward pass, AllReduce the gradients so every GPU updates its model identically.

The traffic pattern:

Step 0:
All N GPUs: forward pass (no network)
All N GPUs: backward pass (no network)
All N GPUs simultaneously: AllReduce gradients ← the burst

Step 1: repeat

What the fabric sees:

  • One large synchronized burst at the end of every step
  • Every GPU sending to every other GPU simultaneously
  • The all-to-all synchronized incast — the worst case for switch buffers
  • PyTorch DDP pipelines this: starts AllReduce per gradient bucket as backward pass proceeds, not all at once at the end

Scale:

DP degree = total_GPUs / (TP_degree × PP_degree)
If TP=1, PP=1: DP = total_GPUs → all GPUs AllReduce together

512 GPUs, pure data parallel:
One AllReduce, all 512 ranks
Gradient size: 28GB (7B model, fp32)
Expected time: ~140ms at 8×400G NICs
Actual: 180–250ms with fabric overhead

NVLink vs Ethernet in DP: Pure data parallelism on intra-node GPUs uses NVLink (900 GB/s) not Ethernet. Inter-node data parallelism uses the RDMA fabric. NCCL handles both transparently — it picks the fastest path for each GPU pair.


Tensor parallelism (TP)

What it is: Slice a single weight matrix across multiple GPUs. Each GPU holds a column (or row) of the matrix. Each GPU computes its partial result, then the results are combined with an AllGather or ReduceScatter.

Example — transformer attention:

Full attention weight: [hidden_size × hidden_size] = [4096 × 4096]

With TP=4, each GPU holds: [4096 × 1024] (one column group)

GPU 0: input × W_0 = partial_output_0
GPU 1: input × W_1 = partial_output_1
GPU 2: input × W_2 = partial_output_2
GPU 3: input × W_3 = partial_output_3

AllGather: all 4 GPUs get [partial_output_0, partial_output_1, partial_output_2, partial_output_3]
→ full output reconstructed on all 4 GPUs

The traffic pattern:

For every transformer layer (there are N of them in the model):
AllGather (or ReduceScatter) within the TP group

A 70B model has 80 transformer layers.
TP=8: 80 × 2 collectives per step = 160 collective operations per step
Each collective: tensor_size / TP_degree = small (GBs, not tens of GBs)
Frequency: very high (one per layer, per forward and backward pass)

Critical network property of TP: TP almost always runs within a single node on NVLink, not on Ethernet.

Why:
NVLink bandwidth: 900 GB/s bidirectional
Ethernet bandwidth: 8 × 400G = 3.2 Tbps = 400 GB/s bidirectional
NVLink is 2.25× faster and has sub-microsecond latency

TP collective size: small (1/8 of a layer's activations)
TP latency-sensitivity: high (160 per step — adds up)

If TP crosses node boundaries: latency dominates, training slows significantly

The rule: TP degree should never exceed the number of GPUs per node unless you've exhausted other options. TP=8 fills one 8-GPU node. TP=16 requires crossing to a second node over Ethernet — avoid this.

When TP does cross nodes: At very large models (>500B parameters), tensor parallel degree must exceed 8. You'll see TP=16 or TP=32 on H100 NVL72 clusters where 72 GPUs are connected via NVSwitch. On standard 8-GPU-per-node clusters, TP>8 means Ethernet — expect significant slowdown.


Pipeline parallelism (PP)

What it is: Slice the model's layers across nodes. Node 0 runs layers 0–19. Node 1 runs layers 20–39. Etc. Data flows as a pipeline: node 0 processes a microbatch and sends the output ("activation") to node 1, which processes it and sends to node 2.

The traffic pattern:

Microbatch 0: Node0 → Node1 → Node2 → Node3 (forward)
Node3 → Node2 → Node1 → Node0 (backward, sending gradients)

Microbatch 1: Node0 begins while Node1 is still processing Microbatch 0
Microbatch 2: Node0 begins while Node1 is on Microbatch 1, Node2 on Microbatch 0

Steady state (pipeline full):
Node0: processing microbatch N
Node1: processing microbatch N-1
Node2: processing microbatch N-2
Node3: processing microbatch N-3

What the fabric sees:

  • Point-to-point flows between consecutive pipeline stages
  • Traffic between stage N and stage N+1 only — not all-to-all
  • Very predictable traffic matrix: you know exactly which nodes talk to which
  • Activation tensor size (what moves between stages) is much smaller than the full gradient
Activation size (what moves between pipeline stages):
= batch_size × sequence_length × hidden_size
= 2048 × 4096 × 4096 × 2 bytes (bf16)
= 68 GB per microbatch

Gradient sent backward (same size):
= 68 GB

At 400G NIC: 68GB / 50 GB/s = 1.36 seconds per microbatch
With multiple microbatches in flight: latency-pipelined

The pipeline bubble problem:

Ideal (fully pipelined):
F F F F F F F F B B B B B B B B (F=forward, B=backward)

Actual (with pipeline bubbles):
F _ _ _ F _ _ _ B _ _ _ B _ _ _ (_ = GPU idle, bubble)

Bubble fraction = (PP_degree - 1) / microbatches_per_batch

More microbatches per batch → smaller bubble → better GPU utilization → more traffic per step.

Network impact of pipeline parallelism:

PP is the kindest parallelism to the network — it doesn't produce all-to-all traffic. The traffic matrix is a chain: 0→1→2→3. ECMP doesn't need to distribute across many spines because the traffic isn't all-to-all. PFC/DCQCN issues with pipeline parallelism are typically caused by the simultaneous cross-node flows during the backward pass, not the forward pass.

Where PP sits in the RDMA fabric: Pipeline stages communicate over point-to-point RDMA WRITE or SEND operations. NCCL handles this transparently. The key NCCL env var: NCCL_P2P_LEVEL controls whether P2P goes via NVLink or the fabric.


Expert parallelism / Mixture of Experts (MoE)

What it is: Instead of one monolithic FFN (Feed-Forward Network) in each transformer layer, a MoE model has N expert FFNs. A routing function decides which 2 (or K) experts handle each token. Only those experts compute; the others are idle for that token.

Why ML engineers use MoE: More parameters without proportionally more compute. A 64-expert MoE with the same compute budget as a dense model can have 8–16× more parameters — meaning better model quality at the same training cost.

What this means for the network:

Dense FFN layer (no MoE):
Each GPU computes FFN for its tokens locally
No inter-GPU traffic for this layer

MoE FFN layer:
Step 1: Each GPU routes its tokens to the right expert GPUs
→ AllToAll: send tokens to expert GPUs
Step 2: Each GPU runs its assigned expert on received tokens
→ Local compute (no network)
Step 3: Send results back to origin GPUs
→ AllToAll (reverse direction)

Two AllToAll ops per MoE layer, per step.

The load imbalance problem:

Token routing is learned by the model — and it learns to favor some experts over others ("expert collapse"). When expert 0 is popular:

GPU 0 (expert 0): receives tokens from ALL other GPUs
Other GPUs: send large amounts to GPU 0, small amounts elsewhere

Result:
GPU 0's NIC: running at 90% RX utilization (incast)
Other spines: underloaded
ECMP can't help — the imbalance is in the destination, not the path
DCQCN reduces all senders' rates
Other AllToAll paths (to uncongested experts) also slow down

Detection:

# If you have per-flow telemetry (INT):
# Look for one GPU's RX NIC counter running at 10x median

# Without INT:
ethtool -S ethX | grep rx_bytes
# Compare across nodes — uniform distribution expected
# Hot expert manifests as one node with 5x rx_bytes of others

Mitigation: This is primarily fixed in the model (auxiliary loss terms that penalize expert imbalance). But the network engineer needs to know this is happening, because it manifests as:

  • Unexplained DCQCN rate reductions
  • PFC events on one specific node's ToR port
  • Highly variable AllToAll completion times

4D parallelism — combining all four

Large model training combines all four in what's called 4D parallelism (DP + TP + PP + EP):

Example: GPT-3 style 175B model on 2048 GPUs
TP = 8 (within each 8-GPU node, on NVLink)
PP = 8 (8 pipeline stages, each 16 nodes wide)
DP = 32 (32-way data parallel across DP groups)
EP = 8 (8 MoE experts per TP group)

Node layout:
128 nodes organized in 8 PP stages of 16 nodes each
Within each PP stage: 16 nodes × 8 GPUs = 128 GPUs
TP=8 runs on NVLink within each node
DP=32 AllReduces gradients within each PP-stage group
EP=8 AllToAll runs across the 8 PP stages

What the fabric sees simultaneously:

  1. DP AllReduce: within each PP stage, 128 GPUs AllReduce gradients — cross-node, all-to-all within the DP group
  2. PP point-to-point: activation tensors flowing between 8 stages — chain pattern
  3. EP AllToAll: tokens moving between expert groups — all-to-all, variable load
  4. TP AllGather/ReduceScatter: on NVLink within nodes — not on Ethernet

The fabric must absorb all of these simultaneously. This is why fabric design decisions (oversubscription, adaptive routing, DCQCN tuning) matter so much at this scale.


The traffic matrix by parallelism strategy

Use this when talking to ML teams about what fabric they need:

ParallelismTraffic matrixFan-outLoad predictabilityPFC riskECMP suitability
Data parallelAll-to-all (within DP group)N-1High (same every step)MediumGood (many small flows)
Tensor parallelAll-to-all (within TP group, intra-node)TP-1HighLow (NVLink)N/A (NVLink)
Pipeline parallelChain (stage i → stage i+1)1High (predictable)LowGood
Expert parallelAll-to-all (token routing)N-1Low (routing varies)HighPoor (hot experts)

What you should be able to do now

  • Given a model size and GPU count, describe which parallelism combination is likely in use
  • Map each parallelism type to its traffic pattern and identify which paths it uses
  • Explain why tensor parallelism should stay on NVLink
  • Describe what expert imbalance looks like in NIC counters
  • Tell an ML engineer what traffic matrix their parallelism choice creates

What's next

You've seen what the network has to carry and how. The last page in this section covers MFU in depth — the metric that tells you whether the network is succeeding or failing — and how to diagnose exactly where the bottleneck is.

Next → MFU and Diagnosis