Parallelism Strategies

:::info Bridge: what you already know
When you design a fabric, you need to know the traffic matrix. What's talking to what? How often? How much? Parallelism strategy is the ML team's way of answering that question, and they often don't realize they're answering a network question. This page translates each parallelism type into a traffic matrix.
:::

Why parallelism strategy matters to the network

A 70B model has 140 billion bytes of parameters in bf16. A single H100 SXM has 80GB of HBM. The model doesn't fit on one GPU. So it must be distributed across multiple GPUs — and there are multiple ways to do that.
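The arithmetic behind that claim can be sketched in a few lines. This is a minimal illustration that counts weights only; gradients, optimizer state, and activations need additional memory on top of this, so real deployments need more GPUs than the number shown. The function name is hypothetical.

```python
import math

def min_gpus_for_weights(num_params: float, bytes_per_param: int, hbm_bytes: float) -> int:
    """Smallest GPU count whose combined HBM can hold the weights alone."""
    return math.ceil(num_params * bytes_per_param / hbm_bytes)

# 70B parameters in bf16 (2 bytes each) against 80 GB of HBM per GPU.
weights_bytes = 70e9 * 2
print(f"weights: {weights_bytes / 1e9:.0f} GB")
print("minimum GPUs for weights only:", min_gpus_for_weights(70e9, 2, 80e9))
```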

Each way makes a different demand on the network:

Strategy	What's distributed	Collective	Frequency	Scope	Network domain
Data parallel	The data batches	AllReduce (gradients)	Once per step	All ranks	Cross-node Ethernet
Tensor parallel	The weight matrices	AllGather + ReduceScatter	Once per transformer layer	TP group	Intra-node NVLink (usually)
Pipeline parallel	The model layers	Point-to-point activations	Once per microbatch per stage boundary	PP group	Cross-node Ethernet
Expert parallel	The MoE experts	AllToAll	Twice per MoE layer	EP group	Cross-node Ethernet

Real deployments combine these strategies: DP + TP + PP is called 3D parallelism, and adding expert parallelism makes it 4D parallelism. When they're combined, the fabric has to carry all of these traffic patterns simultaneously, each with its own frequency and scope.

Data parallelism (DP)

What it is: Run the same full model on every GPU. Each GPU gets a different slice of the training data. After the backward pass, AllReduce the gradients so every GPU updates its model identically.

The traffic pattern:

Step 0:
  All N GPUs: forward pass (no network)
  All N GPUs: backward pass (no network)
  All N GPUs simultaneously: AllReduce gradients   ← the burst

Step 1: repeat

What the fabric sees:

One large synchronized burst at the end of every step
Every GPU sending to every other GPU simultaneously
The all-to-all synchronized incast — the worst case for switch buffers
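In practice, PyTorch DDP overlaps this with compute: it launches an AllReduce per gradient bucket as the backward pass proceeds, so the burst is spread across the backward pass instead of arriving all at once at the end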

Scale:

DP degree = total_GPUs / (TP_degree × PP_degree)
If TP=1, PP=1: DP = total_GPUs → all GPUs AllReduce together

512 GPUs, pure data parallel:
  One AllReduce, all 512 ranks
  Gradient size: 28GB (7B model, fp32)
  Expected time: ~140ms at 8×400G NICs per node (≈ 2 × 28 GB / 400 GB/s)
  Actual: 180–250ms with fabric overhead
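The numbers above can be reproduced with a bandwidth-only estimate. This is a sketch under stated assumptions: a ring AllReduce in which each participant moves 2 × (n − 1) / n of the buffer, gradients reduced inside each node over NVLink first, and each of the 64 nodes then acting as one participant with 8 × 400G = 400 GB/s of NIC bandwidth. Latency, congestion, and protocol overhead are ignored, which is why measured times are higher. The function names are hypothetical.

```python
def dp_degree(total_gpus: int, tp: int = 1, pp: int = 1) -> int:
    """DP degree = total_GPUs / (TP_degree x PP_degree)."""
    if total_gpus % (tp * pp) != 0:
        raise ValueError("total_gpus must be divisible by tp * pp")
    return total_gpus // (tp * pp)

def ring_allreduce_seconds(grad_bytes: float, participants: int,
                           bandwidth_bytes_per_s: float) -> float:
    """Bandwidth-only ring AllReduce estimate (no latency term)."""
    moved = 2 * (participants - 1) / participants * grad_bytes
    return moved / bandwidth_bytes_per_s

grad_bytes = 7e9 * 4            # 7B parameters, fp32 gradients = 28 GB
nodes = 512 // 8                # assumed 8 GPUs per node
estimate = ring_allreduce_seconds(grad_bytes, nodes, 400e9)
print("DP degree:", dp_degree(512))
print(f"estimated AllReduce time: {estimate * 1000:.0f} ms")
```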

NVLink vs Ethernet in DP: Pure data parallelism on intra-node GPUs uses NVLink (900 GB/s) not Ethernet. Inter-node data parallelism uses the RDMA fabric. NCCL handles both transparently — it picks the fastest path for each GPU pair.

Tensor parallelism (TP)

What it is: Slice a single weight matrix across multiple GPUs. Each GPU holds a block of columns (or rows) of the matrix. Each GPU computes its partial result, then the partial results are combined with an AllGather or ReduceScatter.

Example — transformer attention:

Full attention weight: [hidden_size × hidden_size] = [4096 × 4096]

With TP=4, each GPU holds: [4096 × 1024] (one column group)

GPU 0: input × W_0 = partial_output_0
GPU 1: input × W_1 = partial_output_1
GPU 2: input × W_2 = partial_output_2
GPU 3: input × W_3 = partial_output_3

AllGather: all 4 GPUs get [partial_output_0, partial_output_1, partial_output_2, partial_output_3]
→ full output reconstructed on all 4 GPUs
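The column split above can be checked on a toy example. The sketch below uses plain Python lists and a tiny hidden size instead of real GPUs; the "AllGather" is simulated by concatenating the per-shard outputs, and the result is compared with the unsharded product. It assumes the hidden size divides evenly by the TP degree.

```python
def matmul(x, w):
    """Multiply a row vector x (length k) by a matrix w (k rows, n columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def split_columns(w, tp):
    """Split w into tp equal column groups, one per simulated GPU."""
    width = len(w[0]) // tp
    return [[row[g * width:(g + 1) * width] for row in w] for g in range(tp)]

hidden, tp = 8, 4
x = [float(i + 1) for i in range(hidden)]
w = [[float((i * hidden + j) % 5) for j in range(hidden)] for i in range(hidden)]

# Each simulated GPU multiplies the full input by its own column group.
partials = [matmul(x, shard) for shard in split_columns(w, tp)]

# AllGather: every GPU ends up with the concatenation of all partial outputs.
gathered = [value for part in partials for value in part]

print("matches unsharded result:", gathered == matmul(x, w))
```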

The traffic pattern:

For every transformer layer (there are N of them in the model):
  AllGather (or ReduceScatter) within the TP group

A 70B model has 80 transformer layers.
TP=8: 80 layers × 2 passes (forward + backward) = 160 collective operations per step
Each collective: one layer's activations sharded by TP_degree = small compared with a full gradient AllReduce
Frequency: very high (one per layer, per forward and backward pass)

Critical network property of TP: TP almost always runs within a single node on NVLink, not on Ethernet.

Why:
  NVLink bandwidth: 900 GB/s bidirectional per GPU (450 GB/s each way)
  Ethernet bandwidth: 8 × 400G = 3.2 Tbps per node = 400 GB/s each way, or 50 GB/s per GPU
  NVLink is roughly 9× faster per GPU and has sub-microsecond latency

  TP collective size: small (1/8 of a layer's activations)
  TP latency-sensitivity: high (160 per step — adds up)

  If TP crosses node boundaries: latency dominates, training slows significantly
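A rough latency-plus-bandwidth model shows why a high collective count makes TP sensitive to the path it runs on. Everything numeric below is an illustrative assumption, not a measurement: the message size, both latencies, and the per-GPU bandwidths (450 GB/s for NVLink in one direction, 50 GB/s for one 400G NIC). Real collectives involve several message exchanges, so treat this as a sketch of the scaling, not a predictor.

```python
def collective_seconds(msg_bytes: float, latency_s: float,
                       bandwidth_bytes_per_s: float) -> float:
    """Simple cost model: fixed startup latency plus serialization time."""
    return latency_s + msg_bytes / bandwidth_bytes_per_s

def per_step_seconds(n_collectives: int, msg_bytes: float, latency_s: float,
                     bandwidth_bytes_per_s: float) -> float:
    """Total time spent in n identical collectives during one training step."""
    return n_collectives * collective_seconds(msg_bytes, latency_s, bandwidth_bytes_per_s)

# Hypothetical inputs, chosen only to illustrate the comparison.
MSG_BYTES = 64 * 2**20          # assumed 64 MiB per collective
nvlink = per_step_seconds(160, MSG_BYTES, 1e-6, 450e9)
fabric = per_step_seconds(160, MSG_BYTES, 10e-6, 50e9)

print(f"NVLink path: {nvlink * 1000:.1f} ms per step")
print(f"Fabric path: {fabric * 1000:.1f} ms per step")
```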

The rule: TP degree should never exceed the number of GPUs per node unless you've exhausted other options. TP=8 fills one 8-GPU node. TP=16 requires crossing to a second node over Ethernet — avoid this.

When TP does cross nodes: At very large models (>500B parameters), tensor parallel degree may need to exceed 8. You'll see TP=16 or TP=32 on GB200 NVL72 systems, where 72 GPUs share one NVLink/NVSwitch domain. On standard 8-GPU-per-node clusters, TP>8 means Ethernet, so expect a significant slowdown.

Pipeline parallelism (PP)

What it is: Slice the model's layers across nodes. Node 0 runs layers 0–19, node 1 runs layers 20–39, and so on. Data flows as a pipeline: node 0 processes a microbatch and sends the output (the "activation") to node 1, which processes it and sends its output to node 2.

The traffic pattern:

Microbatch 0: Node0 → Node1 → Node2 → Node3 (forward)
                      Node3 → Node2 → Node1 → Node0 (backward, sending gradients)

Microbatch 1: Node0 begins while Node1 is still processing Microbatch 0
Microbatch 2: Node0 begins while Node1 is on Microbatch 1, Node2 on Microbatch 0

Steady state (pipeline full):
  Node0: processing microbatch N
  Node1: processing microbatch N-1
  Node2: processing microbatch N-2
  Node3: processing microbatch N-3
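The fill and steady-state pattern above can be generated mechanically. This sketch models the forward pass only and assumes every stage takes exactly one time slot per microbatch; under that assumption, stage s works on microbatch t − s during slot t. The function name is hypothetical.

```python
def forward_schedule(num_stages: int, num_microbatches: int):
    """Return one row per time slot: the microbatch on each stage, or None if idle."""
    slots = []
    for t in range(num_stages + num_microbatches - 1):
        row = []
        for stage in range(num_stages):
            microbatch = t - stage
            row.append(microbatch if 0 <= microbatch < num_microbatches else None)
        slots.append(row)
    return slots

for t, row in enumerate(forward_schedule(num_stages=4, num_microbatches=8)):
    cells = ["--" if mb is None else f"{mb:2d}" for mb in row]
    print(f"slot {t:2d}: " + "  ".join(cells))
```

Rows containing `--` are the fill and drain phases; rows with a microbatch on every stage are the steady state.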

What the fabric sees:

Point-to-point flows between consecutive pipeline stages
Traffic between stage N and stage N+1 only — not all-to-all
Very predictable traffic matrix: you know exactly which nodes talk to which
Activation tensor size (what moves between stages) scales with microbatch size; for small microbatches it is much smaller than the full gradient

Activation size (what moves between pipeline stages):
  = microbatch_size × sequence_length × hidden_size × bytes_per_element
  = 2048 × 4096 × 4096 × 2 bytes (bf16)
  ≈ 68.7 GB for a 2048-sequence batch (an upper bound; a 1-sequence microbatch is ~34 MB)

Gradient sent backward (same size):
  ≈ 68.7 GB

At 400G NIC: 68.7 GB / 50 GB/s ≈ 1.37 seconds for the 2048-sequence case
With multiple microbatches in flight, each transfer overlaps with compute on other microbatches
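The same arithmetic as a small calculator, so other batch sizes and link speeds can be tried. This is a sketch that assumes the full link rate is available to the transfer and ignores protocol overhead; the function names are hypothetical.

```python
def activation_bytes(batch: int, seq_len: int, hidden: int, bytes_per_elem: int = 2) -> int:
    """Bytes in one activation tensor handed from a stage to the next."""
    return batch * seq_len * hidden * bytes_per_elem

def transfer_seconds(num_bytes: int, link_gbps: float) -> float:
    """Serialization time on one link; link_gbps is in gigabits per second."""
    return num_bytes / (link_gbps * 1e9 / 8)

large = activation_bytes(2048, 4096, 4096)   # the example in the text
small = activation_bytes(1, 4096, 4096)      # a single-sequence microbatch

print(f"2048 sequences: {large / 1e9:.1f} GB, {transfer_seconds(large, 400):.2f} s at 400G")
print(f"1 sequence:     {small / 1e6:.1f} MB, {transfer_seconds(small, 400) * 1e3:.2f} ms at 400G")
```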

The pipeline bubble problem:

Ideal (fully pipelined):
F F F F F F F F B B B B B B B B   (F=forward, B=backward)

Actual (with pipeline bubbles):
F _ _ _ F _ _ _ B _ _ _ B _ _ _   (_ = GPU idle, bubble)

Bubble overhead (idle time ÷ ideal compute time) = (PP_degree - 1) / microbatches_per_batch

More microbatches per batch → smaller bubble → better GPU utilization → more, smaller transfers per step.
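The effect of microbatch count on the bubble can be tabulated. The ratio (PP_degree − 1) / microbatches measures idle time relative to ideal compute time; the sketch also shows the same bubble as a share of the whole step, (p − 1) / (m + p − 1). Both assume equal time per stage and a schedule where each stage idles for p − 1 slots; the function names are hypothetical.

```python
def bubble_overhead(pp_degree: int, microbatches: int) -> float:
    """Idle time divided by ideal (bubble-free) compute time."""
    return (pp_degree - 1) / microbatches

def bubble_share_of_step(pp_degree: int, microbatches: int) -> float:
    """Idle time divided by total step time (compute plus bubble)."""
    return (pp_degree - 1) / (microbatches + pp_degree - 1)

for m in (8, 16, 32, 64):
    print(f"PP=8, microbatches={m:3d}: "
          f"overhead={bubble_overhead(8, m):.3f}, "
          f"share of step={bubble_share_of_step(8, m):.3f}")
```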

Network impact of pipeline parallelism:

PP is the kindest parallelism to the network — it doesn't produce all-to-all traffic. The traffic matrix is a chain: 0→1→2→3. ECMP doesn't need to distribute across many spines because the traffic isn't all-to-all. PFC/DCQCN issues with pipeline parallelism are typically caused by the simultaneous cross-node flows during the backward pass, not the forward pass.

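Where PP sits in the RDMA fabric: Pipeline stages communicate over point-to-point RDMA WRITE or SEND operations, which NCCL issues transparently. Two NCCL env vars are worth knowing: NCCL_P2P_LEVEL sets how far apart in the node topology two GPUs can be and still use direct GPU-to-GPU (peer-to-peer) transport, and NCCL_NET_GDR_LEVEL sets how far apart a NIC and a GPU can be and still use GPUDirect RDMA.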

Expert parallelism / Mixture of Experts (MoE)

What it is: Instead of one monolithic FFN (Feed-Forward Network) in each transformer layer, a MoE model has N expert FFNs. A routing function decides which 2 (or K) experts handle each token. Only those experts compute; the others are idle for that token.

Why ML engineers use MoE: More parameters without proportionally more compute. A 64-expert MoE with the same compute budget as a dense model can have 8–16× more parameters — meaning better model quality at the same training cost.

What this means for the network:

Dense FFN layer (no MoE):
  Each GPU computes FFN for its tokens locally
  No inter-GPU traffic for this layer

MoE FFN layer:
  Step 1: Each GPU routes its tokens to the right expert GPUs
          → AllToAll: send tokens to expert GPUs
  Step 2: Each GPU runs its assigned expert on received tokens
          → Local compute (no network)
  Step 3: Send results back to origin GPUs
          → AllToAll (reverse direction)

Two AllToAll ops per MoE layer, per step.
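The first AllToAll can be pictured as a send-count matrix: how many token copies each GPU ships to each other GPU. The sketch below builds that matrix from a routing decision. The routing itself is made-up input (top-2, with expert 0 chosen for every token), and the layout of one expert per GPU is an assumption for illustration. The return AllToAll is the transpose of the same matrix.

```python
def alltoall_send_counts(token_owner, token_experts, expert_to_gpu, num_gpus):
    """counts[src][dst] = number of token copies GPU src sends to GPU dst."""
    counts = [[0] * num_gpus for _ in range(num_gpus)]
    for src, experts in zip(token_owner, token_experts):
        for expert in experts:
            counts[src][expert_to_gpu[expert]] += 1
    return counts

# Hypothetical routing: 5 tokens, top-2 experts each, one expert per GPU.
token_owner = [0, 0, 1, 2, 3]                       # GPU holding each token
token_experts = [(0, 1), (0, 2), (0, 3), (0, 1), (0, 2)]
expert_to_gpu = {0: 0, 1: 1, 2: 2, 3: 3}

counts = alltoall_send_counts(token_owner, token_experts, expert_to_gpu, num_gpus=4)
received = [sum(row[dst] for row in counts) for dst in range(4)]

for src, row in enumerate(counts):
    print(f"GPU {src} sends {row}")
print("received per GPU:", received)
```

Diagonal entries are tokens whose expert lives on the sending GPU, so they never touch the network.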

The load imbalance problem:

Token routing is learned by the model — and it learns to favor some experts over others ("expert collapse"). When expert 0 is popular:

GPU 0 (expert 0): receives tokens from ALL other GPUs
Other GPUs: send large amounts to GPU 0, small amounts elsewhere

Result:
  GPU 0's NIC: running at 90% RX utilization (incast)
  Other NICs: underloaded
  ECMP can't help: the imbalance is at the destination, not on the path
  DCQCN reduces all senders' rates
  Other AllToAll paths (to uncongested experts) also slow down

Detection:

# If you have per-flow telemetry (INT):
# Look for one GPU's RX NIC counter running at 10x median

# Without INT:
ethtool -S ethX | grep rx_bytes
# Compare across nodes — uniform distribution expected
# Hot expert manifests as one node with 5x rx_bytes of others
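Once rx_bytes deltas have been collected from each node over the same interval, the comparison can be automated. This is a minimal sketch: the node names and byte counts are made up, the collection step is not shown, and the 5× threshold simply mirrors the rule of thumb above.

```python
from statistics import median

def hot_receivers(rx_bytes_by_node: dict, threshold: float = 5.0) -> list:
    """Return nodes whose RX byte delta is at least threshold x the median."""
    med = median(rx_bytes_by_node.values())
    if med == 0:
        return []
    return sorted(node for node, rx in rx_bytes_by_node.items()
                  if rx >= threshold * med)

# Hypothetical per-node rx_bytes deltas over one sampling interval.
sample = {"node0": 600_000, "node1": 100_000, "node2": 90_000, "node3": 110_000}
print("hot receivers:", hot_receivers(sample))
```

Comparing against the median rather than the mean keeps a single hot node from inflating its own baseline.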

Mitigation: This is primarily fixed in the model (auxiliary loss terms that penalize expert imbalance). But the network engineer needs to know this is happening, because it manifests as:

Unexplained DCQCN rate reductions
PFC events on one specific node's ToR port
Highly variable AllToAll completion times

4D parallelism — combining all four

Large model training combines all four in what's called 4D parallelism (DP + TP + PP + EP):

Example: GPT-3-scale 175B model with MoE layers, on 2048 GPUs
  TP = 8   (within each 8-GPU node, on NVLink)
  PP = 8   (8 pipeline stages, each 32 nodes wide)
  DP = 32  (32-way data parallel within each pipeline stage)
  EP = 8   (experts sharded across groups of 8 DP ranks within a stage)

Node layout:
  256 nodes organized in 8 PP stages of 32 nodes each
  Within each PP stage: 32 nodes × 8 GPUs = 256 GPUs
  TP=8 runs on NVLink within each node
  DP=32 AllReduces gradients across the 32 nodes of each PP stage
  EP=8 AllToAll runs within each PP stage, among groups of 8 DP ranks

What the fabric sees simultaneously:

DP AllReduce: within each PP stage, 256 GPUs AllReduce gradients in 8 groups of 32 ranks (one group per TP rank), cross-node, all-to-all within the DP group
PP point-to-point: activation tensors flowing between 8 stages, in a chain pattern
EP AllToAll: tokens moving between expert groups, all-to-all with variable load
TP AllGather/ReduceScatter: on NVLink within nodes, not on Ethernet
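To see which GPUs end up in which group, a global rank can be decomposed into TP, DP, and PP coordinates. The ordering used here (TP varies fastest, then DP, then PP) is an assumption chosen so that each TP group is made of consecutive ranks; frameworks may order the dimensions differently. The function name is hypothetical.

```python
def rank_to_coords(rank: int, tp: int, dp: int, pp: int) -> dict:
    """Decompose a global rank, assuming TP is innermost and PP outermost."""
    if not 0 <= rank < tp * dp * pp:
        raise ValueError("rank out of range")
    return {
        "tp": rank % tp,
        "dp": (rank // tp) % dp,
        "pp": rank // (tp * dp),
    }

TP, DP, PP = 8, 32, 8
print("world size:", TP * DP * PP)
for rank in (0, 7, 8, 256, 2047):
    print(rank, rank_to_coords(rank, TP, DP, PP))
```

Ranks that share the same "dp" and "pp" values form one TP group; ranks that share "tp" and "pp" form one DP group.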

The fabric must absorb all of these simultaneously. This is why fabric design decisions (oversubscription, adaptive routing, DCQCN tuning) matter so much at this scale.

The traffic matrix by parallelism strategy

Use this when talking to ML teams about what fabric they need:

Parallelism	Traffic matrix	Fan-out	Load predictability	PFC risk	ECMP suitability
Data parallel	All-to-all (within DP group)	N-1	High (same every step)	Medium	Good (many small flows)
Tensor parallel	All-to-all (within TP group, intra-node)	TP-1	High	Low (NVLink)	N/A (NVLink)
Pipeline parallel	Chain (stage i → stage i+1)	1	High (predictable)	Low	Good
Expert parallel	All-to-all (token routing)	N-1	Low (routing varies)	High	Poor (hot experts)
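The two basic shapes in the table, chain and all-to-all, can be written down as 0/1 traffic matrices, with fan-out read off as the row sum. This is a sketch of the pattern only: it marks who talks to whom, not how many bytes move, and the chain shows the forward direction only. The function names are hypothetical.

```python
def chain_matrix(n: int):
    """Stage i sends only to stage i + 1 (forward direction of a pipeline)."""
    return [[1 if dst == src + 1 else 0 for dst in range(n)] for src in range(n)]

def all_to_all_matrix(n: int):
    """Every rank sends to every other rank in the group."""
    return [[0 if src == dst else 1 for dst in range(n)] for src in range(n)]

def fan_out(matrix):
    """Number of distinct destinations per source rank."""
    return [sum(row) for row in matrix]

print("chain fan-out:     ", fan_out(chain_matrix(4)))
print("all-to-all fan-out:", fan_out(all_to_all_matrix(4)))
```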

What you should be able to do now

Given a model size and GPU count, describe which parallelism combination is likely in use
Map each parallelism type to its traffic pattern and identify which paths it uses
Explain why tensor parallelism should stay on NVLink
Describe what expert imbalance looks like in NIC counters
Tell an ML engineer what traffic matrix their parallelism choice creates

What's next

You've seen what the network has to carry and how. The last page in this section covers MFU in depth — the metric that tells you whether the network is succeeding or failing — and how to diagnose exactly where the bottleneck is.

Next → MFU and Diagnosis
