
Why the Network Matters

AI training does not happen on one machine. The models are too big, the data is too vast, the math is too slow. Training happens across hundreds or thousands of GPUs spread across a network. Your network.

This page is about what it costs you when something goes wrong.


Distributed training in 60 seconds

The setup: 1,000 servers, 8 GPUs each. That's 8,000 GPUs training one model together by splitting the work.

Each GPU processes a different slice of the training data, computes its own gradient, and then — here's where you come in — every GPU has to share its gradient with every other GPU before anyone can take the next step.

This synchronization operation is AllReduce. Think of it as an OSPF flood: every router has to have every other router's update before it can converge. No exceptions. No shortcuts.


The numbers that should keep you up at night

GPT-3 scale, doing the math:

| Metric | Value |
| --- | --- |
| Model parameters | 175 billion |
| Bytes per parameter | 4 (FP32) or 2 (BF16) |
| Gradient size per sync | 350 GB – 700 GB |
| Sync frequency | every 2–5 seconds |
| NIC speed (per NIC) | 400 Gbps (= 50 GB/s) |
| Time to move 700 GB at 400 Gbps | 14 seconds |
| But sync must happen every | 2–5 seconds |

There's the inversion. You need to move 700 GB in 2 seconds; a single 400G NIC takes 14. That's why GPU servers have 8 NICs. That's why every NIC is 400 Gbps moving to 800. The network is the bottleneck.
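
To make the inversion concrete, here is the arithmetic as a tiny script (illustrative numbers straight from the table, not measurements):

```python
# Back-of-envelope check of the inversion above. All numbers come from the
# table (GPT-3 scale, FP32 gradients); nothing here is a measurement.

PARAMS = 175e9           # model parameters
BYTES_PER_PARAM = 4      # FP32 gradients
NIC_GBPS = 400           # one 400G NIC
NICS_PER_SERVER = 8      # one NIC per GPU

gradient_gb = PARAMS * BYTES_PER_PARAM / 1e9   # ~700 GB per sync
nic_gb_per_s = NIC_GBPS / 8                    # 400 Gbps -> 50 GB/s

one_nic_seconds = gradient_gb / nic_gb_per_s
all_nics_seconds = gradient_gb / (nic_gb_per_s * NICS_PER_SERVER)

print(f"gradient per sync : {gradient_gb:.0f} GB")
print(f"one 400G NIC      : {one_nic_seconds:.1f} s   (budget is 2-5 s)")
print(f"eight 400G NICs   : {all_nics_seconds:.2f} s  (inside the budget)")
```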


One dropped packet stalls everything

In a web app, a dropped packet costs one user 200 ms and nobody notices.

In AI training, a dropped packet stalls everyone. The trace:

  1. GPU-7 on host-342 sends a gradient fragment to GPU-3 on host-891
  2. One packet in that fragment gets dropped
  3. RDMA doesn't quietly retransmit like TCP does (see Transport & CC for why)
  4. The RDMA operation either falls back to go-back-N, retransmitting everything after the lost packet (often millions of bytes), or fails the whole transfer (the sketch after this list puts rough numbers on this)
  5. GPU-3 can't finish its part of AllReduce until that fragment arrives
  6. No GPU can start the next training step until AllReduce finishes
  7. All 8,000 GPUs sit idle, burning electricity, waiting for one retransmit
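
A rough sketch of step 4, using made-up message and MTU sizes, shows how one lost packet turns into a pile of retransmitted bytes:

```python
# Why one drop is expensive with go-back-N recovery: everything after the
# lost packet gets resent, even bytes that already arrived intact.
# Message size, MTU, and drop position are assumptions for illustration.

MESSAGE_BYTES = 256 * 1024 * 1024   # one 256 MB gradient fragment (assumed)
MTU_PAYLOAD = 4096                  # bytes of payload per packet (assumed)
drop_offset = 16 * 1024 * 1024      # the drop lands 16 MB into the message

packets_total = MESSAGE_BYTES // MTU_PAYLOAD
packets_resent = (MESSAGE_BYTES - drop_offset) // MTU_PAYLOAD

print(f"packets in message : {packets_total:,}")
print(f"packets resent     : {packets_resent:,}  (for ONE lost packet)")
print(f"bytes resent       : {(MESSAGE_BYTES - drop_offset) / 1e6:.0f} MB")
```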

This is why the fabric has to be lossless — not "low loss," not "99.9%." Lossless. Zero drops.


The cost of waiting

| Cluster size | Cost per hour | Per minute | Per second |
| --- | --- | --- | --- |
| 256 H100 GPUs | $7,680 | $128 | $2.13 |
| 1,024 H100 GPUs | $30,720 | $512 | $8.53 |
| 4,096 H100 GPUs | $122,880 | $2,048 | $34.13 |

At ~$30/hour per H100, a 1,024-GPU cluster burns $512 every minute it sits idle. A 10-second network stall costs $85. A 1-minute outage costs $512. A congestion event that degrades training by 10% for an hour wastes $3,072.

Your queue-depth settings have a dollar value. Your PFC threshold has a dollar value. Your ECMP hash distribution has a dollar value. Engineering decisions you used to make on intuition are now financial decisions.
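
The same arithmetic as a small helper, assuming the ~$30 per H100-hour figure behind the table:

```python
# Putting a dollar figure on a stall, assuming the same ~$30 per H100-hour
# as the table above.

GPU_HOURLY_USD = 30.0

def idle_cost(num_gpus: int, stall_seconds: float) -> float:
    """Money burned while every GPU in the cluster waits on the network."""
    per_second = num_gpus * GPU_HOURLY_USD / 3600
    return per_second * stall_seconds

for gpus, stall in [(1024, 10), (1024, 60), (4096, 60)]:
    print(f"{gpus:>5} GPUs idle for {stall:>2} s -> ${idle_cost(gpus, stall):,.0f}")
```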


Why lossless: PFC and ECN

Two technologies make lossless Ethernet possible. You met them in Section 1; here's the motivation in one place:

  • PFC (Priority Flow Control — IEEE 802.1Qbb) — when a switch buffer fills, it sends a PAUSE frame upstream telling the sender to stop on that priority class. No drops. The trade-off: aggressive PFC causes head-of-line blocking and can deadlock. You'll tune this in the Switch QoS section.
  • ECN (Explicit Congestion Notification — RFC 3168) — the switch marks packets with a congestion bit instead of dropping them. The receiver echoes back; the sender slows down. Congestion without casualties.

Together they make RoCE v2 workable — the transport this curriculum picked back in Section 1.
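
For intuition only, here is a heavily simplified sketch of that feedback loop in the spirit of DCQCN (the real algorithm has rate-increase stages, timers, and per-queue-pair state that this ignores):

```python
# Toy model of "congestion without casualties": the switch marks instead of
# dropping, and the sender cuts its rate when marks are echoed back.
# This is a sketch in the spirit of DCQCN, not a faithful implementation.

LINE_RATE_GBPS = 400.0

class EcnSender:
    def __init__(self) -> None:
        self.rate = LINE_RATE_GBPS     # current sending rate
        self.target = LINE_RATE_GBPS   # rate to recover toward

    def on_cnp(self) -> None:
        """Congestion notification arrived: back off multiplicatively."""
        self.target = self.rate
        self.rate *= 0.5

    def on_quiet_period(self) -> None:
        """No marks lately: recover halfway back toward the target rate."""
        self.rate = (self.rate + self.target) / 2

s = EcnSender()
s.on_cnp()                      # switch marked packets; receiver echoed back
print(f"after ECN mark : {s.rate:.0f} Gbps")
for _ in range(3):
    s.on_quiet_period()         # congestion cleared; rate creeps back up
print(f"after recovery : {s.rate:.1f} Gbps")
```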


GPU placement is a network problem

The scheduler says "I need 8 GPUs." Simple? No. Where those GPUs land changes everything:

| Scenario | Where the GPUs are | Network path | Your problem? |
| --- | --- | --- | --- |
| 8 GPUs on Server A | Same server | NVLink only (1.8 TB/s) — zero packets on your fabric | No |
| 4 on Server A + 4 on Server B (same ToR) | Same leaf switch | server → ToR → server | Yes |
| 4 on Server A + 4 on Server C (different ToR) | Different leaf switches | server → leaf → spine → leaf → server | YES |

The rule: the further apart the GPUs land, the more network in the path, the slower the training, the more it costs. Scheduler placement is a network problem — and it's why rail topology exists.
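
A toy classifier (a hypothetical helper, not any real scheduler API) restates the table as code:

```python
# Toy view of the placement table: the fewer servers and ToRs a job spans,
# the less of it your fabric ever sees. Hypothetical helper, for illustration.

def fabric_path(servers: set, tors: set) -> str:
    if len(servers) == 1:
        return "NVLink only - zero packets on the fabric"
    if len(tors) == 1:
        return "leaf-local: server -> ToR -> server"
    return "cross-spine: server -> leaf -> spine -> leaf -> server"

print(fabric_path({"A"}, {"tor-1"}))                 # 8 GPUs on Server A
print(fabric_path({"A", "B"}, {"tor-1"}))            # split under one ToR
print(fabric_path({"A", "C"}, {"tor-1", "tor-2"}))   # split across the spine
```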


The three parallelism strategies

When a model is too big for one GPU, engineers split it. How they split it changes your traffic shape.

1. Data parallelism — most common (~90% of jobs)

Every GPU has the full model. Different slices of data.

GPU0: full model + data chunk 0 ─╮
GPU1: full model + data chunk 1 ─┼── AllReduce every step
GPU2: full model + data chunk 2 ─╯

Network pattern: AllReduce — heavy, periodic, all-to-all. This is your headache. Massive, sustained, every step.
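
One way to size that headache: with a flat ring AllReduce across all N GPUs, each GPU sends and receives roughly 2*(N-1)/N times the full gradient every step. A quick sketch with the earlier numbers:

```python
# Traffic volume per GPU per step with a flat ring AllReduce: roughly
# 2 * (N - 1) / N times the full gradient size. Gradient size is the FP32
# GPT-3 figure from the table earlier; real jobs usually run hierarchical
# collectives that keep part of this on NVLink.

GRADIENT_GB = 700
N_GPUS = 8000

gb_per_gpu_per_step = 2 * (N_GPUS - 1) / N_GPUS * GRADIENT_GB
print(f"each GPU sends ~{gb_per_gpu_per_step:.0f} GB of gradient traffic per step")
print("...and receives about the same. Every step. For the whole run.")
```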

2. Tensor parallelism — within a server

Model layers are split across GPUs in one box.

GPU0: layer 1 left half ◀──▶ GPU1: layer 1 right half
(NVLink, sub-μs)

Network pattern: all-to-all, very latency-sensitive. Lives on NVLink inside the server. You don't see it on the fabric.

3. Pipeline parallelism — across servers

Different layers on different GPUs, like an assembly line.

GPU0: layers 1–4 ─▶ GPU1: layers 5–8 ─▶ GPU2: layers 9–12
(one flow) (one flow)

Network pattern: point-to-point, sequential, moderate bandwidth. More manageable than data parallelism.

The big insight: data parallelism is your problem. Tensor parallelism is NVLink's problem. Pipeline parallelism is moderate.


The network engineer's cheat sheet

| Training concept | What you see on the network |
| --- | --- |
| AllReduce | Elephant flows between server pairs, ring or tree pattern |
| Data loading | Storage reads — HDFS / NFS to servers, high BW, bursty |
| Gradient compression | Smaller flows (some teams compress before sending) |
| Pipeline parallelism | Point-to-point flows between specific GPU pairs |
| Tensor parallelism | Usually NVLink inside server (invisible to you) |
| Checkpoint saving | Periodic huge writes to storage (every N minutes) |

What you should remember

  • Distributed training = every GPU has to sync gradients every few seconds
  • Gradient sizes = hundreds of GB to multiple TB per sync — same size as the model
  • One dropped packet stalls the whole cluster. RDMA has no quiet TCP-style retransmit; drops are catastrophic.
  • GPU clusters cost $500+/minute at the 1K-GPU scale. Your queue-depth and PFC settings have dollar values.
  • PFC + ECN make lossless Ethernet possible. The Switch QoS section is where you actually configure them.
  • GPU placement is a network problem. Same-server is free; cross-ToR is your fabric's worst day.

Everyday analogy

Imagine 8,000 construction workers building a house together. Every 3 seconds, every worker stops, shares their progress with every other worker, waits for everyone to sync up, then continues. If one worker's walkie-talkie drops a message, all 7,999 stand idle until it's resent.

At $128/minute for a 256-GPU cluster (or $2,048/minute for 4,096 GPUs), your network is the walkie-talkie system. A crackly signal costs real money.


Next: GPU vs CPU → — the machine on each end of that AllReduce, and why GPUs reshaped the network around them.