
Why the Network Matters

AI training does not happen on one machine. The models are too big, the data is too vast, the math is too slow. Training happens across hundreds or thousands of GPUs spread across a network. Your network.

This page is about what it costs you when something goes wrong.


Distributed training in 60 seconds

The setup: 1,000 servers, 8 GPUs each. That's 8,000 GPUs training one model together by splitting the work.

Each GPU processes a different slice of the training data, computes its own gradient, and then — here's where you come in — every GPU has to share its gradient with every other GPU before anyone can take the next step.

This synchronization operation is AllReduce. Think of it as an OSPF flood: every router has to have every other router's update before it can converge. No exceptions. No shortcuts.


The numbers that should keep you up at night

GPT-3 scale, doing the math:

| Metric | Value |
| --- | --- |
| Model parameters | 175 billion |
| Bytes per parameter | 4 (FP32) or 2 (BF16) |
| Gradient size per sync | 350 GB – 700 GB |
| Sync frequency | every 2–5 seconds |
| NIC speed (per NIC) | 400 Gbps (= 50 GB/s) |
| Time to move 700 GB at 400 Gbps | 14 seconds |
| But sync must happen every | 2–5 seconds |

There's the inversion. You need to move 700 GB in 2 seconds; a single 400G NIC takes 14. That's why GPU servers have 8 NICs. That's why every NIC is 400 Gbps moving to 800. The network is the bottleneck.
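
To make the inversion concrete, here is the arithmetic as a tiny script (illustrative numbers straight from the table, not measurements):

```python
# Back-of-envelope check of the inversion above. All numbers come from the
# table (GPT-3 scale, FP32 gradients); nothing here is a measurement.

PARAMS = 175e9           # model parameters
BYTES_PER_PARAM = 4      # FP32 gradients
NIC_GBPS = 400           # one 400G NIC
NICS_PER_SERVER = 8      # one NIC per GPU

gradient_gb = PARAMS * BYTES_PER_PARAM / 1e9   # ~700 GB per sync
nic_gb_per_s = NIC_GBPS / 8                    # 400 Gbps -> 50 GB/s

one_nic_seconds = gradient_gb / nic_gb_per_s
all_nics_seconds = gradient_gb / (nic_gb_per_s * NICS_PER_SERVER)

print(f"gradient per sync : {gradient_gb:.0f} GB")
print(f"one 400G NIC      : {one_nic_seconds:.1f} s   (budget is 2-5 s)")
print(f"eight 400G NICs   : {all_nics_seconds:.2f} s  (inside the budget)")
```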


One dropped packet stalls everything

In a web app, a dropped packet costs one user 200 ms and nobody notices.

In AI training, a dropped packet stalls everyone. The trace:

  1. GPU-7 on host-342 sends a gradient fragment to GPU-3 on host-891
  2. One packet in that fragment gets dropped
  3. RDMA doesn't quietly retransmit like TCP does (see Transport & CC for why)
  4. The RDMA operation either falls back to go-back-N, retransmitting everything after the lost packet (often millions of bytes), or fails the whole transfer (the sketch after this list puts rough numbers on this)
  5. GPU-3 can't finish its part of AllReduce until that fragment arrives
  6. No GPU can start the next training step until AllReduce finishes
  7. All 8,000 GPUs sit idle, burning electricity, waiting for one retransmit
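
A rough sketch of step 4, using made-up message and MTU sizes, shows how one lost packet turns into a pile of retransmitted bytes:

```python
# Why one drop is expensive with go-back-N recovery: everything after the
# lost packet gets resent, even bytes that already arrived intact.
# Message size, MTU, and drop position are assumptions for illustration.

MESSAGE_BYTES = 256 * 1024 * 1024   # one 256 MB gradient fragment (assumed)
MTU_PAYLOAD = 4096                  # bytes of payload per packet (assumed)
drop_offset = 16 * 1024 * 1024      # the drop lands 16 MB into the message

packets_total = MESSAGE_BYTES // MTU_PAYLOAD
packets_resent = (MESSAGE_BYTES - drop_offset) // MTU_PAYLOAD

print(f"packets in message : {packets_total:,}")
print(f"packets resent     : {packets_resent:,}  (for ONE lost packet)")
print(f"bytes resent       : {(MESSAGE_BYTES - drop_offset) / 1e6:.0f} MB")
```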

This is why the fabric has to be lossless — not "low loss," not "99.9%." Lossless. Zero drops.


The cost of waiting

| Cluster size | Cost per hour | Per minute | Per second |
| --- | --- | --- | --- |
| 256 H100 GPUs | $7,680 | $128 | $2.13 |
| 1,024 H100 GPUs | $30,720 | $512 | $8.53 |
| 4,096 H100 GPUs | $122,880 | $2,048 | $34.13 |

At ~$30/hour per H100, a 1,024-GPU cluster burns $512 every minute it sits idle. A 10-second network stall costs $85. A 1-minute outage costs $512. A congestion event that degrades training by 10% for an hour wastes $3,072.

Your queue-depth settings have a dollar value. Your PFC threshold has a dollar value. Your ECMP hash distribution has a dollar value. Engineering decisions you used to make on intuition are now financial decisions.
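
The same arithmetic as a small helper, assuming the ~$30 per H100-hour figure behind the table:

```python
# Putting a dollar figure on a stall, assuming the same ~$30 per H100-hour
# as the table above.

GPU_HOURLY_USD = 30.0

def idle_cost(num_gpus: int, stall_seconds: float) -> float:
    """Money burned while every GPU in the cluster waits on the network."""
    per_second = num_gpus * GPU_HOURLY_USD / 3600
    return per_second * stall_seconds

for gpus, stall in [(1024, 10), (1024, 60), (4096, 60)]:
    print(f"{gpus:>5} GPUs idle for {stall:>2} s -> ${idle_cost(gpus, stall):,.0f}")
```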


Why lossless: PFC and ECN

Two technologies make lossless Ethernet possible. You met them in Section 1; here's the motivation in one place:

  • PFC (Priority Flow Control — IEEE 802.1Qbb) — when a switch buffer fills, it sends a PAUSE frame upstream telling the sender to stop on that priority class. No drops. The trade-off: aggressive PFC causes head-of-line blocking and can deadlock. You'll tune this in the Switch QoS section.
  • ECN (Explicit Congestion Notification — RFC 3168) — the switch marks packets with a congestion bit instead of dropping them. The receiver echoes back; the sender slows down. Congestion without casualties.

Together they make RoCE v2 workable — the transport this curriculum picked back in Section 1.
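
For intuition only, here is a heavily simplified sketch of that feedback loop in the spirit of DCQCN (the real algorithm has rate-increase stages, timers, and per-queue-pair state that this ignores):

```python
# Toy model of "congestion without casualties": the switch marks instead of
# dropping, and the sender cuts its rate when marks are echoed back.
# This is a sketch in the spirit of DCQCN, not a faithful implementation.

LINE_RATE_GBPS = 400.0

class EcnSender:
    def __init__(self) -> None:
        self.rate = LINE_RATE_GBPS     # current sending rate
        self.target = LINE_RATE_GBPS   # rate to recover toward

    def on_cnp(self) -> None:
        """Congestion notification arrived: back off multiplicatively."""
        self.target = self.rate
        self.rate *= 0.5

    def on_quiet_period(self) -> None:
        """No marks lately: recover halfway back toward the target rate."""
        self.rate = (self.rate + self.target) / 2

s = EcnSender()
s.on_cnp()                      # switch marked packets; receiver echoed back
print(f"after ECN mark : {s.rate:.0f} Gbps")
for _ in range(3):
    s.on_quiet_period()         # congestion cleared; rate creeps back up
print(f"after recovery : {s.rate:.1f} Gbps")
```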


GPU placement is a network problem

The scheduler says "I need 8 GPUs." Simple? No. Where those GPUs land changes everything:

| Scenario | Where the GPUs are | Network path | Your problem? |
| --- | --- | --- | --- |
| 8 GPUs on Server A | Same server | NVLink only (1.8 TB/s) — zero packets on your fabric | No |
| 4 on Server A + 4 on Server B (same ToR) | Same leaf switch | server → ToR → server | Yes |
| 4 on Server A + 4 on Server C (different ToR) | Different leaf switches | server → leaf → spine → leaf → server | YES |

The rule: the further apart the GPUs land, the more network in the path, the slower the training, the more it costs. Scheduler placement is a network problem — and it's why rail topology exists.
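
A toy classifier (a hypothetical helper, not any real scheduler API) restates the table as code:

```python
# Toy view of the placement table: the fewer servers and ToRs a job spans,
# the less of it your fabric ever sees. Hypothetical helper, for illustration.

def fabric_path(servers: set, tors: set) -> str:
    if len(servers) == 1:
        return "NVLink only - zero packets on the fabric"
    if len(tors) == 1:
        return "leaf-local: server -> ToR -> server"
    return "cross-spine: server -> leaf -> spine -> leaf -> server"

print(fabric_path({"A"}, {"tor-1"}))                 # 8 GPUs on Server A
print(fabric_path({"A", "B"}, {"tor-1"}))            # split under one ToR
print(fabric_path({"A", "C"}, {"tor-1", "tor-2"}))   # split across the spine
```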


The three parallelism strategies

When a model is too big for one GPU, engineers split it. How they split it changes your traffic shape.

1. Data parallelism — most common (~90% of jobs)

Every GPU has the full model. Different slices of data.

GPU0: full model + data chunk 0 ─╮
GPU1: full model + data chunk 1 ─┼── AllReduce every step
GPU2: full model + data chunk 2 ─╯

Network pattern: AllReduce — heavy, periodic, all-to-all. This is your headache. Massive, sustained, every step.
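
One way to size that headache: with a flat ring AllReduce across all N GPUs, each GPU sends and receives roughly 2*(N-1)/N times the full gradient every step. A quick sketch with the earlier numbers:

```python
# Traffic volume per GPU per step with a flat ring AllReduce: roughly
# 2 * (N - 1) / N times the full gradient size. Gradient size is the FP32
# GPT-3 figure from the table earlier; real jobs usually run hierarchical
# collectives that keep part of this on NVLink.

GRADIENT_GB = 700
N_GPUS = 8000

gb_per_gpu_per_step = 2 * (N_GPUS - 1) / N_GPUS * GRADIENT_GB
print(f"each GPU sends ~{gb_per_gpu_per_step:.0f} GB of gradient traffic per step")
print("...and receives about the same. Every step. For the whole run.")
```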

2. Tensor parallelism — within a server

Model layers are split across GPUs in one box.

GPU0: layer 1 left half ◀──▶ GPU1: layer 1 right half
(NVLink, sub-μs)

Network pattern: all-to-all, very latency-sensitive. Lives on NVLink inside the server. You don't see it on the fabric.

3. Pipeline parallelism — across servers

Different layers on different GPUs, like an assembly line.

GPU0: layers 1–4 ─▶ GPU1: layers 5–8 ─▶ GPU2: layers 9–12
(one flow) (one flow)

Network pattern: point-to-point, sequential, moderate bandwidth. More manageable than data parallelism.

The big insight: data parallelism is your problem. Tensor parallelism is NVLink's problem. Pipeline parallelism is moderate.


The network engineer's cheat sheet

| Training concept | What you see on the network |
| --- | --- |
| AllReduce | Elephant flows between server pairs, ring or tree pattern |
| Data loading | Storage reads — HDFS / NFS to servers, high BW, bursty |
| Gradient compression | Smaller flows (some teams compress before sending) |
| Pipeline parallelism | Point-to-point flows between specific GPU pairs |
| Tensor parallelism | Usually NVLink inside server (invisible to you) |
| Checkpoint saving | Periodic huge writes to storage (every N minutes) |

What you should remember

  • Distributed training = every GPU has to sync gradients every few seconds
  • Gradient sizes = hundreds of GB to multiple TB per sync — same size as the model
  • One dropped packet stalls the whole cluster. RDMA has no quiet TCP-style retransmit; drops are catastrophic.
  • GPU clusters cost $500+/minute at the 1K-GPU scale. Your queue-depth and PFC settings have dollar values.
  • PFC + ECN make lossless Ethernet possible. The Switch QoS section is where you actually configure them.
  • GPU placement is a network problem. Same-server is free; cross-ToR is your fabric's worst day.

Everyday analogy

Imagine 8,000 construction workers building a house together. Every 3 seconds, every worker stops, shares their progress with every other worker, waits for everyone to sync up, then continues. If one worker's walkie-talkie drops a message, all 7,999 stand idle until it's resent.

At $128/minute for a 256-GPU cluster (or $2,048/minute for 4,096 GPUs), your network is the walkie-talkie system. A crackly signal costs real money.


Next: GPU vs CPU → — the machine on each end of that AllReduce, and why GPUs reshaped the network around them.