
From CPU to GPU — what your fabric inherited

You've spent 20 years tuning networks around CPUs — small flows, decorrelated, TCP retransmits handle the occasional drop. AI runs on GPUs, and the network around them had to change because of it. This page is the why, and what it did to your wire.

After this page, you'll be able to
  1. Explain why CPUs can't do AI — the four walls (compute, memory, model fit, parallelism) — and the network angle hidden in each.
  2. Contrast web traffic vs AI backend traffic — different per-flow size, different sync, different blast radius. Same wires.
  3. Recite the rule: a collective finishes when the slowest GPU finishes. Tail latency dominates everything else.
📖 Glossary: CPU / GPU / fabric terms (covers both pages of this chapter)

Term | What It Means | Networking Analogy
🧮 SM (Streaming Multiprocessor) | The execution unit on a GPU. H100 has 132 SMs × 128 CUDA cores each | A line card on a chassis switch, many in parallel
HBM (High-Bandwidth Memory) | Stacked DRAM mounted on the GPU package, right next to the die. HBM3e ≈ 3 TB/s | TCAM glued to the silicon, ×30 the bandwidth
🔗 NVLink | NVIDIA's GPU-to-GPU interconnect. 900 GB/s per GPU on H100, 1.8 TB/s on B100 | The scale-up fabric: inside-the-box, TB/s
🌀 NVSwitch | The switch silicon that wires NVLink into an all-to-all mesh | A non-blocking crossbar at the chip level
🚌 PCIe Gen5 ×16 | CPU ↔ GPU ↔ NIC bus inside a server. ~128 GB/s | The backplane between blades
📋 SXM5 | The H100 form factor: a module that plugs into a baseboard | A blade form factor, not a card
🏠 HGX / MGX / OAM | Baseboard / chassis standards that hold 4–8 GPUs | A reference chassis spec
🌊 Collective | A group communication primitive: AllReduce, AllGather, ReduceScatter, AllToAll | The AI version of LSA flooding
🎯 Tail latency | The p99 / p99.9 / p99.99 latency, where the slowest 1-in-10,000 packet lands | The outlier. In AI, it dominates everything
📉 Hash polarization | When ECMP collapses a small set of elephant flows onto the same path | Once it polarizes, it stays stuck
🚧 Oversubscription | Less uplink bandwidth than downlink. Standard 4:1 in DC, intolerable in AI | The cost-saver that died with AllReduce
🚏 Rail-optimized | 1 NIC per GPU, each pair on the same NUMA node | Strict 1:1 mapping between compute and uplink
🧠 Branch prediction | CPU trick: speculate which way an if-statement goes | GPUs don't bother. They run all paths in parallel

1. Why GPUs, not CPUs

A CPU isn't slow at AI. A CPU is the wrong tool entirely. Four walls hit at once:

[Figure: Why CPUs can't do AI: four walls of inefficiency. Wall 1, compute volume: GPT-3 training needs 3 × 10²³ FLOPs against ~1 TFLOPS from a CPU. Wall 2, the memory wall: CPU + DDR5 at 100 GB/s vs GPU + HBM3e at 3 TB/s, a 30× gap that starves CPU cores. Wall 3, model fit and scaling: a 700 GB GPT-3 model doesn't fit on one CPU; a multi-GPU cluster uses an NVLink fabric at 1.8 TB/s. Wall 4, architectural parallelism mismatch: CPU caches, branch prediction, and out-of-order execution vs GPU SIMT and warp-wide parallelism. Bottom band, the network connection: CPU + kernel + TCP stack is a bottleneck because the CPU touches every packet; RDMA + kernel bypass carries a 700 GB AllReduce every 2 seconds.]

In one frame:

  • Wall 1: Math volume. GPT-3 training ≈ 3 × 10²³ FLOPs. A CPU delivers ~1 TFLOPS on matrix workloads. Single-CPU time: 3 × 10¹¹ seconds, or ~10,000 years. GPU cluster: weeks. That is a gap of five to six orders of magnitude.
  • Wall 2: Memory bandwidth. DDR5 ~100 GB/s vs HBM3e ~3 TB/s. 30× the throughput. Even with 1,000× the cores, a CPU's RAM bus would starve them.
  • Wall 3: Model doesn't fit. GPT-3 = 700 GB (175B parameters × 4 bytes each). Spreading it across CPUs means moving most of it over the network every step, and CPUs at PCIe speed can't move 700 GB in 2 seconds: that takes 350 GB/s sustained.
  • Wall 4: Parallelism mismatch. CPUs are built for branchy, decision-heavy code. AI training is wide, uniform, branchless matrix math. Every CPU trick is wasted; every GPU trick (SIMT, warp-wide parallelism, HBM) is exactly the job.
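The arithmetic behind Walls 1 to 3 can be checked in a few lines. This is only a back-of-envelope sketch: the constants are the page's round numbers (the PCIe figure comes from the glossary), and the function names are made up for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

# Round numbers taken from the text and glossary above.
GPT3_TRAINING_FLOPS = 3e23            # total training compute
CPU_FLOPS_PER_SEC = 1e12              # ~1 TFLOPS on matrix workloads
DDR5_BYTES_PER_SEC = 100e9            # ~100 GB/s
HBM_BYTES_PER_SEC = 3e12              # ~3 TB/s
MODEL_BYTES = 700e9                   # GPT-3 at 700 GB
STEP_SECONDS = 2.0                    # one sync every 2 seconds
PCIE_GEN5_X16_BYTES_PER_SEC = 128e9   # glossary figure


def single_device_years(total_flops: float, flops_per_sec: float) -> float:
    """Years to finish the job on one device at a sustained rate."""
    return total_flops / flops_per_sec / SECONDS_PER_YEAR


def required_bytes_per_sec(bytes_per_step: float, step_seconds: float) -> float:
    """Sustained bandwidth needed to move one step's data within the step."""
    return bytes_per_step / step_seconds


cpu_years = single_device_years(GPT3_TRAINING_FLOPS, CPU_FLOPS_PER_SEC)
memory_gap = HBM_BYTES_PER_SEC / DDR5_BYTES_PER_SEC
needed = required_bytes_per_sec(MODEL_BYTES, STEP_SECONDS)
pcie_shortfall = needed / PCIE_GEN5_X16_BYTES_PER_SEC

print(f"Wall 1: ~{cpu_years:,.0f} years on a single CPU")
print(f"Wall 2: HBM is {memory_gap:.0f}x DDR5 bandwidth")
print(f"Wall 3: {needed / 1e9:.0f} GB/s needed, "
      f"{pcie_shortfall:.1f}x one PCIe Gen5 x16 link")
```

The single-CPU figure comes out just under 10,000 years, and the 350 GB/s needed for Wall 3 is well beyond what one PCIe Gen5 ×16 link offers.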

And here's the network angle (the bottom band of the diagram): even if you scaled CPU cores for the math, the CPU couldn't keep up with the data movement. A 700 GB AllReduce hits the NIC every 2 seconds; a CPU touching every packet through the kernel is the bottleneck before the math is. RDMA exists because CPUs couldn't even handle the packet path. Kernel bypass isn't a nice-to-have — it's the only way the cluster runs at all.
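To get a feel for why a CPU touching every packet is the bottleneck, here is a rough sketch of the packet rate implied by the numbers above. It treats the full 700 GB as one aggregate stream, as the text does (a real cluster spreads it over many NICs), and the 4096-byte MTU is an assumption, not a figure from this page.

```python
ALLREDUCE_BYTES = 700e9   # from the text: 700 GB per step
STEP_SECONDS = 2.0        # from the text: one step every 2 seconds
MTU_BYTES = 4096          # ASSUMPTION: illustrative payload size per packet


def packets_per_second(bytes_per_step: float, step_seconds: float,
                       mtu_bytes: int) -> float:
    """Aggregate packet rate needed to move one step's data in time."""
    return bytes_per_step / step_seconds / mtu_bytes


def per_packet_budget_ns(pps: float) -> float:
    """Time available per packet, in nanoseconds, if handled serially."""
    return 1e9 / pps


pps = packets_per_second(ALLREDUCE_BYTES, STEP_SECONDS, MTU_BYTES)
budget_ns = per_packet_budget_ns(pps)

print(f"{pps / 1e6:.1f} M packets/s, {budget_ns:.1f} ns per packet")
```

Under these assumptions the budget is roughly a dozen nanoseconds per packet in aggregate, which leaves no room for a kernel network stack on the data path.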


2. What changed on your wire

Same wires. Two completely different problems:

[Figure: Web traffic vs AI backend traffic, split panel. Left, Web Traffic (The Busy Restaurant): independent micro-flows of small packets (bytes to KB), no synchronization, high loss tolerance (a dropped packet is a simple retry), standard TCP and ECMP with 4:1 oversubscription; blast radius is one user with a 200 ms delay. Right, AI Backend Traffic (The Synchronized Assembly Plant): synchronized elephant flows in all-to-all sync, up to 700 GB exchanged per cycle every few seconds, zero-loss requirement (a single stalled flow halts thousands of GPUs, costing hundreds of dollars per minute), GPU-Direct via RDMA to bypass CPU overhead; blast radius is 8,000+ GPUs with job stalls. Center divider: "Same wires. Different physics."]

  • Web traffic = a busy restaurant. Hundreds of customers, each ordering something different, each at their own pace. Independent. Drops are recoverable. The data plane you've spent 20 years tuning.
  • AI backend traffic = an assembly plant. 8,000 workers building ONE car together. Every 2 seconds a bell rings, every worker syncs at the central yard, runs back, works 2 seconds, and the bell rings again. For weeks. One slow worker stalls the other 7,999. Your new data plane.

The seven things that flip:

Property | Web (restaurant) | AI backend (assembly plant)
Per-flow size | Bytes to KB | Gigabytes to terabytes
Sync between flows | None | All flows must finish at the same instant
Loss tolerance | Drop = retry, nobody notices | Drop = whole job stalls
ECMP behavior | Statistically balanced | A few elephant flows polarize onto the same paths
Oversubscription | 4:1 normal | 1:1 mandatory
Buffer absorption | Soaks up bursts | Fills in microseconds, then PFC fires
Failure blast radius | One user, ~200 ms | All 8,000 GPUs, $/sec burning
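The ECMP row is easy to see in a toy simulation. The sketch below hashes each flow's 5-tuple onto one of 8 equal-cost paths and compares the busiest path against a perfectly even share. SHA-256 is a stand-in for whatever hash a real switch uses, and the addresses, flow counts, and sizes are invented for illustration.

```python
import hashlib


def ecmp_path(flow: tuple, num_paths: int) -> int:
    """Pick a path by hashing the 5-tuple (SHA-256 stands in for the switch hash)."""
    digest = hashlib.sha256(repr(flow).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_paths


def imbalance(flows: list, num_paths: int) -> float:
    """Busiest-path load divided by the ideal even share (1.0 = perfect)."""
    load = [0.0] * num_paths
    for five_tuple, size_bytes in flows:
        load[ecmp_path(five_tuple, num_paths)] += size_bytes
    ideal = sum(load) / num_paths
    return max(load) / ideal


NUM_PATHS = 8

# 10,000 small web flows (10 KB each, TCP to port 443), hypothetical hosts.
web_flows = [(("10.0.0.1", "10.0.1.1", 6, 1024 + i, 443), 10e3)
             for i in range(10_000)]

# 4 elephant flows (10 GB each, UDP to port 4791 as used by RoCEv2).
ai_flows = [(("10.0.0.1", "10.0.1.1", 17, 49152 + i, 4791), 10e9)
            for i in range(4)]

web_imbalance = imbalance(web_flows, NUM_PATHS)
ai_imbalance = imbalance(ai_flows, NUM_PATHS)

print(f"web: busiest path carries {web_imbalance:.2f}x its fair share")
print(f"AI:  busiest path carries {ai_imbalance:.2f}x its fair share")
```

With thousands of small flows the law of large numbers keeps every path close to its fair share. With four equal elephants on eight paths, at least four paths sit idle no matter how good the hash is, so the busiest path carries at least twice its share, and more if two elephants collide.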

The bell ringing every 2 seconds is your AllReduce. Every GPU's gradient has to be combined with every other GPU's. No skipping. No buffering for later. The gradient is the same size as the model (the figures below work out to 4 bytes per parameter):

Model | Gradient per step
Llama 2 7B | 28 GB
GPT-3 175B | 700 GB
Llama 3 405B | 1.6 TB
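The table's numbers follow from parameter count times bytes per parameter. A minimal sketch, assuming 4 bytes per parameter (FP32), which is inferred from the figures above rather than stated on this page:

```python
BYTES_PER_PARAM = 4  # ASSUMPTION: FP32 gradients, inferred from the table


def gradient_bytes(num_params: int, bytes_per_param: int = BYTES_PER_PARAM) -> int:
    """Size of one full gradient: one value per model parameter."""
    return num_params * bytes_per_param


MODELS = {
    "Llama 2 7B": 7_000_000_000,
    "GPT-3 175B": 175_000_000_000,
    "Llama 3 405B": 405_000_000_000,
}

sizes_gb = {name: gradient_bytes(params) / 1e9 for name, params in MODELS.items()}

for name, gb in sizes_gb.items():
    print(f"{name}: {gb:,.0f} GB per step")
```

Halving the bytes per parameter (for example 2-byte gradients) halves every row, so the traffic volume scales directly with the precision chosen for the exchange.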

For GPT-3, that's 700 GB synchronized across every GPU, every 2–5 seconds, for weeks. A collective finishes when the slowest GPU finishes — one slow link stalls thousands.

This is the inversion that broke 30 years of fabric design intuition:

You used to optimize for average throughput. Now you optimize for worst-case latency.

The p99.99 packet is the one you have to engineer for. (Full treatment of the traffic shape in What AI Does to Your Network.)
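A quick calculation shows why the p99.99 case is the one that matters at this scale. The sketch assumes each GPU independently has a 1-in-10,000 chance of landing in the tail on a given step, which is a simplification (real stragglers are often correlated), and the function names are illustrative.

```python
def collective_time(per_gpu_seconds: list) -> float:
    """A collective finishes when the slowest participant finishes."""
    return max(per_gpu_seconds)


def prob_step_hits_tail(num_gpus: int, tail_fraction: float) -> float:
    """Chance that at least one of num_gpus lands in the tail on a step,
    assuming independent events."""
    return 1.0 - (1.0 - tail_fraction) ** num_gpus


# One straggler sets the pace for everyone.
step = collective_time([2.0, 2.0, 2.0, 2.5])

# 8,000 GPUs, each with a 0.01% chance of a tail event per step.
p_tail = prob_step_hits_tail(8_000, 1e-4)

print(f"step time: {step} s")
print(f"steps that hit the p99.99 tail: {p_tail:.1%}")
```

Under these assumptions, more than half of all steps include at least one p99.99 event, so the "rare" outlier is effectively the typical step time for the whole job.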


💡 What you should remember

🚫 CPUs can't do AI. Four walls: ⛰️ math volume, 🚧 memory bandwidth, 📦 model doesn't fit, 🧠 wrong parallelism shape.
🔗 The traffic shifted. Web = many small independent flows. AI = a few synchronized elephant flows that finish together.
🎯 Tail latency is THE latency. A collective finishes when the slowest GPU finishes; one slow link stalls thousands.
💥 0.1% loss = 10× throughput hit. RDMA has no graceful retransmit. Loss is catastrophic, not "fine."
🚧 Oversubscription died. 4:1 was normal in your DC. AI fabrics are 1:1. No exceptions.

Next: Inside GPU Anatomy → Open the box: the GPU vendor landscape, NVLink, NVSwitch, RDMA NICs, and a component-by-component walk of the DGX H100 reference design.