Why AI Networks Are Different

There's no single way to build an AI cluster. You can run it on Kubernetes or on bare metal. InfiniBand or Ethernet. RoCEv2 or UEC or Falcon. DCQCN or HPCC or Swift. Each layer of the stack has a design space, and the right answer depends on your scale, your vendors, and what you've already deployed.

This curriculum walks every one of those design spaces — what exists, who's running what, what this course picks, and why. Each section ends with one combination so you can build it; the comparison stays on the page so you can swap when your stack differs.

This section covers two of those choices: transport (how bytes move between GPUs) and congestion control (what happens when too many bytes try to move at once). Host networking, Kubernetes-or-not, GPU drivers, and topology each get their own section.

This page is the setup. It answers four questions:

  1. What is transport?
  2. What is congestion control?
  3. How much of this do you already know from running enterprise / DC networks?
  4. Why does AI training break that baseline?

Once these are clear, the next two pages walk the full design space.


1. What is transport?

Transport is whatever sits on top of IP and gets your bytes from one host to another. In your day job that's TCP and UDP. In an AI fabric it's something different — but the role is the same: framing, reliability, ordering, flow control, multipath, encryption, and the API the application uses to push bytes through.

In an AI training cluster, the transport carries gradients (the math output of one GPU that the others need) between GPU NICs. Every training step is a synchronized burst across thousands of GPUs. The transport decides how fast they sync — and how much CPU it costs to do it.
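
To put numbers on that burst, here's a back-of-envelope sketch. Every value is hypothetical: a 7B-parameter model, fp16 gradients, an 8-GPU ring, one 400 Gbps NIC per GPU, and no compute/communication overlap (real systems overlap the two, so this is the naive worst case).

```python
# Back-of-envelope: how much data one AllReduce step moves, and how
# long the sync takes at line rate. All numbers are hypothetical.

params = 7e9            # model parameters (assumed 7B model)
bytes_per_grad = 2      # fp16 gradients
n_gpus = 8              # ring size
link_gbps = 400         # per-GPU NIC speed

grad_bytes = params * bytes_per_grad

# A ring AllReduce sends 2*(N-1)/N of the buffer over each GPU's link.
wire_bytes = 2 * (n_gpus - 1) / n_gpus * grad_bytes

seconds = wire_bytes * 8 / (link_gbps * 1e9)
print(f"gradient buffer:    {grad_bytes / 1e9:.1f} GB")
print(f"bytes on each link: {wire_bytes / 1e9:.1f} GB")
print(f"sync at line rate:  {seconds * 1e3:.0f} ms per step")
```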

2. What is congestion control?

Congestion control is what the network does when more traffic is offered than a link can carry. It's not "if the link is full" — it's "the link is about to be full, and we need to decide who slows down before packets start dropping or queues blow up."

In an AI fabric this matters because one congested link can stall a synchronized collective (AllReduce, AllGather) across the entire job. A 0.1% packet loss rate is fine for TCP. It's a 10× throughput hit for RDMA.
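
The exact multiplier depends on the NIC, but a toy model shows the shape of the problem. It assumes go-back-N retransmission (what classic RoCE NICs implement) plus hypothetical window and stall values:

```python
# Toy model of why a tiny loss rate craters RDMA goodput. Under
# go-back-N, one lost packet forces retransmission of everything sent
# after it, and the loss event also stalls the pipeline. All numbers
# below are assumptions for illustration.

p = 0.001          # 0.1% packet loss
window = 1000      # packets in flight (rough bandwidth-delay product)
stall = 5000       # packet-times idled per loss event (assumed timeout)

# Each loss wastes roughly half a window of retransmitted tail plus
# the stall while the sender notices and rewinds.
waste_per_loss = window / 2 + stall
efficiency = 1 / (1 + p * waste_per_loss)

print(f"goodput fraction at {p:.1%} loss: {efficiency:.0%}")  # ~15%
# TCP with SACK resends only the missing segment, so the same loss
# rate costs it a few percent of throughput, not most of the link.
```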


3. What you already know

You've been running both of these every day.

Transport baseline:

  • L4 protocols — TCP (reliable, ordered, connection-oriented) and UDP (best-effort, fire-and-forget)
  • Sockets API — socket(), connect(), send(), recv() — every app talks to the kernel; the kernel talks to the NIC (see the sketch after this list)
  • TCP mechanics — three-way handshake, sliding window, SACK, fast retransmit, MSS / MTU, segmentation
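
For contrast with the kernel bypass story in section 4, a minimal reminder of that kernel-mediated path (Python over the loopback, purely for brevity):

```python
# The path every enterprise app uses today: socket(), connect(),
# send(), recv(). Each call crosses into the kernel, which copies
# data between user buffers and the NIC.
import socket

# A connected pair over the loopback stands in for client/server.
a, b = socket.socketpair()

a.send(b"gradients would go here")   # user space -> kernel buffer
data = b.recv(1024)                  # kernel buffer -> user space
print(data)

a.close()
b.close()
```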

Congestion control baseline:

  • TCP CC — slow start, congestion avoidance, fast retransmit; Reno → Cubic → BBR (sketched after this list)
  • ECN (RFC 3168) — switch marks the IP header at congested egress; receiver echoes back; sender slows down
  • RED / WRED — random early drop; the switch starts dropping before the queue is full
  • PFC (IEEE 802.1Qbb) — link-level pause frames per priority class; backpressure instead of drop
  • QoS toolkit — DSCP, COS, queue scheduling, buffer profiles, headroom, watermarks
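
As a refresher, a heavily simplified sketch of the Reno-style control loop: window in MSS units, losses injected at two arbitrary points, all values illustrative.

```python
# Slow start doubles the window each RTT until ssthresh, congestion
# avoidance adds one MSS per RTT, and a loss halves the window.

cwnd, ssthresh = 1.0, 64.0       # in MSS units (hypothetical)
trace = []
for rtt in range(40):
    if rtt in (20, 30):          # pretend a loss is detected here
        ssthresh = cwnd / 2      # multiplicative decrease
        cwnd = ssthresh          # Reno-style fast recovery
    elif cwnd < ssthresh:
        cwnd *= 2                # slow start
    else:
        cwnd += 1                # congestion avoidance (additive)
    trace.append(round(cwnd, 1))
print(trace)
```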

That's the foundation. Every primitive above shows up again in AI fabrics — just wired into a different control loop.


4. Why AI training breaks that baseline

Three forces, all hitting at once.

400 Gbps and the kernel tax

The piece you've probably never dealt with before: kernel bypass. Your stack today says NIC → driver → kernel → socket buffer → user space. Every hop is a tax — context switch, copy, queue. At 1 Gbps nobody cared. At 400 Gbps, with millions of packets per second, every microsecond of CPU on the wire side is a microsecond the GPU is sitting idle waiting for gradients.
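
The arithmetic behind that claim, assuming a hypothetical 4 KB MTU on a single 400 Gbps port:

```python
# At 400 Gbps the per-packet time budget shrinks to nanoseconds.

link_bps = 400e9
mtu_bytes = 4096                  # assumed RoCE-style MTU

pkts_per_sec = link_bps / (mtu_bytes * 8)
ns_per_pkt = 1e9 / pkts_per_sec

print(f"{pkts_per_sec / 1e6:.1f} M packets/sec")    # ~12.2 M
print(f"{ns_per_pkt:.0f} ns budget per packet")     # ~82 ns
# A syscall plus a copy costs on the order of a microsecond, more
# than 10x this budget. That is the case for kernel bypass.
```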

An AI fabric uses RDMA — Remote Direct Memory Access. The NIC reads and writes remote memory directly; the OS is not in the path. RDMA is a technique, not a protocol — the two protocols that implement it are InfiniBand (a separate fabric) and RoCE v2 (RDMA on Ethernet). The mechanics — verbs, queue pairs, memory regions, the three operations — live in the RDMA section. For this page, just know that RDMA is what eliminates the CPU tax at 400 Gbps.

The slowest GPU sets the pace

A training step finishes when every GPU finishes its part of the collective. The slowest link sets the pace for thousands of GPUs. TCP's "back off and retry on loss" model — fine for a web request — becomes a job-killer when a 10 ms stall multiplies across 10,000 GPUs synchronizing every few hundred milliseconds.
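
A quick simulation of that rule, with a hypothetical step time, per-GPU jitter, and one GPU stalled for 10 ms:

```python
# The step ends when the slowest participant ends.
import random
random.seed(1)

n_gpus = 10_000
step_ms = 300                     # nominal step time (hypothetical)

# Small jitter per GPU, then stall one GPU behind a congested link.
finish = [step_ms + random.uniform(0, 1) for _ in range(n_gpus)]
finish[42] += 10

print(f"median finish: {sorted(finish)[n_gpus // 2]:.1f} ms")
print(f"step time:     {max(finish):.1f} ms")   # everyone waits
# Every step pays the worst-case stall, so the tail latency of one
# link becomes the throughput of the whole job.
```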

[Figure: eight GPUs in a ring collective — seven marked done, one stalled behind a congested link; a timeline shows the step ending only when the slow GPU finishes.]
A collective finishes when every GPU finishes. One slow link sets the pace for the entire job.

0.1% loss = 10× throughput hit for RDMA

RDMA was designed for lossless underlays (InfiniBand had hardware credit-based flow control from day one). When you move RDMA to Ethernet, you either fake "lossless" with PFC (pause the priority class before the buffer overflows) or you redesign the transport to tolerate loss (UEC, MRC, Falcon, SRD all do this). Either way, the everyday "ECMP hash + small buffer + occasional drop" recipe of a Clos fabric stops working.
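
A rough sketch of why PFC-based "lossless" isn't free: the per-priority headroom a switch buffer must reserve, computed with assumed cable length, processing delay, and MTU.

```python
# After a switch sends PAUSE, frames already in flight keep arriving.
# Headroom must absorb roughly one cable round trip plus response
# time plus a frame at each end. All values are assumptions.

rate_bps = 400e9
cable_m = 100                 # assumed cable length
prop_s_per_m = 5e-9           # ~5 ns/m in fiber
mtu_bytes = 4096
proc_s = 1e-6                 # assumed NIC/switch response time

rtt_s = 2 * cable_m * prop_s_per_m + proc_s
headroom_bytes = rate_bps / 8 * rtt_s + 2 * mtu_bytes

print(f"per-priority headroom: {headroom_bytes / 1024:.0f} KiB")
# Multiply by ports and priority classes and the buffer bill is real,
# one reason UEC, Falcon and SRD choose to tolerate loss instead.
```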


The new mental model

Congestion control reorganizes into three layers, each catching what the previous one couldn't:

  • CC algorithm — runs on the NIC. In steady state, dials the send rate up and down based on round-trip signals. No input from the switches; no intervention from the kernel.
  • ECN (RFC 3168) — the switch marks the IP header at congested egress before dropping. The sender's NIC sees the mark, and the CC algorithm dials the rate down proactively. No packets dropped, no link paused.
  • PFC (IEEE 802.1Qbb) — the switch sends a PAUSE frame on the priority class. The sender stops cold until the pause clears. Only fires when the CC layer was too slow.

Where TCP backs off after a packet drops, an RDMA fabric uses ECN to warn the sender before drops happen, and the NIC adjusts its send rate proactively. PFC is the safety net underneath both.
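
To make that loop concrete, here is a heavily simplified sketch of a DCQCN-style rate controller on the NIC. The constants loosely follow the DCQCN paper's defaults; real NIC firmware differs in the details.

```python
# ECN marks come back to the sender as congestion notifications. The
# NIC cuts its rate in proportion to alpha, a moving estimate of how
# congested the path is, and recovers toward the old rate otherwise.

g = 1 / 256           # alpha gain (DCQCN-paper-style default)
line_rate = 400.0     # Gbps

rate, target, alpha = line_rate, line_rate, 1.0
marks = [False] * 5 + [True] * 3 + [False] * 12   # synthetic signal
for tick, marked in enumerate(marks):
    if marked:                            # congestion notification
        target = rate                     # remember where we were
        rate *= 1 - alpha / 2             # multiplicative decrease
        alpha = (1 - g) * alpha + g       # path looks congested
    else:
        alpha = (1 - g) * alpha           # congestion estimate decays
        rate = (rate + target) / 2        # fast recovery toward target
    print(f"t={tick:2d}  rate={rate:6.1f} Gbps  alpha={alpha:.3f}")
```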

[Figure: three-level escalation. Level 0 (green): Normal — CC algorithm in steady state, no signal, no intervention. If congestion is brewing → Level 1 (yellow): ECN + CC — switch marks the IP header at congested egress, the NIC sees it, CC dials the rate down before drops. If still congested → Level 2 (red): PFC — switch sends PAUSE on the priority class, the sender stops cold.]
Three layers of escalation. CC handles steady state; ECN turns CC into a proactive brake before drops happen; PFC is the safety net underneath both.

Where this leaves us

The network is no longer just packet-forwarding infrastructure — it is part of the distributed compute system itself.

The next two pages walk the actual design space — every transport that exists today, every CC algorithm in production, who built each, and what tradeoff each one is buying.

📄 Transports & Congestion Control — One-Pager — the dense reference card behind these pages. Print it, pin it to your wall.