The Field Guide
Transport + Congestion Control. The wire and the brakes of an AI fabric.
This guide covers the two layers that make an AI fabric an AI fabric: transport (how bytes move between GPUs) and congestion control (what happens when too many bytes try to move at once). Everything else in the stack (host networking, Kubernetes, GPU drivers, topology) gets covered in its own phase of the curriculum.
Transports & Congestion Control One-Pager: the dense reference card behind this page. Print it, pin it to your wall.
1. Transport
What is transport?
Transport is whatever sits on top of IP and gets your bytes from one host to another. In your day job that's TCP and UDP. In an AI fabric it's something different, but the role is the same: framing, reliability, ordering, flow control, multipath, encryption, and the API the application uses to push bytes through.
In an AI training cluster, the transport carries gradients (the math output of one GPU that the others need) between GPU NICs. Every training step is a synchronized burst across thousands of GPUs. The transport decides how fast they sync, and how much CPU it costs to do it.
What you already know
You've been running this every day:
- L4 protocols: TCP (reliable, ordered, connection-oriented) and UDP (best-effort, fire-and-forget)
- Sockets API: socket(), connect(), send(), recv(). Every app talks to the kernel; the kernel talks to the NIC
- TCP mechanics: three-way handshake, sliding window, SACK, fast retransmit, MSS / MTU, segmentation
- Congestion control: slow start, Reno, Cubic, BBR. Backs off when the network drops or marks packets
- QoS tooling: DSCP, CoS, queue-per-priority, RED/WRED, ECN marking at congested egress
The piece you've probably never dealt with before: kernel bypass. Your stack today says NIC → driver → kernel → socket buffer → user space. Every hop is a tax: context switch, copy, queue. At 1 Gbps nobody cared. At 400 Gbps, with millions of packets per second, every microsecond of CPU on the wire side is a microsecond the GPU is sitting idle waiting for gradients.
So AI fabrics use RDMA (Remote Direct Memory Access). Think of it as the DMA you already know from the PCIe bus, but across the wire. The NIC reads and writes the remote host's memory directly; the OS isn't in the path. A new API called verbs replaces sockets: you post a Work Request to a Queue Pair, the NIC does the rest, and you get a completion event when it's done. No syscall per packet.
A clean analogy: sockets are to verbs what BGP network statements are to a route policy. Sockets are imperative ("send these bytes through the kernel"); verbs are declarative ("here's a memory region, here's a queue, work it out").
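To make that concrete, here is roughly what "post a Work Request to a Queue Pair" looks like with the standard libibverbs API. This is a minimal sketch, not production code: it assumes the queue pair, completion queue, and registered memory region already exist (that setup is the bulk of any real RDMA program), and the function name rdma_write_once is made up for illustration.

```c
/* Minimal verbs sketch: post one RDMA WRITE work request, then wait for its
 * completion.  Assumes qp, cq, and local_mr were created/connected earlier.
 * Compile with: gcc -c ... -libverbs */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int rdma_write_once(struct ibv_qp *qp, struct ibv_cq *cq,
                    struct ibv_mr *local_mr, void *local_buf, uint32_t len,
                    uint64_t remote_addr, uint32_t rkey)
{
    /* Scatter/gather element: which local memory the NIC should read from. */
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = len,
        .lkey   = local_mr->lkey,
    };

    /* The work request: "write this buffer into remote_addr on the peer". */
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.wr_id               = 1;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.send_flags          = IBV_SEND_SIGNALED;     /* ask for a completion */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    /* Hand the descriptor to the NIC -- no syscall, no copy into the kernel. */
    if (ibv_post_send(qp, &wr, &bad_wr))
        return -1;

    /* Poll the completion queue until the NIC reports the write is done.
     * Production code would typically use a completion channel instead. */
    struct ibv_wc wc;
    int n;
    while ((n = ibv_poll_cq(cq, 1, &wc)) == 0)
        ;
    return (n < 0 || wc.status != IBV_WC_SUCCESS) ? -1 : 0;
}
```

The shape is the point: the application describes work, the NIC executes it, and the result comes back through the completion queue with no kernel involvement on the data path.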
How transport evolved: 50 years on one axis
Era-by-era, in one line:
- 1973: TCP/IP is designed for unreliable WANs. Reliability over latency. The world adopts it.
- 1980: UDP (RFC 768), connectionless L4 for cases where TCP's overhead is too much.
- 1990s: HPC outgrows TCP. Specialized fabrics (Myrinet, Quadrics) appear, optimized for deterministic low latency.
- 1999: InfiniBand is born. The IBTA (Compaq, Dell, HP, IBM, Intel, Microsoft, Sun) defines RDMA and the verbs API. Mellanox (founded the same year) becomes the dominant silicon vendor as IB ships.
- 2000: SCTP (RFC 2960, later revised as RFC 4960), multi-stream, multi-home; finds its home in telecom (SIGTRAN, 5G N2).
- 2001: ECN (RFC 3168) is standardized. The network can mark, not just drop.
- 2007: iWARP RFCs (5040–5045), RDMA over TCP. Doesn't need lossless Ethernet but loses to RoCE on performance.
- 2010: RoCE v1 brings RDMA to Ethernet, L2-only. Pairs with DCB (PFC + ETS + DCBX) to fake a lossless fabric on commodity Ethernet.
- 2012: QUIC is invented at Google. User-space, multiplexed, UDP-based. Standardized as RFC 9000 in 2021; carries HTTP/3.
- 2013: MPTCP (RFC 6824), TCP across multiple paths.
- 2014: RoCE v2 wraps RDMA in UDP/IP. Routable across L3, works with ECMP, BGP underlays, Clos fabrics. The "real" RoCE.
- 2015: DCQCN paper (Microsoft Research, SIGCOMM). ECN-driven, NIC-level rate control. Proves RoCE at hyperscale. RoCE moves from research to production.
- 2018: AWS EFA / SRD ships. Out-of-order delivery, packet spraying, libfabric. The first major hyperscaler-custom transport.
- 2019: NVIDIA buys Mellanox (~$6.9B). GPUs + NICs + switches under one vendor. Vertically integrated AI fabric begins.
- 2020: Alibaba eRDMA, RDMA for cloud tenants on Alibaba Cloud ECS.
- 2023: NVIDIA Spectrum-X announced at COMPUTEX. Vertically integrated AI-Ethernet stack.
- 2023: Ultra Ethernet Consortium (UEC) forms: AMD, Arista, Broadcom, Cisco, HPE, Intel, Meta, Microsoft, and ~50 others. Without NVIDIA. Goal: AI Ethernet that doesn't depend on PFC.
- 2023: Google Falcon unveiled at OCP Summit. HW transport, multi-ULP (RDMA + NVMe). Runs on the Intel E2100 IPU.
- 2024: MRC (Multipath Reliable Connection) announced: OpenAI + AMD + Microsoft + NVIDIA + Broadcom + Intel via OCP. Evolution of RoCE v2 with packet spraying, μs failover, and verbs compatibility.
- 2025: UEC 1.0 ships. Packet spraying, multipath, selective retransmission, an AI-tuned transport designed to operate without PFC.
The arc:
The network is no longer just packet-forwarding infrastructure; it is part of the distributed compute system itself.
The four families
The transport landscape today sits in four distinct buckets. Each solves a different problem, and they coexist:
| # | Family | Solves |
|---|---|---|
| 1 | Classic IP transports | General networking โ internet, applications, control plane |
| 2 | RDMA transports (traditional) | Kernel-bypass for HPC and traditional AI clusters |
| 3 | AI / hyperscaler custom transports | RDMA-class transport at 100K+ GPU scale (multipath, no PFC dependence) |
| 4 | Scale-up interconnects | Intra-server / intra-rack GPU-to-GPU communication (different domain) |
Family 1: Classic IP transports (Layer 4)
You know these. Listed for completeness.
| Protocol | Owner / Std | Reliable | Ordered | Multipath | Encryption | Key trait | Used by / for |
|---|---|---|---|---|---|---|---|
| TCP | IETF | Yes | Yes | No | External (TLS) | Byte-stream, AIMD CC, HoL blocking | Web, SSH, SMTP; universal |
| UDP | IETF | No | No | No | External (DTLS) | Connectionless, low overhead | DNS, DHCP, VoIP, gaming, QUIC base |
| SCTP | IETF | Yes | Per-stream | Multi-home failover | External (DTLS) | Multi-streaming + multi-homing | Telecom: SS7/SIGTRAN, Diameter, 5G N2 |
| DCCP | IETF | No | No | No | No | Unreliable + congestion control | Mostly research / abandoned |
| QUIC | IETF (Google origin) | Yes | Per-stream | Connection migration | Built-in (TLS 1.3) | 0/1-RTT setup, user-space | HTTP/3: Google, Cloudflare, Meta, Apple |
| MPTCP | IETF | Yes | Yes | Yes (subflows) | External | TCP across multiple paths | Apple Siri/iOS, Samsung, Linux |
| UDP-Lite | IETF | No | No | No | External | Partial checksum | Loss-tolerant codecs |
QUIC is the standout here for an AI-adjacent network engineer: it runs fully in user space, ships with TLS 1.3 built in, and can migrate a connection across paths. You'll meet it on the inference and edge path.
Family 2: RDMA transports (traditional)
The transports that built HPC and early AI. All share the IBTA verbs API, all are kernel-bypass.
| Protocol | Owner / Std | Substrate | Lossless required | Multipath | Encryption | Key trait | Used by / for |
|---|---|---|---|---|---|---|---|
| InfiniBand | NVIDIA / IBTA | IB fabric | Yes (credit-based FC) | RD mode only | Optional | Native RDMA, sub-μs latency | DGX SuperPOD, TOP500 HPC, Meta RSC |
| RoCE v1 | IBTA (open) | Ethernet L2 | Yes (PFC) | Limited | Optional | IB transport over Ethernet, non-routable | Same-subnet RDMA |
| RoCE v2 | IBTA (open) | UDP/IP | Yes (PFC) | Limited (ECMP) | Optional | IB transport over UDP, routable | Azure, Meta, Tencent, ByteDance, Baidu |
| iWARP | IETF (open) | TCP/IP | No | Via TCP | Optional | RDMA over TCP, no PFC needed | Intel E810, Chelsio (niche) |
| IB Verbs modes | IBTA | – | – | – | – | RC, RD, UC, UD transport modes | RC dominant in production |
Key takeaway: RDMA at scale (RoCE v2) needs a lossless underlay (PFC) and doesn't multipath naturally. Both of those constraints break at 100K+ GPU scale, which is why the next family exists.
Family 3: AI / hyperscaler custom transports (the new generation)
Each major hyperscaler hit RoCEv2's ceiling and built their own transport. The common pattern: packet spraying for multipath, built-in encryption, out-of-order delivery with hardware reassembly, and microsecond failover when a link or switch breaks.
| Protocol | Owner | Substrate | Lossless? | Multipath | Encryption | Key trait | Used by / for |
|---|---|---|---|---|---|---|---|
| MRC (Multipath Reliable Conn.) | OpenAI + AMD/MS/NV/Broadcom/Intel via OCP | Ethernet/IP | No | Yes (packet spray) | Built-in | Evolution of RoCE v2; μs failover; verbs-compat | OpenAI training, Microsoft Fairwater, Oracle Abilene |
| Falcon | Google via OCP | Ethernet/IP | No | Yes (PLB) | Built-in (PSP/IPsec) | HW transport, multi-ULP (RDMA + NVMe) | Google Cloud, Intel E2100 IPU |
| SRD (Scalable Reliable Datagram) | AWS | Ethernet/IP | No | Yes (packet spray) | Built-in | Out-of-order delivery, hw-offloaded, libfabric API | AWS EFA: EC2 P5, Trn1/Trn2, HPC |
| UET (Ultra Ethernet Transport) | UEC consortium (open) | Ethernet/IP | No | Yes (packet spray) | Built-in | Open standard; ~75% from HPE Slingshot; libfabric 2.0 | Industry target: 1M+ endpoint scale |
| Pony Express | Google (legacy) | Ethernet/IP | No | Limited | Optional | SW-only predecessor to Falcon; ran in Snap microkernel | Older Google datacenter (superseded) |
| eRDMA | Alibaba | VPC/Ethernet | No | Yes | Optional | RDMA for cloud tenants | Alibaba Cloud ECS |
The pattern is the same everywhere: drop the PFC dependency, spray packets across all available paths, hardware-offload reassembly. UET is the open-standard convergence target; expect MRC and Falcon ideas to fold in over time.
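As a rough illustration of that pattern, here is a toy, self-contained sketch of packet spraying and receiver-side reassembly. Nothing here touches a real NIC or any of the transports above; the packet struct, the round-robin spray, and the reassembly bitmap are simplified stand-ins for what SRD, Falcon, MRC, and UET do in hardware.

```c
/* Toy simulation of packet spraying: every packet of a message takes the next
 * path round-robin, arrives out of order, and is stitched back together at
 * the receiver by sequence number. */
#include <stdio.h>
#include <stdlib.h>

#define NUM_PATHS 4
#define NUM_PKTS  16

struct packet {
    int seq;    /* position within the message */
    int path;   /* which fabric path carried it */
};

int main(void)
{
    struct packet wire[NUM_PKTS];

    /* Sender: spray across paths round-robin (no per-flow pinning, unlike
     * classic ECMP, which hashes a whole flow onto one path). */
    for (int seq = 0; seq < NUM_PKTS; seq++) {
        wire[seq].seq  = seq;
        wire[seq].path = seq % NUM_PATHS;
    }

    /* "Network": shuffle arrival order to mimic unequal path latencies. */
    srand(42);
    for (int i = NUM_PKTS - 1; i > 0; i--) {
        int j = rand() % (i + 1);
        struct packet tmp = wire[i]; wire[i] = wire[j]; wire[j] = tmp;
    }

    /* Receiver: accept out-of-order arrivals, reassemble by sequence number. */
    int received[NUM_PKTS] = {0};
    for (int i = 0; i < NUM_PKTS; i++) {
        received[wire[i].seq] = 1;
        printf("arrived seq=%2d via path %d\n", wire[i].seq, wire[i].path);
    }

    /* The message is complete once every sequence number has been seen. */
    int complete = 1;
    for (int i = 0; i < NUM_PKTS; i++)
        complete &= received[i];
    printf("message %s\n", complete ? "reassembled" : "incomplete");
    return 0;
}
```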
Family 4: Scale-up interconnects
Scale-up is intra-server / intra-rack. Scale-out (everything above) is inter-server. Scale-up connects GPUs inside one logical box; scale-out connects boxes to other boxes. They are disjoint problems with different bandwidths (TB/s vs Gbps), different latencies (ns vs μs), and different protocols.
| Protocol | Owner | Domain | Key trait | Used by / for |
|---|---|---|---|---|
| NVLink / NVSwitch / NVLink Fabric | NVIDIA (proprietary) | Scale-up GPU | Up to 1.8 TB/s per GPU; sub-μs | DGX, GB200 NVL72, HGX |
| UALink | AMD / Broadcom / Cisco / Google / HPE / Intel / Meta / MS (open) | Scale-up GPU | Open NVLink alternative; v1.0 in 2025 | Future open AI servers |
| SUE (Scale-Up Ethernet) | Broadcom | Scale-up GPU | Simpler than UET; ≤1.6 Tbps, ~100 ns device latency | Broadcom AI silicon |
| ICI (Inter-Chip Interconnect) | Google | Scale-up TPU | Native TPU pod fabric | TPU v4 / v5p / Trillium pods |
| Slingshot / Portals 4 | HPE (Cray) | HPC scale-out | Adaptive routing; UET 1.0 lineage (~75%) | Frontier, El Capitan, Aurora, leadership HPC |
| OmniPath | Cornelis Networks (ex-Intel) | HPC scale-out | InfiniBand-style fabric | Some HPC sites |
| RDS (Reliable Datagram Sockets) | Oracle | Cluster IPC | Reliable datagrams over IB / RoCE / TCP | Oracle RAC interconnect |
| TIPC | Ericsson / Linux | Cluster IPC | Topology-aware cluster messaging | Telecom clusters |
GB200 NVL72 puts 72 GPUs on one NVLink Switch fabric; that's one logical machine over scale-up. RDMA only takes over at the rack boundary.
Mental model: the 6-point synthesis
- Classic IP transports cover the internet. TCP, UDP, QUIC. Universal but kernel-bound.
- RDMA family (IB, RoCE v2, iWARP) covers traditional HPC/AI, but needs a lossless fabric (PFC) and doesn't multipath well.
- Each hyperscaler built a custom multipath transport because RoCEv2 doesn't scale to 100K+ GPUs: Google → Falcon, AWS → SRD, OpenAI/Microsoft/NVIDIA/AMD → MRC, Alibaba → eRDMA.
- UET is the open-standard convergence target. Expect MRC and Falcon ideas to fold into it over time.
- Scale-up (NVLink / UALink / SUE / ICI) is intra-server and disjoint from scale-out transports. Different domain, different physics, different protocols.
- Congestion control matters as much as the transport. Most modern AI fabrics combine packet spraying + delay-based CC + ECN/INT signals + microsecond failover.
Who built what
| Tech | Owner / Standards body | When |
|---|---|---|
| TCP/IP | IETF (DARPA) | 1973–1980s |
| UDP | IETF | 1980 |
| InfiniBand spec | IBTA consortium | 1999 |
| Mellanox IB silicon | Mellanox Technologies (Israel) | 1999; acquired by NVIDIA 2019 |
| RDMA Verbs API | IBTA / OpenFabrics Alliance | 2000s |
| SCTP | IETF | 2000 |
| RoCE v1 | IBTA | 2010 |
| QUIC | Google → IETF | 2012 / RFC 9000 in 2021 |
| MPTCP | IETF | 2013 |
| RoCE v2 | IBTA | 2014 |
| DCQCN | Microsoft Research | SIGCOMM 2015 |
| Pony Express | Google (legacy) | ~2014–2023 |
| AWS EFA / SRD | AWS | 2018+ |
| eRDMA | Alibaba | 2020 |
| Spectrum-X | NVIDIA | 2023 |
| Ultra Ethernet Consortium | AMD, Arista, Broadcom, Cisco, HPE, Intel, Meta, Microsoft, +50 others | Founded 2023, spec 1.0 in 2025 |
| Falcon | Google + Intel (E2100 IPU) | 2023 |
| MRC | OpenAI + AMD + Microsoft + NVIDIA + Broadcom + Intel (OCP) | 2024 |
| UALink Consortium | AMD, Broadcom, Cisco, Google, HPE, Intel, Meta, Microsoft | 2024 / v1.0 in 2025 |
| SUE (Scale-Up Ethernet) | Broadcom | 2024 |
| ICI | Google | TPU v4 era (~2018) |
| Slingshot | HPE (Cray) | 2019 |
2. Congestion Control
What is congestion control?
Congestion control is what the network does when more traffic is offered than a link can carry. It's not "the link is full"; it's "the link is about to be full, and we need to decide who slows down before packets start dropping or queues blow up."
In an AI fabric this matters because one congested link can stall a synchronized collective (AllReduce, AllGather) across the entire job. A 0.1% packet loss rate is fine for TCP. It's a 10× throughput hit for RDMA.
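A toy back-of-the-envelope model of why, assuming the NIC recovers losses with a go-back-N scheme (as early RoCE NICs did): one lost packet forces a replay of roughly half the outstanding window, so even a tiny loss rate burns a large share of the wire. The numbers below are illustrative only.

```c
/* Toy comparison: retransmission overhead at 0.1% loss for selective repeat
 * (TCP + SACK) versus go-back-N recovery.  Not a real throughput model. */
#include <stdio.h>

int main(void)
{
    double loss   = 0.001;    /* 0.1% packet loss rate */
    double window = 1000.0;   /* packets in flight when a loss is detected */

    /* TCP with SACK: roughly only the lost packet is resent. */
    double tcp_retx = loss;

    /* Go-back-N: each loss replays, on average, about half the window. */
    double gbn_retx = loss * window / 2.0;

    printf("TCP + SACK : ~%.1f%% of packets are retransmissions\n", tcp_retx * 100);
    printf("go-back-N  : ~%.0f%% of packets are retransmissions\n", gbn_retx * 100);
    return 0;
}
```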
What you already know
You've shipped this in production:
- TCP congestion control: slow start, congestion avoidance, fast retransmit; Reno → Cubic → BBR
- ECN (RFC 3168): switch marks the IP header at congested egress; receiver echoes back; sender slows down
- RED / WRED: random early drop; the switch starts dropping before the queue is full
- PFC (IEEE 802.1Qbb): link-level pause frames per priority class; backpressure instead of drop
- QoS toolkit: DSCP, CoS, queue scheduling, buffer profiles, headroom, watermarks
An AI fabric uses these same primitives, but the closed-loop algorithm running on the NIC is different. Where TCP backs off after a packet drops, an RDMA fabric uses ECN to warn the sender before drops happen, and the NIC adjusts its send rate proactively (a sketch of that loop follows the list below). PFC is the safety net underneath.
A mental model:
- PFC = backpressure (pull the emergency brake: stop sending on this priority right now)
- ECN = early warning (the dashboard light flashes: slow down, congestion ahead)
- CC algorithm = the driver's foot: uses ECN signals to dial back the rate before PFC has to fire
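A stripped-down sketch of that loop, loosely modeled on DCQCN (covered below): the receiver turns ECN marks into congestion notification packets (CNPs), the sending NIC cuts its rate multiplicatively when CNPs arrive, then climbs back while the path stays quiet. The constants and the two-function structure are simplifications for illustration; the real algorithm (alpha averaging, fast recovery, hyper increase) lives in NIC firmware.

```c
/* DCQCN-flavoured rate control, heavily simplified. */
#include <stdio.h>

#define LINE_RATE_GBPS 400.0

struct cc_state {
    double rate;    /* current send rate, Gbps */
    double alpha;   /* running estimate of congestion severity */
};

/* CNP received: bump the congestion estimate, cut the rate multiplicatively. */
static void on_cnp(struct cc_state *s)
{
    s->alpha = 0.9 * s->alpha + 0.1;
    s->rate  = s->rate * (1.0 - s->alpha / 2.0);
}

/* Quiet timer tick (no CNP): decay the estimate, increase additively. */
static void on_quiet_period(struct cc_state *s)
{
    s->alpha = 0.9 * s->alpha;
    s->rate += 5.0;                       /* additive step, Gbps */
    if (s->rate > LINE_RATE_GBPS)
        s->rate = LINE_RATE_GBPS;
}

int main(void)
{
    struct cc_state s = { .rate = LINE_RATE_GBPS, .alpha = 1.0 };

    /* Simulate a congestion burst (CNPs arriving), then a quiet recovery. */
    for (int t = 0; t < 5; t++)  { on_cnp(&s);          printf("t=%2d CNP   rate=%6.1f Gbps\n", t, s.rate); }
    for (int t = 5; t < 15; t++) { on_quiet_period(&s); printf("t=%2d quiet rate=%6.1f Gbps\n", t, s.rate); }
    return 0;
}
```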
How congestion control evolved
Era-by-era, in one line:
- 1988: TCP slow start + congestion avoidance (Berkeley / IETF). The field is born.
- 2001: ECN standardized (RFC 3168). The network can mark, not just drop.
- 2006: CUBIC becomes the Linux default. Cubic growth function; fills bandwidth aggressively.
- 2010: DCTCP (Microsoft / Stanford, SIGCOMM). Fine-grained ECN-based CC for datacenter TCP. Proves precise ECN works.
- 2015: DCQCN (Microsoft Research, SIGCOMM). DCTCP's ideas applied to RoCE. Canonical RDMA CC.
- 2015: TIMELY (Google, SIGCOMM). RTT-gradient-based. Delay, not ECN, as the signal.
- 2016: BBR (Google). Model-based; fills the bottleneck without filling buffers. Now in YouTube, QUIC, the Linux kernel.
- 2017: NDP / Trim (Cambridge research). Switch trims payload on congestion instead of dropping.
- 2018: AWS SRD CC ships with EFA. Path-level feedback; sub-10 ms RTO.
- 2018: Homa (Stanford research). Receiver-driven, priorities + grants; eliminates HoL via message-oriented design.
- 2019: HPCC (Alibaba, SIGCOMM). Uses In-band Network Telemetry (INT) for precise per-link telemetry.
- 2020: Swift (Google, SIGCOMM). Successor to TIMELY. Decomposes host vs fabric latency. Becomes Falcon's CC core.
- 2022: PowerTCP (NSDI). Bandwidth × queue depth combined signal.
- 2023: Spectrum-X CC (NVIDIA). Switch + NIC co-designed. Closed system.
- 2023: Falcon CC = Swift + CSIG + Carousel (Google + Intel E2100). HW per-flow shaping, multipath PLB.
- 2024: MRC CC (OpenAI / Microsoft / NVIDIA / AMD / Broadcom / Intel). Programmable CC + μs rerouting.
- 2025: UET CC (Ultra Ethernet Consortium). Two-sided CC for a packet-sprayed environment.
The arc:
Congestion control moved from a reactive software algorithm (TCP) to a proactive, hardware-co-designed, fabric-wide control loop (DCQCN, HPCC, UET).
CC algorithm families
TCP family: what runs the internet
| Algorithm | Family | Signal | Key trait | Where used |
|---|---|---|---|---|
| Reno / NewReno | TCP | Loss | Classic AIMD; baseline | Legacy TCP everywhere |
| CUBIC | TCP | Loss | Cubic growth; default in Linux/Windows | Most internet TCP today |
| Vegas / Westwood | TCP | Delay / bw-est | Delay-based; low queueing | Niche / research |
| Compound TCP | TCP | Loss + delay | Hybrid | Older Windows |
| BBR v1 / v2 / v3 | TCP / QUIC | Bandwidth + RTT | Model-based; fills bottleneck without filling buffers | Google services, YouTube, QUIC, Linux kernel |
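For reference, the "classic AIMD" behavior in the Reno row above boils down to a few lines: additive increase of the congestion window each RTT, multiplicative decrease on loss. A toy sketch, with slow start, fast retransmit, and every real-world detail omitted:

```c
/* Classic AIMD in miniature: +1 segment per RTT on success, halve on loss. */
#include <stdio.h>

int main(void)
{
    double cwnd = 10.0;                 /* congestion window, in segments */

    for (int rtt = 1; rtt <= 20; rtt++) {
        int loss = (rtt % 8 == 0);      /* pretend a drop every 8th RTT */
        if (loss)
            cwnd = cwnd / 2.0;          /* multiplicative decrease */
        else
            cwnd += 1.0;                /* additive increase, +1 MSS per RTT */
        printf("rtt=%2d %s cwnd=%5.1f\n", rtt, loss ? "loss" : "ack ", cwnd);
    }
    return 0;
}
```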
Datacenter & RDMA family: built for tight datacenter networks
| Algorithm | Family | Signal | Key trait | Where used |
|---|---|---|---|---|
| DCTCP | DC TCP | ECN marking | Proportional reaction to ECN; small queues | Microsoft, Linux DC TCP stacks |
| DCQCN | RoCE v2 | ECN + PFC | Default RoCE v2 CC; rate-based | Azure, Meta, Tencent; most RoCE clusters |
| TIMELY | RDMA | Delay (RTT gradient) | Delay-based, CPU-light | Google early RDMA |
| Swift | RDMA / Falcon | Delay (NIC RTT) | Decomposes host vs fabric latency; basis for Falcon | Google Falcon |
| HPCC | RDMA | In-band telemetry (INT) | Precise rate using switch INT data | Alibaba |
| PowerTCP | DC TCP / RDMA | Power = bw ร queue | Combines bandwidth and queue depth signals | Research / select DCs |
AI / hyperscaler custom CC: the new generation
| Algorithm | Family | Signal | Key trait | Where used |
|---|---|---|---|---|
| MRC CC | MRC | Multipath telemetry | Programmable CC + microsecond rerouting | OpenAI / Microsoft Fairwater / Oracle Abilene |
| Falcon CC (Swift + CSIG + Carousel) | Falcon | Delay + congestion sig. | HW per-flow shaping, multipath PLB | Google + Intel E2100 |
| SRD CC | SRD | Path-level feedback | Avoids overloaded paths; <10 ms RTO | AWS EFA |
| UET CC | UET | Sender + receiver based | Two-sided CC for packet-sprayed environment | Ultra Ethernet 1.0 |
| Spectrum-X CC | Spectrum-X | Switch + NIC telemetry | Switch + NIC co-designed | NVIDIA Spectrum-X |
Research / exotic
| Algorithm | Family | Signal | Key trait | Where used |
|---|---|---|---|---|
| Homa | Receiver-driven | Priorities + grants | Eliminates HoL via priorities; message-oriented | Stanford research, influential |
| NDP / Trim | Switch-assisted | Header trimming | Switch trims payload on congestion; no whole-packet drop | Cambridge research |
| ExpressPass | Credit-based | Receiver credits | Receiver paces with credit packets | Research |
| EQDS | Edge-queued | Edge-based shaping | Pushes queues to edges, not core | Cambridge / UCL research |
Mental model
Most modern AI fabrics combine four ideas: packet spraying + delay-based CC + ECN/INT signals + microsecond failover.
Pick any production AI-fabric CC algorithm โ MRC, Falcon, SRD, UET CC, Spectrum-X CC โ and you'll find some combination of these four. The PFC-only era is ending.
Who built what
| Tech | Owner / Standards body | When |
|---|---|---|
| TCP Reno / NewReno | Berkeley / IETF | 1988 |
| ECN (RFC 3168) | IETF | 2001 |
| CUBIC | NCSU โ Linux | 2006 |
| DCTCP | Microsoft / Stanford | SIGCOMM 2010 |
| DCQCN | Microsoft Research | SIGCOMM 2015 |
| TIMELY | Google | SIGCOMM 2015 |
| BBR | Google | 2016 |
| NDP / Trim | Cambridge | 2017 |
| AWS SRD CC | AWS | 2018 |
| Homa | Stanford | 2018 |
| HPCC | Alibaba | SIGCOMM 2019 |
| Swift | Google | SIGCOMM 2020 |
| PowerTCP | Research consortium | NSDI 2022 |
| Spectrum-X CC | NVIDIA | 2023 |
| Falcon CC | Google + Intel | 2023 |
| MRC CC | OpenAI + Microsoft + NVIDIA + AMD + Broadcom + Intel | 2024 |
| UET CC | Ultra Ethernet Consortium | 2024โ2025 |
Transports & Congestion Control One-Pager: same content, denser, single-sheet print format
What this curriculum walks through
Of all the options above, the course teaches:
- Transport: RoCEv2
- Congestion control: DCQCN + PFC/ECN
Why this combination: it's the most-deployed RDMA-on-Ethernet pattern in 2026, vendor-neutral, well-documented in public standards (IBTA, IEEE), and what you'll see at most hyperscalers running training on Ethernet today.
If your stack differs (you run InfiniBand, UEC, MRC, Falcon, SRD, or another custom transport), the same protocol vocabulary still applies โ you swap the implementation, not the concepts.
The other layers of the AI fabric โ host networking, Kubernetes, GPU drivers, topology โ are covered in their own phases of the curriculum.
Where to next
- Phase 1: AI for Network Engineers (start the curriculum)
- About: who built this