The Field Guide
Transport + Congestion Control. The wire and the brakes of an AI fabric.
This guide covers the two layers that make an AI fabric an AI fabric: transport (how bytes move between GPUs) and congestion control (what happens when too many bytes try to move at once). Everything else in the stack (host networking, Kubernetes, GPU drivers, topology) gets covered in its own phase of the curriculum.
Transports & Congestion Control One-Pager: the dense reference card behind this page. Print it, pin it to your wall.
1. Transport
What is transport?
Transport is whatever sits on top of IP and gets your bytes from one host to another. In your day job that's TCP and UDP. In an AI fabric it's something different, but the role is the same: framing, reliability, ordering, flow control, multipath, encryption, and the API the application uses to push bytes through.
In an AI training cluster, the transport carries gradients (the math output of one GPU that the others need) between GPU NICs. Every training step is a synchronized burst across thousands of GPUs. The transport decides how fast they sync, and how much CPU it costs to do it.
What you already know
You've been running this every day:
- L4 protocols: TCP (reliable, ordered, connection-oriented) and UDP (best-effort, fire-and-forget)
- Sockets API: socket(), connect(), send(), recv(). Every app talks to the kernel; the kernel talks to the NIC
- TCP mechanics: three-way handshake, sliding window, SACK, fast retransmit, MSS / MTU, segmentation
- Congestion control: slow start, Reno, Cubic, BBR. Backs off when the network drops or marks packets
- QoS tooling: DSCP, CoS, queue-per-priority, RED/WRED, ECN marking at congested egress
The piece you've probably never dealt with before: kernel bypass. Your stack today says NIC → driver → kernel → socket buffer → user space. Every hop is a tax: context switch, copy, queue. At 1 Gbps nobody cared. At 400 Gbps, with millions of packets per second, every microsecond of CPU on the wire side is a microsecond the GPU is sitting idle waiting for gradients.
So AI fabrics use RDMA (Remote Direct Memory Access). Think of it as the DMA you already know from the PCIe bus, but across the wire. The NIC reads and writes the remote host's memory directly; the OS isn't in the path. A new API called verbs replaces sockets: you post a Work Request to a Queue Pair, the NIC does the rest, and you get a completion event when it's done. No syscall per packet.
A clean analogy: sockets are to verbs what BGP network statements are to a route policy. Sockets are imperative ("send these bytes through the kernel"); verbs are declarative ("here's a memory region, here's a queue, work it out").
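To make that concrete, here is roughly what "post a Work Request to a Queue Pair" looks like with the standard libibverbs API. This is a minimal sketch, not production code: it assumes the queue pair, completion queue, and registered memory region already exist (that setup is the bulk of any real RDMA program), and the function name rdma_write_once is made up for illustration.

```c
/* Minimal verbs sketch: post one RDMA WRITE work request, then wait for its
 * completion.  Assumes qp, cq, and local_mr were created/connected earlier.
 * Compile with: gcc -c ... -libverbs */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int rdma_write_once(struct ibv_qp *qp, struct ibv_cq *cq,
                    struct ibv_mr *local_mr, void *local_buf, uint32_t len,
                    uint64_t remote_addr, uint32_t rkey)
{
    /* Scatter/gather element: which local memory the NIC should read from. */
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = len,
        .lkey   = local_mr->lkey,
    };

    /* The work request: "write this buffer into remote_addr on the peer". */
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.wr_id               = 1;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.send_flags          = IBV_SEND_SIGNALED;     /* ask for a completion */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    /* Hand the descriptor to the NIC -- no syscall, no copy into the kernel. */
    if (ibv_post_send(qp, &wr, &bad_wr))
        return -1;

    /* Poll the completion queue until the NIC reports the write is done.
     * Production code would typically use a completion channel instead. */
    struct ibv_wc wc;
    int n;
    while ((n = ibv_poll_cq(cq, 1, &wc)) == 0)
        ;
    return (n < 0 || wc.status != IBV_WC_SUCCESS) ? -1 : 0;
}
```

The shape is the point: the application describes work, the NIC executes it, and the result comes back through the completion queue with no kernel involvement on the data path.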
How transport evolved: 50 years on one axis
Era-by-era, in one line:
- 1973: TCP/IP is designed for unreliable WANs. Reliability over latency. The world adopts it.
- 1980: UDP (RFC 768), connectionless L4 for cases where TCP's overhead is too much.
- 1990s: HPC outgrows TCP. Specialized fabrics (Myrinet, Quadrics) appear, optimized for deterministic low latency.
- 1999: InfiniBand is born. The IBTA (Compaq, Dell, HP, IBM, Intel, Microsoft, Sun) defines RDMA and the verbs API. Mellanox (founded the same year) becomes the dominant silicon vendor as IB ships.
- 2000: SCTP (RFC 2960, later revised as RFC 4960), multi-stream, multi-home; finds its home in telecom (SIGTRAN, 5G N2).
- 2001: ECN (RFC 3168) is standardized. The network can mark, not just drop.
- 2007: iWARP RFCs (5040–5045), RDMA over TCP. Doesn't need lossless Ethernet but loses to RoCE on performance.
- 2010: RoCE v1 brings RDMA to Ethernet, L2-only. Pairs with DCB (PFC + ETS + DCBX) to fake a lossless fabric on commodity Ethernet.
- 2012: QUIC is invented at Google. User-space, multiplexed, UDP-based. Standardized as RFC 9000 in 2021; carries HTTP/3.
- 2013: MPTCP (RFC 6824), TCP across multiple paths.
- 2014: RoCE v2 wraps RDMA in UDP/IP. Routable across L3, works with ECMP, BGP underlays, Clos fabrics. The "real" RoCE.
- 2015: DCQCN paper (Microsoft Research, SIGCOMM). ECN-driven, NIC-level rate control. Proves RoCE at hyperscale. RoCE moves from research to production.
- 2018: AWS EFA / SRD ships. Out-of-order delivery, packet spraying, libfabric. The first major hyperscaler-custom transport.
- 2019: NVIDIA buys Mellanox (~$6.9B). GPUs + NICs + switches under one vendor. Vertically integrated AI fabric begins.
- 2020: Alibaba eRDMA, RDMA for cloud tenants on Alibaba Cloud ECS.
- 2023: NVIDIA Spectrum-X announced at COMPUTEX. Vertically integrated AI-Ethernet stack.
- 2023: Ultra Ethernet Consortium (UEC) forms: AMD, Arista, Broadcom, Cisco, HPE, Intel, Meta, Microsoft, and ~50 others. Without NVIDIA. Goal: AI Ethernet that doesn't depend on PFC.
- 2023: Google Falcon unveiled at OCP Summit. HW transport, multi-ULP (RDMA + NVMe). Runs on the Intel E2100 IPU.
- 2024: MRC (Multipath Reliable Connection) announced: OpenAI + AMD + Microsoft + NVIDIA + Broadcom + Intel via OCP. Evolution of RoCE v2 with packet spraying, μs failover, and verbs compatibility.
- 2025: UEC 1.0 ships. Packet spraying, multipath, selective retransmission, an AI-tuned transport designed to operate without PFC.
The arc:
The network is no longer just packet-forwarding infrastructure; it is part of the distributed compute system itself.
The four families
The transport landscape today sits in four distinct buckets. Each solves a different problem, and they coexist:
| # | Family | Solves |
|---|---|---|
| 1 | Classic IP transports | General networking โ internet, applications, control plane |
| 2 | RDMA transports (traditional) | Kernel-bypass for HPC and traditional AI clusters |
| 3 | AI / hyperscaler custom transports | RDMA-class transport at 100K+ GPU scale (multipath, no PFC dependence) |
| 4 | Scale-up interconnects | Intra-server / intra-rack GPU-to-GPU communication (different domain) |
Family 1: Classic IP transports (Layer 4)
You know these. Listed for completeness.
| Protocol | Owner / Std | Reliable | Ordered | Multipath | Encryption | Key trait | Used by / for |
|---|---|---|---|---|---|---|---|
| TCP | IETF | Yes | Yes | No | External (TLS) | Byte-stream, AIMD CC, HoL blocking | Web, SSH, SMTP; universal |
| UDP | IETF | No | No | No | External (DTLS) | Connectionless, low overhead | DNS, DHCP, VoIP, gaming, QUIC base |
| SCTP | IETF | Yes | Per-stream | Multi-home failover | External (DTLS) | Multi-streaming + multi-homing | Telecom: SS7/SIGTRAN, Diameter, 5G N2 |
| DCCP | IETF | No | No | No | No | Unreliable + congestion control | Mostly research / abandoned |
| QUIC | IETF (Google origin) | Yes | Per-stream | Connection migration | Built-in (TLS 1.3) | 0/1-RTT setup, user-space | HTTP/3: Google, Cloudflare, Meta, Apple |
| MPTCP | IETF | Yes | Yes | Yes (subflows) | External | TCP across multiple paths | Apple Siri/iOS, Samsung, Linux |
| UDP-Lite | IETF | No | No | No | External | Partial checksum | Loss-tolerant codecs |
QUIC is the standout here for an AI-adjacent network engineer: it runs fully in user space, ships with TLS 1.3 built in, and can migrate a connection across paths. You'll meet it on the inference and edge path.
Family 2: RDMA transports (traditional)
The transports that built HPC and early AI. All share the IBTA verbs API, all are kernel-bypass.
| Protocol | Owner / Std | Substrate | Lossless required | Multipath | Encryption | Key trait | Used by / for |
|---|---|---|---|---|---|---|---|
| InfiniBand | NVIDIA / IBTA | IB fabric | Yes (credit-based FC) | RD mode only | Optional | Native RDMA, sub-μs latency | DGX SuperPOD, TOP500 HPC, Meta RSC |
| RoCE v1 | IBTA (open) | Ethernet L2 | Yes (PFC) | Limited | Optional | IB transport over Ethernet, non-routable | Same-subnet RDMA |
| RoCE v2 | IBTA (open) | UDP/IP | Yes (PFC) | Limited (ECMP) | Optional | IB transport over UDP, routable | Azure, Meta, Tencent, ByteDance, Baidu |
| iWARP | IETF (open) | TCP/IP | No | Via TCP | Optional | RDMA over TCP, no PFC needed | Intel E810, Chelsio (niche) |
| IB Verbs modes | IBTA | – | – | – | – | RC, RD, UC, UD transport modes | RC dominant in production |
Key takeaway: RDMA at scale (RoCE v2) needs a lossless underlay (PFC) and doesn't multipath naturally. Both of those constraints break at 100K+ GPU scale, which is why the next family exists.
Family 3: AI / hyperscaler custom transports (the new generation)
Each major hyperscaler hit RoCEv2's ceiling and built their own transport. The common pattern: packet spraying for multipath, built-in encryption, out-of-order delivery with hardware reassembly, and microsecond failover when a link or switch breaks.
| Protocol | Owner | Substrate | Lossless? | Multipath | Encryption | Key trait | Used by / for |
|---|---|---|---|---|---|---|---|
| MRC (Multipath Reliable Conn.) | OpenAI + AMD/MS/NV/Broadcom/Intel via OCP | Ethernet/IP | No | Yes (packet spray) | Built-in | Evolution of RoCE v2; μs failover; verbs-compat | OpenAI training, Microsoft Fairwater, Oracle Abilene |
| Falcon | Google via OCP | Ethernet/IP | No | Yes (PLB) | Built-in (PSP/IPsec) | HW transport, multi-ULP (RDMA + NVMe) | Google Cloud, Intel E2100 IPU |
| SRD (Scalable Reliable Datagram) | AWS | Ethernet/IP | No | Yes (packet spray) | Built-in | Out-of-order delivery, hw-offloaded, libfabric API | AWS EFA: EC2 P5, Trn1/Trn2, HPC |
| UET (Ultra Ethernet Transport) | UEC consortium (open) | Ethernet/IP | No | Yes (packet spray) | Built-in | Open standard; ~75% from HPE Slingshot; libfabric 2.0 | Industry target: 1M+ endpoint scale |
| Pony Express | Google (legacy) | Ethernet/IP | No | Limited | Optional | SW-only predecessor to Falcon; ran in Snap microkernel | Older Google datacenter (superseded) |
| eRDMA | Alibaba | VPC/Ethernet | No | Yes | Optional | RDMA for cloud tenants | Alibaba Cloud ECS |
The pattern is the same everywhere: drop the PFC dependency, spray packets across all available paths, hardware-offload reassembly. UET is the open-standard convergence target; expect MRC and Falcon ideas to fold in over time.
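As a rough illustration of that pattern, here is a toy, self-contained sketch of packet spraying and receiver-side reassembly. Nothing here touches a real NIC or any of the transports above; the packet struct, the round-robin spray, and the reassembly bitmap are simplified stand-ins for what SRD, Falcon, MRC, and UET do in hardware.

```c
/* Toy simulation of packet spraying: every packet of a message takes the next
 * path round-robin, arrives out of order, and is stitched back together at
 * the receiver by sequence number. */
#include <stdio.h>
#include <stdlib.h>

#define NUM_PATHS 4
#define NUM_PKTS  16

struct packet {
    int seq;    /* position within the message */
    int path;   /* which fabric path carried it */
};

int main(void)
{
    struct packet wire[NUM_PKTS];

    /* Sender: spray across paths round-robin (no per-flow pinning, unlike
     * classic ECMP, which hashes a whole flow onto one path). */
    for (int seq = 0; seq < NUM_PKTS; seq++) {
        wire[seq].seq  = seq;
        wire[seq].path = seq % NUM_PATHS;
    }

    /* "Network": shuffle arrival order to mimic unequal path latencies. */
    srand(42);
    for (int i = NUM_PKTS - 1; i > 0; i--) {
        int j = rand() % (i + 1);
        struct packet tmp = wire[i]; wire[i] = wire[j]; wire[j] = tmp;
    }

    /* Receiver: accept out-of-order arrivals, reassemble by sequence number. */
    int received[NUM_PKTS] = {0};
    for (int i = 0; i < NUM_PKTS; i++) {
        received[wire[i].seq] = 1;
        printf("arrived seq=%2d via path %d\n", wire[i].seq, wire[i].path);
    }

    /* The message is complete once every sequence number has been seen. */
    int complete = 1;
    for (int i = 0; i < NUM_PKTS; i++)
        complete &= received[i];
    printf("message %s\n", complete ? "reassembled" : "incomplete");
    return 0;
}
```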
Family 4: Scale-up interconnects
Scale-up is intra-server / intra-rack. Scale-out (everything above) is inter-server. Scale-up connects GPUs inside one logical box; scale-out connects boxes to other boxes. They are disjoint problems with different bandwidths (TB/s vs Gbps), different latencies (ns vs μs), and different protocols.
| Protocol | Owner | Domain | Key trait | Used by / for |
|---|---|---|---|---|
| NVLink / NVSwitch / NVLink Fabric | NVIDIA (proprietary) | Scale-up GPU | Up to 1.8 TB/s per GPU; sub-μs | DGX, GB200 NVL72, HGX |
| UALink | AMD / Broadcom / Cisco / Google / HPE / Intel / Meta / MS (open) | Scale-up GPU | Open NVLink alternative; v1.0 in 2025 | Future open AI servers |
| SUE (Scale-Up Ethernet) | Broadcom | Scale-up GPU | Simpler than UET; ≤1.6 Tbps, ~100 ns device latency | Broadcom AI silicon |
| ICI (Inter-Chip Interconnect) | Google | Scale-up TPU | Native TPU pod fabric | TPU v4 / v5p / Trillium pods |
| Slingshot / Portals 4 | HPE (Cray) | HPC scale-out | Adaptive routing; UET 1.0 lineage (~75%) | Frontier, El Capitan, Aurora, leadership HPC |
| OmniPath | Cornelis Networks (ex-Intel) | HPC scale-out | InfiniBand-style fabric | Some HPC sites |
| RDS (Reliable Datagram Sockets) | Oracle | Cluster IPC | Reliable datagrams over IB / RoCE / TCP | Oracle RAC interconnect |
| TIPC | Ericsson / Linux | Cluster IPC | Topology-aware cluster messaging | Telecom clusters |
GB200 NVL72 puts 72 GPUs on one NVLink Switch fabric; that's one logical machine over scale-up. RDMA only takes over at the rack boundary.
Mental model: the 6-point synthesis
- Classic IP transports cover the internet. TCP, UDP, QUIC. Universal but kernel-bound.
- RDMA family (IB, RoCE v2, iWARP) covers traditional HPC/AI, but needs a lossless fabric (PFC) and doesn't multipath well.
- Each hyperscaler built a custom multipath transport because RoCEv2 doesn't scale to 100K+ GPUs: Google → Falcon, AWS → SRD, OpenAI/Microsoft/NVIDIA/AMD → MRC, Alibaba → eRDMA.
- UET is the open-standard convergence target. Expect MRC and Falcon ideas to fold into it over time.
- Scale-up (NVLink / UALink / SUE / ICI) is intra-server and disjoint from scale-out transports. Different domain, different physics, different protocols.
- Congestion control matters as much as the transport. Most modern AI fabrics combine packet spraying + delay-based CC + ECN/INT signals + microsecond failover.
Who built what
| Tech | Owner / Standards body | When |
|---|---|---|
| TCP/IP | IETF (DARPA) | 1973–1980s |
| UDP | IETF | 1980 |
| InfiniBand spec | IBTA consortium | 1999 |
| Mellanox IB silicon | Mellanox Technologies (Israel) | 1999; acquired by NVIDIA 2019 |
| RDMA Verbs API | IBTA / OpenFabrics Alliance | 2000s |
| SCTP | IETF | 2000 |
| RoCE v1 | IBTA | 2010 |
| QUIC | Google → IETF | 2012 / RFC 9000 in 2021 |
| MPTCP | IETF | 2013 |
| RoCE v2 | IBTA | 2014 |
| DCQCN | Microsoft Research | SIGCOMM 2015 |
| Pony Express | Google (legacy) | ~2014–2023 |
| AWS EFA / SRD | AWS | 2018+ |
| eRDMA | Alibaba | 2020 |
| Spectrum-X | NVIDIA | 2023 |
| Ultra Ethernet Consortium | AMD, Arista, Broadcom, Cisco, HPE, Intel, Meta, Microsoft, +50 others | Founded 2023, spec 1.0 in 2025 |
| Falcon | Google + Intel (E2100 IPU) | 2023 |
| MRC | OpenAI + AMD + Microsoft + NVIDIA + Broadcom + Intel (OCP) | 2024 |
| UALink Consortium | AMD, Broadcom, Cisco, Google, HPE, Intel, Meta, Microsoft | 2024 / v1.0 in 2025 |
| SUE (Scale-Up Ethernet) | Broadcom | 2024 |
| ICI | Google | TPU v4 era (~2018) |
| Slingshot | HPE (Cray) | 2019 |
2. Congestion Control
What is congestion control?
Congestion control is what the network does when more traffic is offered than a link can carry. It's not "the link is full"; it's "the link is about to be full, and we need to decide who slows down before packets start dropping or queues blow up."
In an AI fabric this matters because one congested link can stall a synchronized collective (AllReduce, AllGather) across the entire job. A 0.1% packet loss rate is fine for TCP. It's a 10× throughput hit for RDMA.
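A toy back-of-the-envelope model of why, assuming the NIC recovers losses with a go-back-N scheme (as early RoCE NICs did): one lost packet forces a replay of roughly half the outstanding window, so even a tiny loss rate burns a large share of the wire. The numbers below are illustrative only.

```c
/* Toy comparison: retransmission overhead at 0.1% loss for selective repeat
 * (TCP + SACK) versus go-back-N recovery.  Not a real throughput model. */
#include <stdio.h>

int main(void)
{
    double loss   = 0.001;    /* 0.1% packet loss rate */
    double window = 1000.0;   /* packets in flight when a loss is detected */

    /* TCP with SACK: roughly only the lost packet is resent. */
    double tcp_retx = loss;

    /* Go-back-N: each loss replays, on average, about half the window. */
    double gbn_retx = loss * window / 2.0;

    printf("TCP + SACK : ~%.1f%% of packets are retransmissions\n", tcp_retx * 100);
    printf("go-back-N  : ~%.0f%% of packets are retransmissions\n", gbn_retx * 100);
    return 0;
}
```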
What you already know
You've shipped this in production:
- TCP congestion control: slow start, congestion avoidance, fast retransmit; Reno → Cubic → BBR
- ECN (RFC 3168): switch marks the IP header at congested egress; receiver echoes back; sender slows down
- RED / WRED: random early drop; the switch starts dropping before the queue is full
- PFC (IEEE 802.1Qbb): link-level pause frames per priority class; backpressure instead of drop
- QoS toolkit: DSCP, CoS, queue scheduling, buffer profiles, headroom, watermarks
An AI fabric uses these same primitives, but the closed-loop algorithm running on the NIC is different. Where TCP backs off after a packet drops, an RDMA fabric uses ECN to warn the sender before drops happen, and the NIC adjusts its send rate proactively (a sketch of that loop follows the list below). PFC is the safety net underneath.
A mental model:
- PFC = backpressure (pull the emergency brake: stop sending on this priority right now)
- ECN = early warning (the dashboard light flashes: slow down, congestion ahead)
- CC algorithm = the driver's foot: uses ECN signals to dial back the rate before PFC has to fire
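A stripped-down sketch of that loop, loosely modeled on DCQCN (covered below): the receiver turns ECN marks into congestion notification packets (CNPs), the sending NIC cuts its rate multiplicatively when CNPs arrive, then climbs back while the path stays quiet. The constants and the two-function structure are simplifications for illustration; the real algorithm (alpha averaging, fast recovery, hyper increase) lives in NIC firmware.

```c
/* DCQCN-flavoured rate control, heavily simplified. */
#include <stdio.h>

#define LINE_RATE_GBPS 400.0

struct cc_state {
    double rate;    /* current send rate, Gbps */
    double alpha;   /* running estimate of congestion severity */
};

/* CNP received: bump the congestion estimate, cut the rate multiplicatively. */
static void on_cnp(struct cc_state *s)
{
    s->alpha = 0.9 * s->alpha + 0.1;
    s->rate  = s->rate * (1.0 - s->alpha / 2.0);
}

/* Quiet timer tick (no CNP): decay the estimate, increase additively. */
static void on_quiet_period(struct cc_state *s)
{
    s->alpha = 0.9 * s->alpha;
    s->rate += 5.0;                       /* additive step, Gbps */
    if (s->rate > LINE_RATE_GBPS)
        s->rate = LINE_RATE_GBPS;
}

int main(void)
{
    struct cc_state s = { .rate = LINE_RATE_GBPS, .alpha = 1.0 };

    /* Simulate a congestion burst (CNPs arriving), then a quiet recovery. */
    for (int t = 0; t < 5; t++)  { on_cnp(&s);          printf("t=%2d CNP   rate=%6.1f Gbps\n", t, s.rate); }
    for (int t = 5; t < 15; t++) { on_quiet_period(&s); printf("t=%2d quiet rate=%6.1f Gbps\n", t, s.rate); }
    return 0;
}
```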
How congestion control evolved
Era-by-era, in one line:
- 1988: TCP slow start + congestion avoidance (Berkeley / IETF). The field is born.
- 2001: ECN standardized (RFC 3168). The network can mark, not just drop.
- 2006: CUBIC becomes the Linux default. Cubic growth function; fills bandwidth aggressively.
- 2010: DCTCP (Microsoft / Stanford, SIGCOMM). Fine-grained ECN-based CC for datacenter TCP. Proves precise ECN works.
- 2015: DCQCN (Microsoft Research, SIGCOMM). DCTCP's ideas applied to RoCE. Canonical RDMA CC.
- 2015: TIMELY (Google, SIGCOMM). RTT-gradient-based. Delay, not ECN, as the signal.
- 2016: BBR (Google). Model-based; fills the bottleneck without filling buffers. Now in YouTube, QUIC, the Linux kernel.
- 2017: NDP / Trim (Cambridge research). Switch trims payload on congestion instead of dropping.
- 2018: AWS SRD CC ships with EFA. Path-level feedback; sub-10 ms RTO.
- 2018: Homa (Stanford research). Receiver-driven, priorities + grants; eliminates HoL via message-oriented design.
- 2019: HPCC (Alibaba, SIGCOMM). Uses In-band Network Telemetry (INT) for precise per-link telemetry.
- 2020: Swift (Google, SIGCOMM). Successor to TIMELY. Decomposes host vs fabric latency. Becomes Falcon's CC core.
- 2022: PowerTCP (NSDI). Bandwidth × queue depth combined signal.
- 2023: Spectrum-X CC (NVIDIA). Switch + NIC co-designed. Closed system.
- 2023: Falcon CC = Swift + CSIG + Carousel (Google + Intel E2100). HW per-flow shaping, multipath PLB.
- 2024: MRC CC (OpenAI / Microsoft / NVIDIA / AMD / Broadcom / Intel). Programmable CC + μs rerouting.
- 2025: UET CC (Ultra Ethernet Consortium). Two-sided CC for a packet-sprayed environment.
The arc:
Congestion control moved from a reactive software algorithm (TCP) to a proactive, hardware-co-designed, fabric-wide control loop (DCQCN, HPCC, UET).
CC algorithm families
TCP family: what runs the internet
| Algorithm | Family | Signal | Key trait | Where used |
|---|---|---|---|---|
| Reno / NewReno | TCP | Loss | Classic AIMD; baseline | Legacy TCP everywhere |
| CUBIC | TCP | Loss | Cubic growth; default in Linux/Windows | Most internet TCP today |
| Vegas / Westwood | TCP | Delay / bw-est | Delay-based; low queueing | Niche / research |
| Compound TCP | TCP | Loss + delay | Hybrid | Older Windows |
| BBR v1 / v2 / v3 | TCP / QUIC | Bandwidth + RTT | Model-based; fills bottleneck without filling buffers | Google services, YouTube, QUIC, Linux kernel |
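For reference, the "classic AIMD" behavior in the Reno row above boils down to a few lines: additive increase of the congestion window each RTT, multiplicative decrease on loss. A toy sketch, with slow start, fast retransmit, and every real-world detail omitted:

```c
/* Classic AIMD in miniature: +1 segment per RTT on success, halve on loss. */
#include <stdio.h>

int main(void)
{
    double cwnd = 10.0;                 /* congestion window, in segments */

    for (int rtt = 1; rtt <= 20; rtt++) {
        int loss = (rtt % 8 == 0);      /* pretend a drop every 8th RTT */
        if (loss)
            cwnd = cwnd / 2.0;          /* multiplicative decrease */
        else
            cwnd += 1.0;                /* additive increase, +1 MSS per RTT */
        printf("rtt=%2d %s cwnd=%5.1f\n", rtt, loss ? "loss" : "ack ", cwnd);
    }
    return 0;
}
```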
Datacenter & RDMA family: built for tight datacenter networks
| Algorithm | Family | Signal | Key trait | Where used |
|---|---|---|---|---|
| DCTCP | DC TCP | ECN marking | Proportional reaction to ECN; small queues | Microsoft, Linux DC TCP stacks |
| DCQCN | RoCE v2 | ECN + PFC | Default RoCE v2 CC; rate-based | Azure, Meta, Tencent; most RoCE clusters |
| TIMELY | RDMA | Delay (RTT gradient) | Delay-based, CPU-light | Google early RDMA |
| Swift | RDMA / Falcon | Delay (NIC RTT) | Decomposes host vs fabric latency; basis for Falcon | Google Falcon |
| HPCC | RDMA | In-band telemetry (INT) | Precise rate using switch INT data | Alibaba |
| PowerTCP | DC TCP / RDMA | Power = bw ร queue | Combines bandwidth and queue depth signals | Research / select DCs |
AI / hyperscaler custom CC: the new generation
| Algorithm | Family | Signal | Key trait | Where used |
|---|---|---|---|---|
| MRC CC | MRC | Multipath telemetry | Programmable CC + microsecond rerouting | OpenAI / Microsoft Fairwater / Oracle Abilene |
| Falcon CC (Swift + CSIG + Carousel) | Falcon | Delay + congestion sig. | HW per-flow shaping, multipath PLB | Google + Intel E2100 |
| SRD CC | SRD | Path-level feedback | Avoids overloaded paths; <10 ms RTO | AWS EFA |
| UET CC | UET | Sender + receiver based | Two-sided CC for packet-sprayed environment | Ultra Ethernet 1.0 |
| Spectrum-X CC | Spectrum-X | Switch + NIC telemetry | Switch + NIC co-designed | NVIDIA Spectrum-X |
Research / exotic
| Algorithm | Family | Signal | Key trait | Where used |
|---|---|---|---|---|
| Homa | Receiver-driven | Priorities + grants | Eliminates HoL via priorities; message-oriented | Stanford research, influential |
| NDP / Trim | Switch-assisted | Header trimming | Switch trims payload on congestion; no whole-packet drop | Cambridge research |
| ExpressPass | Credit-based | Receiver credits | Receiver paces with credit packets | Research |
| EQDS | Edge-queued | Edge-based shaping | Pushes queues to edges, not core | Cambridge / UCL research |
Mental model
Most modern AI fabrics combine four ideas: packet spraying + delay-based CC + ECN/INT signals + microsecond failover.
Pick any production AI-fabric CC algorithm โ MRC, Falcon, SRD, UET CC, Spectrum-X CC โ and you'll find some combination of these four. The PFC-only era is ending.
Who built what
| Tech | Owner / Standards body | When |
|---|---|---|
| TCP Reno / NewReno | Berkeley / IETF | 1988 |
| ECN (RFC 3168) | IETF | 2001 |
| CUBIC | NCSU โ Linux | 2006 |
| DCTCP | Microsoft / Stanford | SIGCOMM 2010 |
| DCQCN | Microsoft Research | SIGCOMM 2015 |
| TIMELY | Google | SIGCOMM 2015 |
| BBR | Google | 2016 |
| NDP / Trim | Cambridge | 2017 |
| AWS SRD CC | AWS | 2018 |
| Homa | Stanford | 2018 |
| HPCC | Alibaba | SIGCOMM 2019 |
| Swift | Google | SIGCOMM 2020 |
| PowerTCP | Research consortium | NSDI 2022 |
| Spectrum-X CC | NVIDIA | 2023 |
| Falcon CC | Google + Intel | 2023 |
| MRC CC | OpenAI + Microsoft + NVIDIA + AMD + Broadcom + Intel | 2024 |
| UET CC | Ultra Ethernet Consortium | 2024โ2025 |
Transports & Congestion Control One-Pager: same content, denser, single-sheet print format
What this curriculum walks through
Of all the options above, the course teaches:
- Transport: RoCEv2
- Congestion control: DCQCN + PFC/ECN
Why this combination: it's the most-deployed RDMA-on-Ethernet pattern in 2026, vendor-neutral, well-documented in public standards (IBTA, IEEE), and what you'll see at most hyperscalers running training on Ethernet today.
If your stack differs (you run InfiniBand, UEC, MRC, Falcon, SRD, or another custom transport), the same protocol vocabulary still applies โ you swap the implementation, not the concepts.
The other layers of the AI fabric โ host networking, Kubernetes, GPU drivers, topology โ are covered in their own phases of the curriculum.
Where to next
- Phase 1: AI for Network Engineers (start the curriculum)
- About: who built this