The Field Guide

Transport + Congestion Control. The wire and the brakes of an AI fabric.

This guide covers the two layers that make an AI fabric an AI fabric: transport (how bytes move between GPUs) and congestion control (what happens when too many bytes try to move at once). Everything else in the stack – host networking, Kubernetes, GPU drivers, topology – gets covered in its own phase of the curriculum.

📄 Transports & Congestion Control – One-Pager – the dense reference card behind this page. Print it, pin it to your wall.


1. Transport

What is transport?

Transport is whatever sits on top of IP and gets your bytes from one host to another. In your day job that's TCP and UDP. In an AI fabric it's something different – but the role is the same: framing, reliability, ordering, flow control, multipath, encryption, and the API the application uses to push bytes through.

In an AI training cluster, the transport carries gradients (the math output of one GPU that the others need) between GPU NICs. Every training step is a synchronized burst across thousands of GPUs. The transport decides how fast they sync – and how much CPU it costs to do it.

What you already know

You've been running this every day:

  • L4 protocols – TCP (reliable, ordered, connection-oriented) and UDP (best-effort, fire-and-forget)
  • Sockets API – socket(), connect(), send(), recv() – every app talks to the kernel; the kernel talks to the NIC
  • TCP mechanics – three-way handshake, sliding window, SACK, fast retransmit, MSS / MTU, segmentation
  • Congestion control – slow start, Reno, Cubic, BBR. Backs off when the network drops or marks packets
  • QoS tooling – DSCP, CoS, queue-per-priority, RED/WRED, ECN marking at congested egress

The piece you've probably never dealt with before: kernel bypass. Your stack today says NIC → driver → kernel → socket buffer → user space. Every hop is a tax – context switch, copy, queue. At 1 Gbps nobody cared. At 400 Gbps, with millions of packets per second, every microsecond of CPU on the wire side is a microsecond the GPU sits idle waiting for gradients.
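A quick back-of-envelope makes that concrete, assuming 1,500-byte packets (AI fabrics often run 4 KB or 9 KB MTUs, which lowers the rate but not the conclusion):

```latex
\frac{400 \times 10^{9}\ \text{bit/s}}{1500\ \text{B} \times 8\ \text{bit/B}}
  \approx 33\ \text{Mpps}
\qquad\Rightarrow\qquad
\frac{1}{33 \times 10^{6}\ \text{/s}} \approx 30\ \text{ns per packet}
```

A syscall plus a buffer copy costs on the order of a microsecond – dozens of per-packet budgets gone on a single packet.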

So AI fabric uses RDMA – Remote Direct Memory Access. Think of it as the DMA you already know from the PCIe bus, but across the wire. The NIC reads and writes the remote host's memory directly. The OS isn't in the path. A new API called verbs replaces sockets – you post a Work Request to a Queue Pair, the NIC does the rest, and you get a completion event when it's done. No syscall per packet.

A clean analogy: sockets are to verbs what BGP network statements are to a route policy. Sockets are imperative ("send these bytes through the kernel"); verbs are descriptive ("here's a memory region, here's a queue, work it out").
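A minimal sketch of that flow in C against libibverbs. It assumes the queue pair is already created and connected (the QPN/GID exchange between peers, usually done over TCP or RDMA CM, is omitted), and `remote_addr` / `rkey` are placeholders learned out of band:

```c
#include <infiniband/verbs.h>
#include <stdint.h>

/* Post one RDMA WRITE and spin for its completion.
 * `qp` is an RC queue pair already in the RTS state; `mr` is the
 * local buffer registered with ibv_reg_mr(). */
int rdma_write_once(struct ibv_qp *qp, struct ibv_cq *cq,
                    struct ibv_mr *mr, void *buf, uint32_t len,
                    uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge;
    sge.addr   = (uintptr_t)buf;        /* registered local memory   */
    sge.length = len;
    sge.lkey   = mr->lkey;              /* key from registration     */

    struct ibv_send_wr wr = {0}, *bad = NULL;
    wr.wr_id      = 1;
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.opcode     = IBV_WR_RDMA_WRITE;  /* remote CPU never runs     */
    wr.send_flags = IBV_SEND_SIGNALED;  /* ask for a completion      */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    /* "Post a Work Request to a Queue Pair": a library call into a
     * memory-mapped NIC queue -- no per-packet syscall. */
    if (ibv_post_send(qp, &wr, &bad))
        return -1;

    /* "Get a completion event when it's done": poll the CQ.
     * Production code may block on a completion channel instead. */
    struct ibv_wc wc;
    int n;
    while ((n = ibv_poll_cq(cq, 1, &wc)) == 0)
        ;                               /* spin -- kernel stays out  */
    return (n == 1 && wc.status == IBV_WC_SUCCESS) ? 0 : -1;
}
```

Compare the shape with sockets: there is no send() loop and no copy into a kernel buffer – the NIC pulls directly from the registered region and writes directly into the peer's.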

How transport evolved – 50 years on one axis

[Timeline: transport evolution, 1973–2025 – TCP/IP (1973), UDP (1980), Myrinet/Quadrics (1995–2010), InfiniBand (1999), SCTP (2000), iWARP (2007), RoCE v1 (2010), QUIC (2012), MPTCP (2013), RoCE v2 (2014), Pony Express (2014–2023), AWS EFA/SRD (2018), eRDMA (2020), Spectrum-X / UEC / Falcon (2023), MRC (2024), UEC 1.0 (2025). Legend: active spec / standard · vendor / proprietary · superseded.]

Era-by-era, in one line:

  • 1973 – TCP/IP is designed for unreliable WANs. Reliability over latency. The world adopts it.
  • 1980 – UDP (RFC 768) – connectionless L4 for cases where TCP's overhead is too much.
  • 1990s – HPC outgrows TCP. Specialized fabrics (Myrinet, Quadrics) appear, optimized for deterministic low latency.
  • 1999 – InfiniBand is born. The IBTA (Compaq, Dell, HP, IBM, Intel, Microsoft, Sun) defines RDMA and the verbs API. Mellanox (founded the same year) becomes the dominant silicon vendor as IB ships.
  • 2000 – SCTP (RFC 2960) – multi-stream, multi-homed; finds its home in telecom (SIGTRAN, 5G N2).
  • 2001 – ECN (RFC 3168) is standardized. The network can mark, not just drop.
  • 2007 – iWARP RFCs (5040–5045) – RDMA over TCP. Doesn't need lossless Ethernet but loses to RoCE on performance.
  • 2010 – RoCE v1 brings RDMA to Ethernet, L2-only. Pairs with DCB (PFC + ETS + DCBX) to fake a lossless fabric on commodity Ethernet.
  • 2012 – QUIC is invented at Google. User-space, multiplexed, UDP-based. Becomes HTTP/3 (RFC 9000 in 2021).
  • 2013 – MPTCP (RFC 6824) – TCP across multiple paths.
  • 2014 – RoCE v2 wraps RDMA in UDP/IP. Routable across L3, works with ECMP, BGP underlays, Clos fabrics. The "real" RoCE (wire format sketched after this list).
  • 2015 – DCQCN paper (Microsoft Research, SIGCOMM). ECN-driven, NIC-level rate control. Proves RoCE at hyperscale; RoCE moves from research to production.
  • 2018 – AWS EFA / SRD ships. Out-of-order delivery, packet spraying, libfabric. The first major hyperscaler-custom transport.
  • 2019 – NVIDIA buys Mellanox (~$6.9B). GPUs + NICs + switches under one vendor. The vertically integrated AI fabric begins.
  • 2020 – Alibaba eRDMA – RDMA for cloud tenants on Alibaba Cloud ECS.
  • 2023 – NVIDIA Spectrum-X announced at COMPUTEX. Vertically integrated AI-Ethernet stack.
  • 2023 – Ultra Ethernet Consortium (UEC) forms – AMD, Arista, Broadcom, Cisco, HPE, Intel, Meta, Microsoft, and ~50 others. Without NVIDIA. Goal: AI Ethernet that doesn't depend on PFC.
  • 2023 – Google Falcon unveiled at OCP Summit. Hardware transport, multi-ULP (RDMA + NVMe). Runs on the Intel E2100 IPU.
  • 2024 – MRC (Multipath Reliable Connection) announced – OpenAI + AMD + Microsoft + NVIDIA + Broadcom + Intel via OCP. An evolution of RoCE v2 with packet spraying, μs failover, and verbs compatibility.
  • 2025 – UEC 1.0 ships. Packet spraying, multipath, selective retransmission – an AI-tuned transport designed to operate without PFC.
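The wire format behind the 2014 bullet – what "wraps RDMA in UDP/IP" actually looks like, as a comment-only C sketch (illustrative layout, not a parseable struct):

```c
/* RoCE v2 encapsulation, outermost header first:
 *
 *   Ethernet II         -- plain Ethernet; any commodity switch forwards it
 *   IPv4 / IPv6         -- routable across L3; DSCP and ECN bits live here
 *   UDP, dst port 4791  -- IANA-assigned for RoCE v2; the SOURCE port is
 *                          varied per flow to feed ECMP hashing entropy
 *   IB BTH              -- InfiniBand Base Transport Header: opcode,
 *                          destination QP number, packet sequence number
 *   payload + ICRC      -- the RDMA data, closed by an invariant CRC
 */
```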

The arc:

The network is no longer just packet-forwarding infrastructure – it is part of the distributed compute system itself.

The four families

The transport landscape today sits in four distinct buckets. Each solves a different problem, and they coexist:

| # | Family | Solves |
|---|--------|--------|
| 1 | Classic IP transports | General networking – internet, applications, control plane |
| 2 | RDMA transports (traditional) | Kernel bypass for HPC and traditional AI clusters |
| 3 | AI / hyperscaler custom transports | RoCE v2 at 100K+ GPU scale (multipath, no PFC dependence) |
| 4 | Scale-up interconnects | Intra-server / intra-rack GPU-to-GPU communication (different domain) |

Family 1 – Classic IP transports (Layer 4)

You know these. Listed for completeness.

| Protocol | Owner / Std | Reliable | Ordered | Multipath | Encryption | Key trait | Used by / for |
|---|---|---|---|---|---|---|---|
| TCP | IETF | Yes | Yes | No | External (TLS) | Byte stream, AIMD CC, HoL blocking | Web, SSH, SMTP – universal |
| UDP | IETF | No | No | No | External (DTLS) | Connectionless, low overhead | DNS, DHCP, VoIP, gaming, QUIC base |
| SCTP | IETF | Yes | Per-stream | Multi-home failover | External (DTLS) | Multi-streaming + multi-homing | Telecom – SS7/SIGTRAN, Diameter, 5G N2 |
| DCCP | IETF | No | No | No | No | Unreliable + congestion control | Mostly research / abandoned |
| QUIC | IETF (Google origin) | Yes | Per-stream | Connection migration | Built-in (TLS 1.3) | 0/1-RTT setup, user space | HTTP/3 – Google, Cloudflare, Meta, Apple |
| MPTCP | IETF | Yes | Yes | Yes (subflows) | External | TCP across multiple paths | Apple Siri/iOS, Samsung, Linux |
| UDP-Lite | IETF | No | No | No | External | Partial checksum | Loss-tolerant codecs |

QUIC is the standout here for an AI-adjacent network engineer – it runs fully in user space, ships with TLS 1.3 built in, and survives path changes natively via connection migration (true multipath arrives with the Multipath QUIC extension). You'll meet it on the inference and edge path.


Family 2 – RDMA transports (traditional)

The transports that built HPC and early AI. All share the IBTA verbs API, all are kernel-bypass.

| Protocol | Owner / Std | Substrate | Lossless required | Multipath | Encryption | Key trait | Used by / for |
|---|---|---|---|---|---|---|---|
| InfiniBand | NVIDIA / IBTA | IB fabric | Yes (credit-based FC) | RD mode only | Optional | Native RDMA, sub-μs latency | DGX SuperPOD, TOP500 HPC, Meta RSC |
| RoCE v1 | IBTA (open) | Ethernet L2 | Yes (PFC) | Limited | Optional | IB transport over Ethernet, non-routable | Same-subnet RDMA |
| RoCE v2 | IBTA (open) | UDP/IP | Yes (PFC) | Limited (ECMP) | Optional | IB transport over UDP, routable | Azure, Meta, Tencent, ByteDance, Baidu |
| iWARP | IETF (open) | TCP/IP | No | Via TCP | Optional | RDMA over TCP, no PFC needed | Intel E810, Chelsio (niche) |
| IB verbs modes | IBTA | – | – | – | – | RC, RD, UC, UD transport modes | RC dominant in production |

Key takeaway: RDMA at scale (RoCE v2) needs a lossless underlay (PFC) and doesn't multipath naturally. Both of those constraints break at 100K+ GPU scale – which is why the next family exists.


Family 3 – AI / hyperscaler custom transports (the new generation)

Each major hyperscaler hit RoCE v2's ceiling and built its own transport. The common pattern: packet spraying for multipath, built-in encryption, out-of-order delivery with hardware reassembly, and microsecond failover when a link or switch breaks.

| Protocol | Owner | Substrate | Lossless? | Multipath | Encryption | Key trait | Used by / for |
|---|---|---|---|---|---|---|---|
| MRC (Multipath Reliable Conn.) | OpenAI + AMD/MS/NV/Broadcom/Intel – OCP | Ethernet/IP | No | Yes (packet spray) | Built-in | Evolution of RoCE v2; μs failover; verbs-compatible | OpenAI training, Microsoft Fairwater, Oracle Abilene |
| Falcon | Google – OCP | Ethernet/IP | No | Yes (PLB) | Built-in (PSP/IPsec) | HW transport, multi-ULP (RDMA + NVMe) | Google Cloud, Intel E2100 IPU |
| SRD (Scalable Reliable Datagram) | AWS | Ethernet/IP | No | Yes (packet spray) | Built-in | Out-of-order delivery, HW-offloaded, libfabric API | AWS EFA – EC2 P5, Trn1/Trn2, HPC |
| UET (Ultra Ethernet Transport) | UEC consortium (open) | Ethernet/IP | No | Yes (packet spray) | Built-in | Open standard; ~75% from HPE Slingshot; libfabric 2.0 | Industry target – 1M+ endpoint scale |
| Pony Express | Google (legacy) | Ethernet/IP | No | Limited | Optional | SW-only predecessor to Falcon; ran in the Snap microkernel | Older Google datacenters (superseded) |
| eRDMA | Alibaba | VPC/Ethernet | No | Yes | Optional | RDMA for cloud tenants | Alibaba Cloud ECS |

The pattern is the same everywhere: drop the PFC dependency, spray packets across all available paths, hardware-offload reassembly. UET is the open-standard convergence target – expect MRC and Falcon ideas to fold in over time.
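A toy sketch of the difference, with hypothetical helper names (real path selection lives in NIC hardware, which also reassembles the resulting out-of-order stream): classic ECMP hashes the flow 5-tuple once, pinning an elephant flow to one uplink; a spraying transport re-picks the path per packet.

```c
#include <stdint.h>

#define NUM_PATHS 8   /* e.g., eight equal-cost uplinks out of the rack */

/* Classic ECMP: one hash per FLOW. Every packet of a long-lived
 * gradient flow lands on the same uplink -- one hot link, seven idle. */
uint32_t ecmp_path(uint32_t flow_hash) {
    return flow_hash % NUM_PATHS;
}

/* Packet spraying: a fresh path per PACKET. Load evens out across all
 * uplinks; packets arrive out of order, so the receiving NIC must
 * reassemble in hardware -- exactly what SRD, Falcon, MRC, and UET
 * add on top of the RoCE v2 model. */
uint32_t spray_path(uint32_t flow_hash, uint32_t packet_seq) {
    return (flow_hash + packet_seq) % NUM_PATHS;
}
```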


Family 4 – Scale-up interconnects

Scale-up is intra-server / intra-rack. Scale-out (everything above) is inter-server. Scale-up connects GPUs inside one logical box; scale-out connects boxes to other boxes. They are disjoint problems with different bandwidths (TB/s vs Gbps), different latencies (ns vs μs), and different protocols.

| Protocol | Owner | Domain | Key trait | Used by / for |
|---|---|---|---|---|
| NVLink / NVSwitch / NVLink Fabric | NVIDIA (proprietary) | Scale-up GPU | Up to 1.8 TB/s per GPU; sub-μs | DGX, GB200 NVL72, HGX |
| UALink | AMD / Broadcom / Cisco / Google / HPE / Intel / Meta / MS – open | Scale-up GPU | Open NVLink alternative; v1.0 in 2025 | Future open AI servers |
| SUE (Scale-Up Ethernet) | Broadcom | Scale-up GPU | Simpler than UET; ≤1.6 Tbps, ~100 ns device latency | Broadcom AI silicon |
| ICI (Inter-Chip Interconnect) | Google | Scale-up TPU | Native TPU pod fabric | Google TPU v4 / v5p / Trillium pods |
| Slingshot / Portals 4 | HPE (Cray) | HPC scale-out | Adaptive routing; UET 1.0 lineage (~75%) | Frontier, El Capitan, Aurora, leadership HPC |
| OmniPath | Cornelis Networks (ex-Intel) | HPC scale-out | InfiniBand-style fabric | Some HPC sites |
| RDS (Reliable Datagram Sockets) | Oracle | Cluster IPC | Reliable datagrams over IB / RoCE / TCP | Oracle RAC interconnect |
| TIPC | Ericsson / Linux | Cluster IPC | Topology-aware cluster messaging | Telecom clusters |

GB200 NVL72 puts 72 GPUs on one NVLink Switch fabric – that's one logical machine over scale-up. RDMA only takes over at the rack boundary.


Mental model – the 6-point synthesis

  1. Classic IP transports cover the internet. TCP, UDP, QUIC. Universal but kernel-bound.
  2. The RDMA family (IB, RoCE v2, iWARP) covers traditional HPC/AI – but needs a lossless fabric (PFC) and doesn't multipath well.
  3. Each hyperscaler built a custom multipath transport because RoCE v2 doesn't scale to 100K+ GPUs: Google → Falcon, AWS → SRD, OpenAI/Microsoft/NVIDIA/AMD → MRC, Alibaba → eRDMA.
  4. UET is the open-standard convergence target. Expect MRC and Falcon ideas to fold into it over time.
  5. Scale-up (NVLink / UALink / SUE / ICI) is intra-server and disjoint from scale-out transports. Different domain, different physics, different protocols.
  6. Congestion control matters as much as the transport. Most modern AI fabrics combine packet spraying + delay-based CC + ECN/INT signals + microsecond failover.

Who built what

| Tech | Owner / Standards body | When |
|---|---|---|
| TCP/IP | IETF (DARPA) | 1973–1980s |
| UDP | IETF | 1980 |
| InfiniBand spec | IBTA consortium | 1999 |
| Mellanox IB silicon | Mellanox Technologies (Israel) | 1999 → acquired by NVIDIA 2019 |
| RDMA verbs API | IBTA / OpenFabrics Alliance | 2000s |
| SCTP | IETF | 2000 |
| RoCE v1 | IBTA | 2010 |
| QUIC | Google → IETF | 2012 / RFC 9000 in 2021 |
| MPTCP | IETF | 2013 |
| RoCE v2 | IBTA | 2014 |
| DCQCN | Microsoft Research | SIGCOMM 2015 |
| Pony Express | Google (legacy) | ~2014–2023 |
| AWS EFA / SRD | AWS | 2018+ |
| eRDMA | Alibaba | 2020 |
| Spectrum-X | NVIDIA | 2023 |
| Ultra Ethernet Consortium | AMD, Arista, Broadcom, Cisco, HPE, Intel, Meta, Microsoft, +50 others | Founded 2023, spec 1.0 in 2025 |
| Falcon | Google + Intel (E2100 IPU) | 2023 |
| MRC | OpenAI + AMD + Microsoft + NVIDIA + Broadcom + Intel (OCP) | 2024 |
| UALink Consortium | AMD, Broadcom, Cisco, Google, HPE, Intel, Meta, Microsoft | 2024 / v1.0 in 2025 |
| SUE (Scale-Up Ethernet) | Broadcom | 2024 |
| ICI | Google | TPU v4 era (~2018) |
| Slingshot | HPE (Cray) | 2019 |

2. Congestion Control

What is congestion control?

Congestion control is what the network does when more traffic is offered than a link can carry. It's not "the link is full" – it's "the link is about to be full, and we need to decide who slows down before packets start dropping or queues blow up."

In an AI fabric this matters because one congested link can stall a synchronized collective (AllReduce, AllGather) across the entire job. A 0.1% packet loss rate is fine for TCP. It's a 10× throughput hit for RDMA.
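Why the asymmetry? TCP retransmits only the lost segment (SACK); classic RoCE NICs recover with go-back-N, where one drop forces a retransmit of everything in flight behind it. A back-of-envelope with loss rate p and W packets in flight (W = 1000 is an assumed round number for a 400 Gbps × RTT window):

```latex
\text{retransmitted fraction} \;\approx\; p \cdot W,
\qquad p = 10^{-3},\ W = 1000 \;\Rightarrow\; p \cdot W \approx 1
```

At 0.1% loss, nearly every window gets resent at least once – the order-of-magnitude collapse above. The selective retransmission in UET, MRC, and SRD exists precisely to soften this.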

What you already know

You've shipped this in production:

  • TCP congestion control – slow start, congestion avoidance, fast retransmit; Reno → Cubic → BBR
  • ECN (RFC 3168) – the switch marks the IP header at the congested egress; the receiver echoes it back; the sender slows down
  • RED / WRED – random early drop; the switch starts dropping before the queue is full
  • PFC (IEEE 802.1Qbb) – link-level pause frames per priority class; backpressure instead of drop
  • QoS toolkit – DSCP, CoS, queue scheduling, buffer profiles, headroom, watermarks

AI fabric uses these same primitives – but the closed-loop algorithm running on the NIC is different. Where TCP backs off after a packet drops, an RDMA fabric uses ECN to warn the sender before the drop happens, and the NIC adjusts its send rate proactively. PFC is the safety net underneath.

A mental model:

  • PFC = backpressure (pull the emergency brake – stop sending on this priority right now)
  • ECN = early warning (the dashboard light flashes – slow down, congestion ahead)
  • CC algorithm = the driver's foot – uses ECN signals to dial back the rate before PFC has to fire
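A toy sketch of that control loop on the sender NIC, in the DCQCN style. The constants and names are illustrative, not the published algorithm (real DCQCN, per the SIGCOMM 2015 paper, runs in NIC firmware and adds byte-counter stages to the recovery schedule):

```c
#include <stdio.h>

/* Illustrative DCQCN-flavored rate control for one RoCE flow.
 * on_cnp() runs when a CNP (Congestion Notification Packet) arrives --
 * i.e., the receiver saw ECN-marked packets. on_timer() runs
 * periodically while the flow stays quiet. */
struct flow {
    double rate_gbps;    /* current send rate                        */
    double target_gbps;  /* rate remembered from before the last cut */
    double alpha;        /* EWMA estimate of congestion severity     */
};

#define LINE_RATE_GBPS 400.0
#define G (1.0 / 16.0)           /* alpha gain -- illustrative value */

void on_cnp(struct flow *f) {            /* ECN says: congestion ahead */
    f->alpha = (1 - G) * f->alpha + G;   /* severity estimate rises    */
    f->target_gbps = f->rate_gbps;       /* remember where we were     */
    f->rate_gbps *= 1 - f->alpha / 2;    /* multiplicative slow-down,
                                            BEFORE any drop or PFC     */
}

void on_timer(struct flow *f) {          /* no CNPs lately              */
    f->alpha *= 1 - G;                   /* severity estimate decays    */
    if (f->rate_gbps < f->target_gbps)   /* fast recovery toward target */
        f->rate_gbps = (f->rate_gbps + f->target_gbps) / 2;
    else if (f->rate_gbps < LINE_RATE_GBPS)
        f->rate_gbps += 0.5;             /* additive probe for headroom */
}

int main(void) {                         /* tiny demo of the shape      */
    struct flow f = { LINE_RATE_GBPS, LINE_RATE_GBPS, 0.0 };
    on_cnp(&f);                          /* one mark: sharp cut         */
    for (int i = 0; i < 5; i++)
        on_timer(&f);                    /* quiet period: climb back    */
    printf("rate after recovery: %.1f Gbps\n", f.rate_gbps);
    return 0;
}
```

Notice what never appears in this loop: PFC. It only fires at the link layer if the loop reacts too slowly – the safety-net role from the bullets above.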

How congestion control evolved

[Timeline: congestion control evolution, 1988–2025 – TCP Reno/NewReno (1988), ECN (2001), CUBIC (2006), DCTCP (2010), DCQCN + TIMELY (2015), BBR (2016), NDP/Trim (2017), Homa + AWS SRD CC (2018), HPCC (2019), Swift (2020), PowerTCP (2022), Spectrum-X / Falcon CC (2023), MRC CC (2024), UET CC (2025). Legend: active spec / standard · vendor / proprietary · superseded.]

Era-by-era, in one line:

  • 1988 – TCP slow start + congestion avoidance (Berkeley / IETF). The field is born.
  • 2001 – ECN standardized (RFC 3168). The network can mark, not just drop.
  • 2006 – CUBIC becomes the Linux default. Cubic growth function – fills bandwidth aggressively.
  • 2010 – DCTCP (Microsoft / Stanford, SIGCOMM). Fine-grained ECN-based CC for datacenter TCP. Proves precise ECN works.
  • 2015 – DCQCN (Microsoft Research, SIGCOMM). DCTCP's ideas applied to RoCE. The canonical RDMA CC.
  • 2015 – TIMELY (Google, SIGCOMM). RTT-gradient-based. Delay, not ECN, as the signal.
  • 2016 – BBR (Google). Model-based; fills the bottleneck without filling buffers. Now in YouTube, QUIC, the Linux kernel.
  • 2017 – NDP / Trim (Cambridge research). The switch trims payloads on congestion instead of dropping.
  • 2018 – AWS SRD CC ships with EFA. Path-level feedback; sub-10 ms RTO.
  • 2018 – Homa (Stanford research). Receiver-driven, priorities + grants; eliminates HoL blocking via message-oriented design.
  • 2019 – HPCC (Alibaba, SIGCOMM). Uses In-band Network Telemetry (INT) for precise per-link feedback.
  • 2020 – Swift (Google, SIGCOMM). Successor to TIMELY. Decomposes host vs fabric latency. Becomes Falcon's CC core.
  • 2022 – PowerTCP (NSDI). Bandwidth × queue depth as a combined signal.
  • 2023 – Spectrum-X CC (NVIDIA). Switch + NIC co-designed. Closed system.
  • 2023 – Falcon CC = Swift + CSIG + Carousel (Google + Intel E2100). HW per-flow shaping, multipath PLB.
  • 2024 – MRC CC (OpenAI / Microsoft / NVIDIA / AMD / Broadcom / Intel). Programmable CC + μs rerouting.
  • 2025 – UET CC (Ultra Ethernet Consortium). Two-sided CC for a packet-sprayed fabric.

The arc:

Congestion control moved from a reactive software algorithm (TCP) to a proactive, hardware-co-designed, fabric-wide control loop (DCQCN, HPCC, UET).

CC algorithm families

TCP family – what runs the internet

| Algorithm | Family | Signal | Key trait | Where used |
|---|---|---|---|---|
| Reno / NewReno | TCP | Loss | Classic AIMD; the baseline | Legacy TCP everywhere |
| CUBIC | TCP | Loss | Cubic growth function – default in Linux/Windows | Most internet TCP today |
| Vegas / Westwood | TCP | Delay / bandwidth estimate | Delay-based; low queueing | Niche / research |
| Compound TCP | TCP | Loss + delay | Hybrid | Older Windows |
| BBR v1 / v2 / v3 | TCP / QUIC | Bandwidth + RTT | Model-based; fills the bottleneck without filling buffers | Google services, YouTube, QUIC, Linux kernel |

Datacenter & RDMA family – built for tight datacenter networks

| Algorithm | Family | Signal | Key trait | Where used |
|---|---|---|---|---|
| DCTCP | DC TCP | ECN marking | Proportional reaction to ECN; small queues | Microsoft, Linux DC TCP stacks |
| DCQCN | RoCE v2 | ECN + PFC | Default RoCE v2 CC; rate-based | Azure, Meta, Tencent – most RoCE clusters |
| TIMELY | RDMA | Delay (RTT gradient) | Delay-based, CPU-light | Google early RDMA |
| Swift | RDMA / Falcon | Delay (NIC RTT) | Decomposes host vs fabric latency; basis for Falcon | Google Falcon |
| HPCC | RDMA | In-band telemetry (INT) | Precise rates from switch INT data | Alibaba |
| PowerTCP | DC TCP / RDMA | Power = bandwidth × queue | Combines bandwidth and queue-depth signals | Research / select DCs |

AI / hyperscaler custom CC – the new generation

| Algorithm | Family | Signal | Key trait | Where used |
|---|---|---|---|---|
| MRC CC | MRC | Multipath telemetry | Programmable CC + microsecond rerouting | OpenAI / Microsoft Fairwater / Oracle Abilene |
| Falcon CC (Swift + CSIG + Carousel) | Falcon | Delay + congestion signals | HW per-flow shaping, multipath PLB | Google + Intel E2100 |
| SRD CC | SRD | Path-level feedback | Avoids overloaded paths; <10 ms RTO | AWS EFA |
| UET CC | UET | Sender + receiver signals | Two-sided CC for a packet-sprayed fabric | Ultra Ethernet 1.0 |
| Spectrum-X CC | Spectrum-X | Switch + NIC telemetry | Switch and NIC co-designed | NVIDIA Spectrum-X |

Research / exotic

| Algorithm | Family | Signal | Key trait | Where used |
|---|---|---|---|---|
| Homa | Receiver-driven | Priorities + grants | Eliminates HoL blocking via priorities; message-oriented | Stanford research, influential |
| NDP / Trim | Switch-assisted | Header trimming | Switch trims payload on congestion; no whole-packet drop | Cambridge research |
| ExpressPass | Credit-based | Receiver credits | Receiver paces senders with credit packets | Research |
| EQDS | Edge-queued | Edge-based shaping | Pushes queues to the edges, not the core | Cambridge / UCL research |

Mental model

Most modern AI fabrics combine four ideas: packet spraying + delay-based CC + ECN/INT signals + microsecond failover.

Pick any production AI-fabric CC algorithm – MRC, Falcon, SRD, UET CC, Spectrum-X CC – and you'll find some combination of these four. The PFC-only era is ending.

Who built what

| Tech | Owner / Standards body | When |
|---|---|---|
| TCP Reno / NewReno | Berkeley / IETF | 1988 |
| ECN (RFC 3168) | IETF | 2001 |
| CUBIC | NCSU → Linux | 2006 |
| DCTCP | Microsoft / Stanford | SIGCOMM 2010 |
| DCQCN | Microsoft Research | SIGCOMM 2015 |
| TIMELY | Google | SIGCOMM 2015 |
| BBR | Google | 2016 |
| NDP / Trim | Cambridge | 2017 |
| AWS SRD CC | AWS | 2018 |
| Homa | Stanford | 2018 |
| HPCC | Alibaba | SIGCOMM 2019 |
| Swift | Google | SIGCOMM 2020 |
| PowerTCP | Research consortium | NSDI 2022 |
| Spectrum-X CC | NVIDIA | 2023 |
| Falcon CC | Google + Intel | 2023 |
| MRC CC | OpenAI + Microsoft + NVIDIA + AMD + Broadcom + Intel | 2024 |
| UET CC | Ultra Ethernet Consortium | 2024–2025 |

📄 Transports & Congestion Control – One-Pager – same content, denser, single-sheet print format


What this curriculum walks

Of all the options above, the course teaches:

  • Transport: RoCEv2
  • Congestion control: DCQCN + PFC/ECN

Why this combination: it's the most-deployed RDMA-on-Ethernet pattern in 2026, vendor-neutral, well-documented in public standards (IBTA, IEEE), and what you'll see at most hyperscalers running training on Ethernet today.

If your stack differs (you run InfiniBand, UEC, MRC, Falcon, SRD, or another custom transport), the same protocol vocabulary still applies – you swap the implementation, not the concepts.

The other layers of the AI fabric – host networking, Kubernetes, GPU drivers, topology – are covered in their own phases of the curriculum.


Where to next