What This Curriculum Picks
Of the options surveyed in the prior two pages, the course teaches one combination: RoCEv2 over Ethernet, with PFC for losslessness, ECN marking in the fabric, and DCQCN for end-to-end congestion control.
Why this combination
- Most-deployed RDMA-on-Ethernet pattern in 2026. Azure, Meta, Tencent, ByteDance, Baidu, and most enterprises training on Ethernet run this stack.
- Vendor-neutral. RoCEv2 is an IBTA spec; PFC is IEEE 802.1Qbb; ECN is RFC 3168; DCQCN is a public SIGCOMM paper. You can build it on NVIDIA, Broadcom, Cisco, Arista, or Juniper silicon and the protocol surface is the same.
- Public standards, public research. Everything you'll learn is documented in RFCs, IEEE standards, and academic papers — not under NDA.
- The lingua franca. It's what you'll encounter at most hyperscalers and AI clouds running training on Ethernet today. Once you know this stack, the others become "swap the implementation, keep the concepts."
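To make "ECN is RFC 3168" concrete: ECN lives in the low two bits of the IP TOS / Traffic Class byte, and a congested switch rewrites ECT to CE instead of dropping. A minimal Python sketch (function names are mine; the codepoint values come straight from the RFC):

```python
# ECN codepoints per RFC 3168: the low two bits of the IP TOS byte.
NOT_ECT = 0b00  # sender does not support ECN
ECT_1   = 0b01  # ECN-Capable Transport (1)
ECT_0   = 0b10  # ECN-Capable Transport (0)
CE      = 0b11  # Congestion Experienced, set by a marking switch

def get_ecn(tos: int) -> int:
    """Extract the ECN codepoint from a TOS byte."""
    return tos & 0b11

def mark_ce(tos: int) -> int:
    """What an ECN-marking switch does under congestion: rewrite
    ECT(0)/ECT(1) to CE instead of dropping the packet."""
    if get_ecn(tos) in (ECT_0, ECT_1):
        return (tos & ~0b11) | CE
    return tos  # Not-ECT traffic must not be CE-marked

# Example: a packet sent with DSCP 26 and ECT(0).
tos = (26 << 2) | ECT_0
marked = mark_ce(tos)
assert get_ecn(marked) == CE   # congestion signaled in-band
assert marked >> 2 == 26       # DSCP bits untouched
```

The receiver's NIC sees the CE mark and echoes it back to the sender, which is the signal DCQCN's control loop consumes.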
If your stack differs
You might be running:
- InfiniBand (DGX SuperPOD, traditional HPC)
- UET (Ultra Ethernet 1.0 — new AI-Ethernet deployments)
- MRC (OpenAI / Microsoft Fairwater / Oracle Abilene)
- Falcon (Google Cloud)
- SRD (AWS EFA — P5, Trn1/Trn2)
- Spectrum-X (NVIDIA-integrated AI-Ethernet)
- eRDMA (Alibaba Cloud)
The same protocol vocabulary still applies. You swap the implementation, not the concepts. Verbs, queue pairs, memory regions, congestion signals, multipath, microsecond failover — these are universal. The bytes-on-the-wire format and the closed-loop algorithm change; the way you reason about the fabric does not.
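To make "closed-loop algorithm" concrete, here is a deliberately simplified Python sketch of DCQCN's sender side (the reaction point), following the shape of the public SIGCOMM paper. Class and method names are mine, the gain constant is illustrative, and real NICs run this in firmware with hardware timers; only the multiplicative-decrease / fast-recovery shape is the point:

```python
# Simplified DCQCN reaction point (sender-side control loop).
# Sketch only: constants are illustrative, and the full algorithm
# adds additive- and hyper-increase stages after fast recovery.
G = 1 / 256  # EWMA gain for the congestion estimate alpha

class ReactionPoint:
    def __init__(self, line_rate_gbps: float):
        self.rc = line_rate_gbps   # current sending rate
        self.rt = line_rate_gbps   # target rate to recover toward
        self.alpha = 1.0           # estimate of congestion severity

    def on_cnp(self) -> None:
        """A CNP arrived (receiver echoed a CE mark): cut rate
        multiplicatively, scaled by the congestion estimate."""
        self.rt = self.rc
        self.rc *= 1 - self.alpha / 2
        self.alpha = (1 - G) * self.alpha + G

    def on_alpha_timer(self) -> None:
        """No CNP for an interval: decay the congestion estimate."""
        self.alpha = (1 - G) * self.alpha

    def on_increase_timer(self) -> None:
        """Fast recovery: climb halfway back toward the target rate."""
        self.rc = (self.rc + self.rt) / 2

rp = ReactionPoint(100.0)
rp.on_cnp()                # congestion: rate drops below line rate
assert rp.rc < 100.0
cut_rate = rp.rc
rp.on_increase_timer()     # recovery: rate climbs back toward target
assert cut_rate < rp.rc <= rp.rt
```

Swapping the fabric means swapping this loop (SRD, Falcon, and UET each close it differently), but the reason-about-it model stays the same: a congestion signal from the fabric, a rate cut at the sender, a measured climb back.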
What the rest of the curriculum covers
The other layers of an AI fabric — host networking, Kubernetes, GPU drivers, topology — are covered in their own phases of the curriculum. This section is one layer of the stack, picked because it's the foundation everything else hangs off.
Where to next
- The Curriculum → — back to the index
- About — who built this