The curriculum

AI networking, distilled — for network engineers.

This is a public learning hub for engineers crossing from traditional DC fabrics into AI training and inference networks. If you already know BGP, ECMP, Clos — and now find yourself supporting GPU clusters that look nothing like a traditional data center — this is for you.

Modules are written, polished, and shipped one at a time. No fixed roadmap, no fake module count, no marketing-driven content schedule. They drop when they're ready and not before.

What you'll learn

The curriculum walks the full stack a network engineer touches in an AI fabric:

  • RDMA & RoCE on Ethernet — verbs, queue pairs, memory regions, kernel bypass, why one dropped packet can stall a training job
  • Switch QoS — PFC, ECN, DCQCN, HPCC, buffer profiles and the configs you actually deploy
  • Buffers & congestion — shared pools, headroom, watermarks, where queues live and what makes them spill
  • Host networking — SR-IOV, Multus, NCCL, GPU operator, getting the network inside the container
  • Topology & sizing — rail-optimized fabrics, NVLink/NVSwitch boundaries, scaling from 32 GPUs to thousands
  • Production operations — root-cause analysis, telemetry, failure modes vendors don't print, the playbooks that survive 3 AM
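To make the "one dropped packet" point concrete, here's a toy back-of-the-envelope model (my own simplification, not part of the curriculum): classic RoCEv2 loss recovery is go-back-N, so a single drop forces retransmission of everything sent after it.

```python
# Toy model of go-back-N recovery on RoCEv2 (illustrative assumptions,
# not a real NIC simulator): dropping packet `drop_index` of a
# `window`-packet burst means every later packet is sent twice.
def goodput_fraction(window: int, drop_index: int) -> float:
    """Fraction of transmitted packets that were useful."""
    resent = window - drop_index      # packets retransmitted
    total_sent = window + resent      # original burst + retransmits
    return window / total_sent

# Dropping the very first packet of a 64-packet burst halves goodput:
print(goodput_fraction(64, 0))   # 0.5
```

Scale that stall across a synchronized collective like an allreduce — where the whole training step waits on the slowest flow — and one drop on one link gates thousands of GPUs, which is why the lossless-Ethernet machinery (PFC, ECN, DCQCN) exists at all.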

How this is built

  • Vendor-neutral. NVIDIA, Broadcom, Cisco, Arista, Juniper — evaluated on technical merit
  • Source material. Public RFCs, IEEE standards, vendor docs, academic papers, home-lab and production experience
  • Free. Apache 2.0 / CC BY 4.0 — share, remix, build on it
  • Personal views. Not affiliated with any employer

Stay in the loop

The fastest way to follow along is the blog — that's where new modules get announced, plus field notes, RFC walk-throughs, and incident write-ups in between releases.

Watch the GitHub repo to track what's in flight.


This page will fill out as the curriculum gets built. Module list, prerequisites, time estimates, and labs all land here once the first phase is published.