AI Networking · For Network Engineers

The network
the AI runs on.

You know BGP, ECMP, Clos. Now your fabric carries gradients, not just packets.

[Diagram: rail-optimized AI fabric · 32 GPUs · RoCE v2 · 800G. Four spines (SPINE-01 to SPINE-04), eight rail leaves (R01 to R08), four hosts (dgx-01 to dgx-04) with 8 × B300 each. Rail R-N → GPU-N of every host · 8 rails · NCCL collective-friendly.]
Spine · Leaf · GPU — the shape of every AI training fabric
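
The wiring rule in the diagram (rail R-N serves GPU-N of every host) can be sketched in a few lines. This is a minimal illustration, not a provisioning tool: the host names and counts come from the diagram, while the 0-based GPU index and the helper names rail_for and build_rails are assumptions made for the example.

```python
from collections import defaultdict

HOSTS = ["dgx-01", "dgx-02", "dgx-03", "dgx-04"]  # host names from the diagram
GPUS_PER_HOST = 8                                  # 8 GPUs per host, so 8 rails


def rail_for(gpu_index: int) -> str:
    """Map a GPU index to its rail leaf (assumed 0-based: GPU 0 -> R01)."""
    return f"R{gpu_index + 1:02d}"


def build_rails(hosts, gpus_per_host):
    """Return {rail: [(host, gpu_index), ...]} for a rail-optimized fabric."""
    rails = defaultdict(list)
    for host in hosts:
        for gpu in range(gpus_per_host):
            rails[rail_for(gpu)].append((host, gpu))
    return dict(rails)


rails = build_rails(HOSTS, GPUS_PER_HOST)
for rail, members in sorted(rails.items()):
    print(rail, members)
```

Each rail ends up with one GPU per host, so traffic between same-index GPUs on different hosts can stay on a single rail leaf.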

Sound familiar?

Your company is building GPU clusters. You're not in the room.
ML teams talk about NCCL, RDMA, PFC storms. You nod and Google later.
AI networking roles are opening everywhere. You don't qualify — yet.

After Lossless Network.

From Network Engineer to AI Network Engineer — left side shows a traditional network engineer surrounded by BGP, OSPF, leaf-spine, copper and fiber cables, switches and routers. Right side shows the same engineer surrounded by GPU servers, NVLink, ConnectX-8 NICs, SR-IOV virtual functions, Multus, Kubernetes pods, NCCL AllReduce ring, lossless RoCEv2 fabric, Prometheus, Grafana, and Volcano scheduler.
From left to right — from "what's that?" to "I built that."
"What's RDMA?"
Walk verbs · queue pairs · memory regions through NCCL
"Why is training slow?"
Diagnose PFC storms, ECN misconfig, NCCL timeouts
"I don't know Kubernetes"
Deploy Multus pods with RDMA interfaces
"Can't contribute to AI fabric design"
Design rail-optimized topology for 1000+ GPUs

How it works.

  1. Read the theory. In your language.

    Every concept bridged to networking you already know. No ML jargon without a translation.

    Example: PFC = backpressure on your switch ports. SR-IOV = VRF for NICs. Multus = secondary network attachments. NCCL ring AllReduce = a token bus, but for gradients.
  2. See the full picture.

    The complete AI fabric, walked end to end. Annotated diagrams, every hop labeled.

    Example: A single AllReduce flow: GPU₀ → NIC queue pair → rail-leaf → spine → rail-leaf → NIC → GPU₁ on the next host. Every queue, every counter, every place a packet can stall.
  3. Watch the commands. See real output.

    Recorded lab walk-throughs. Every command typed, every counter read, every output explained.

    Example: rdma link add rxe0 type rxe netdev eth0 brings SoftRoCE up on Linux. Then ib_write_bw saturates 100G across two namespaces. Pause, rewind, copy the command into your own host when you have the gear.
  4. Watch it break.

    Real failure modes, recorded and dissected. The wave, the counters, the timing.

    Example: A PFC storm starts on one GPU node. Pause frames flood every uplink in two seconds. Training jobs across the fabric stall. Watch it propagate, then watch the diagnosis.
  5. Read the RCA.

    Incident write-ups. What broke, how it was caught, what fix shipped, what stayed broken.

    Example: Why a 256-GPU job ran at 60% throughput for 11 hours: one ECN profile off by 3 KB on a single leaf. The full diagnosis trail, the patch, the re-deploy.
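
The "token bus, but for gradients" analogy from step 1 can be made concrete with a toy simulation. This is a sketch of the textbook ring AllReduce (reduce-scatter, then all-gather) in plain Python, not NCCL's actual implementation. It assumes each rank's gradient is split into exactly as many one-number chunks as there are ranks, and the function name ring_allreduce is made up for the example.

```python
def ring_allreduce(grads):
    """Sum-reduce per-rank gradient lists around a ring.

    grads[r] is rank r's gradient: a list of n numbers, one chunk per rank.
    Returns the per-rank buffers after the collective; all should be equal.
    """
    n = len(grads)
    data = [list(g) for g in grads]  # data[r][c] = rank r's copy of chunk c

    # Phase 1, reduce-scatter: n-1 steps. At each step every rank passes one
    # chunk to its right-hand neighbour, which adds it to its own copy.
    for step in range(n - 1):
        sends = [((r - step) % n, data[r][(r - step) % n]) for r in range(n)]
        for r, (chunk, value) in enumerate(sends):
            data[(r + 1) % n][chunk] += value
    # Now rank r holds the fully reduced chunk (r + 1) % n.

    # Phase 2, all-gather: n-1 steps. Finished chunks circulate around the
    # ring and overwrite the stale partial sums.
    for step in range(n - 1):
        sends = [((r + 1 - step) % n, data[r][(r + 1 - step) % n]) for r in range(n)]
        for r, (chunk, value) in enumerate(sends):
            data[(r + 1) % n][chunk] = value
    return data


result = ring_allreduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
print(result)
```

Every rank only ever talks to its ring neighbour, which is why one slow or paused link on the path stalls the whole collective.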

The path.

See the curriculum →
  1. Phase 01

    The machine

    What AI training does and what it runs on.

    AI Training Basics — collectives, parallelism, MFU · GPU & Server Hardware — NVLink, NVSwitch, PCIe, NIC placement

  2. Phase 02 · Course spine

    The fabric

    The shape of the network around AI clusters.

    AI Fabric Architecture — Clos, rail-optimised, hash polarisation · Life of an AI Job in Fabric — submit → AllReduce → checkpoint

  3. Phase 03

    What rides on the wire

    The protocols that move bytes between GPUs.

    HPC Networking — animated map of the three · RDMA — kernel bypass, verbs, QPs, MRs · InfiniBand — the native RDMA fabric · RoCE v2 — BTH, RETH, ICRC, ECMP via UDP src port

  4. Phase 04

    Making it lossless

    The lossless trick — and the configs that actually work.

    Transport & Congestion Control — design-space map, escalation ladder · Switch QoS — PFC, ECN, DCQCN, buffer profiles

  5. Phase 05

    Host & orchestration

    Plug the fabric into the GPU host.

    Host Networking — SR-IOV, Multus, nvidia-peermem, NCCL · Linux for Network Engineers · Kubernetes for Network Engineers

  6. Phase 06

    Build & operate

    Build it, run it, fix it at 3 AM.

    HPC Cluster Designs — the 15-layer stack + 5 provisioning patterns · Building a Training Cluster · Inference Networking · Production Operations · Cluster Build Guide — BoM, RCAs, runbooks
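
Phase 03 mentions "ECMP via UDP src port" for RoCE v2. A rough sketch of why that matters: RoCE v2 rides on UDP with destination port 4791, so the source port is the main field left to vary between flows on the same host pair. The example below is an illustration under assumptions: zlib.crc32 stands in for a switch's vendor-specific hash, and the IP addresses, port range, and the function name pick_uplink are hypothetical.

```python
import ipaddress
import struct
import zlib

ROCE_V2_UDP_DST_PORT = 4791  # well-known RoCE v2 destination port
UPLINKS = ["SPINE-01", "SPINE-02", "SPINE-03", "SPINE-04"]


def pick_uplink(src_ip, dst_ip, src_port, uplinks=UPLINKS):
    """Pick an ECMP next hop by hashing the 5-tuple.

    crc32 is only a stand-in; real switch ASICs use their own hash functions.
    """
    key = (
        ipaddress.ip_address(src_ip).packed
        + ipaddress.ip_address(dst_ip).packed
        + struct.pack("!BHH", 17, ROCE_V2_UDP_DST_PORT, src_port)  # 17 = UDP
    )
    return uplinks[zlib.crc32(key) % len(uplinks)]


# One queue pair with a fixed source port: every packet takes the same uplink.
one_flow = {pick_uplink("10.0.1.11", "10.0.2.11", 49152) for _ in range(8)}

# Many queue pairs, each with its own source port: flows spread across uplinks.
many_flows = {pick_uplink("10.0.1.11", "10.0.2.11", 49152 + qp) for qp in range(64)}

print("fixed src port ->", sorted(one_flow))
print("varied src port ->", sorted(many_flows))
```

With too few distinct source ports, a handful of large flows can land on the same uplink, which is the kind of imbalance the curriculum's hash polarisation topic deals with.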

AI networking is being built right now.

Build the foundation while the field is still forming.