
Hello, Lossless Network

· 2 min read
Staff Network Engineer · RDMA & AI Fabric

TLDR: New site, built for network engineers entering AI. Deep modules + fast blog + zero vendor noise. First module drops soon.


Welcome to Lossless Network — AI networking, distilled for network engineers.

If you've ever tried to design or operate a network fabric for large-scale AI training, you've felt the gap. The standards are public. The vendor whitepapers exist. But nowhere is there a single, opinionated, technically honest walkthrough, written by a network engineer for network engineers, of how the pieces fit together and why some choices that look right on paper fall apart at scale.

This site is my attempt to fix that.

Who this is for

You're a network engineer who:

  • Builds and operates production data center fabrics
  • Has been told to "support AI workloads", which now means RoCEv2, lossless Ethernet, NCCL collectives, and topologies that look nothing like a classic Clos
  • Wants to actually understand what's happening on the wire when 1,024 GPUs run all-reduce simultaneously
  • Refuses to pretend a vendor slide deck is a design document

If that's you, you're home.

The promise

Deep when you need depth. Fast when you need speed.

  • Got 30 seconds? Read the TLDR at the top of every post. That's all you need.
  • Got 5 minutes? Read the blog. Sharp takes, no filler.
  • Got an afternoon? Take a module. First principles to production deployment, end to end.

You decide how deep to go. The content respects your time.

What's coming

Six modules, written in order:

  1. RDMA Fundamentals — verbs, QPs, MRs, and why kernel-bypass exists
  2. RoCEv2 & Lossless Ethernet — PFC, ECN, DCQCN, and what makes RDMA work over Ethernet
  3. AI Fabric Architecture — rail-optimized topologies, NCCL, the all-reduce bottleneck
  4. Congestion Control — the actual tuning, with numbers
  5. Adaptive Routing — DLB, FLB, and why static ECMP kills GPU jobs
  6. UEC & The Future — what comes after RoCE

Plus a blog for everything that doesn't fit the module structure — field notes, debugging stories, paper reviews, and "the thing the vendor didn't tell you" posts.

What this site is not

  • A vendor pitch
  • A reskinned wiki dump
  • AI-generated filler

Every word is written by hand, reviewed by hand, and grounded in real engineering experience.

Stay updated

The blog has an RSS feed. New modules drop one at a time. Follow along, push back when I'm wrong, and let's build the resource the AI networking field has been missing.

— Nagarjun