About

I'm Nagarjun Velmurugan — Staff Network Engineer building AI fabric.

Packet loss used to mean a slow webpage. Now it means a $50,000 training job dies.

That one line captures why I built this site, and why it matters.

Where this started

When I started learning AI networking, I began by searching the internet for solid material. What I found was scattered, outdated, or written for ML engineers rather than network engineers. No single place pulled RDMA, RoCE, PFC, SR-IOV, NCCL, and Kubernetes networking together in a structured way, and nothing recent covered the stack end to end.

That's when it hit me: if I'm struggling to find this, every network engineer walking into AI infra is struggling too. So I started writing it — in our own dialect, with theory and labs side by side, so the muscle memory builds the way it actually does at 2 AM. Not a slide deck. Not a marketing site. The playbook I wish existed when I started.

Every hyperscaler is hiring AI Network Engineers. The fabrics are getting built. But there's still no structured public training for this role. Lossless Network is my attempt to close that gap — open, free, in public.

What I work on

By day: the network infrastructure that powers large-scale AI training and inference. RoCE v2, lossless Ethernet, congestion control, GPU-server fabrics — the layer between the accelerator and the rest of the data center.

By night, here: deep modules on how the pieces actually fit together, plus field notes, paper walk-throughs, and incident write-ups in the blog. Modules drop one at a time as they're polished — no fixed schedule, no marketing deadlines.

What makes this different

This isn't a slide deck or a wiki. The site is built so you run the same commands you'd run in production. Add tc netem delay — the next RDMA test shows degraded bandwidth. Misconfigure PFC — training collapses. You feel it, then you fix it.
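
As a rough illustration of the "add tc netem delay" step, here is a minimal sketch that only builds the command lines and does not run them. The interface name eth0, the 10 ms value, and the helper names netem_delay_cmd and netem_clear_cmd are assumptions made for this example, not taken from the site's labs. Running the printed commands requires root and the iproute2 tc tool.

```python
import shlex


def netem_delay_cmd(dev: str, delay_ms: int) -> list[str]:
    """Build a tc command that adds a fixed netem delay on `dev` (hypothetical helper)."""
    if delay_ms <= 0:
        raise ValueError("delay_ms must be positive")
    return ["tc", "qdisc", "add", "dev", dev, "root",
            "netem", "delay", f"{delay_ms}ms"]


def netem_clear_cmd(dev: str) -> list[str]:
    """Build the tc command that removes the root qdisc again (hypothetical helper)."""
    return ["tc", "qdisc", "del", "dev", dev, "root"]


if __name__ == "__main__":
    # eth0 and 10 ms are placeholder values; substitute your lab interface.
    print(shlex.join(netem_delay_cmd("eth0", 10)))
    print(shlex.join(netem_clear_cmd("eth0")))
```

Apply the first command, rerun the same RDMA bandwidth test you used as a baseline, compare the numbers, then apply the second command to restore the interface.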

Everything is explained in the dialect of a network engineer:

  • PFC is backpressure you already know
  • SR-IOV is VRFs for NICs
  • Kubernetes namespaces are VRFs
  • Services are VIPs
  • DaemonSets are LLDP agents running on every node

No ML jargon without a translation. Written as storytelling — easy to relate, easy to remember.

You don't study Lossless Network. You experience it.

Built to be contributed to

Found a gap? Have a better way to explain PFC? Know a break-fix scenario that's missing? Open a PR. The more engineers contribute, the stronger this gets — for us and for every network engineer who walks into AI infra after us.

Connect

If something on this site is wrong, off, or out of date — send a note via GitHub. The content improves only when readers push back.