
Life of an AI Job in Fabric

When you type google.com into a browser, you can recite the lifecycle from memory — DNS, TCP handshake, TLS, HTTP request, response, render. This page does the same for an AI training job. Five clean states, every event in order, every protocol named.

First: four physical networks in a production AI cluster

A real AI cluster has four physically separate networks, each with its own leaves and spines. They share the same datacenter floor but never share switches:

| Network | Speed | Who's on it | Carries |
|---|---|---|---|
| 🟣 Backend Network | 400 G RoCEv2 | GPU servers (4–8 NICs each, one per GPU rail) | Collective communication — AllReduce, AllGather. Lossless via PFC + ECN. |
| 🟢 Frontend Network | 100 G / 25 G | X86 head nodes, k8s control plane | kubectl, API server, etcd, image pulls, monitoring, NCCL TCP rendezvous, SSH |
| 🔵 Storage Network | 100 G | X86 head nodes + storage servers | NVMe-oF batches in, checkpoints out — its own dedicated fabric |
| Out-of-Band (OOB) | 1 G | All switches + servers (BMC/IPMI ports) | Switch management, hardware health, firmware updates — totally separate from data path |

GPU servers sit on the Backend only (for collectives). X86 head nodes sit on both Frontend + Storage (control plane and data access). Storage servers sit on the Storage Network only. Every switch and server is also reachable on the OOB for management.

NVLink is not on this picture — it's a GPU-to-GPU bus inside each server chassis (silicon-level, ~900 GB/s per H100). It never crosses any external network.

State 1 of 5 — Frontend Network

Submit → Schedule → Pods running

From kubectl apply to four containers actually running. Pure control plane — Kubernetes API, etcd, scheduler, kubelet, container runtime — all on the Frontend Network. The Backend (RoCEv2) and Storage networks are completely idle.

[Animation: you run kubectl → K8s scheduler (gang scheduler) places 4 pods atomically across Nodes 1–4; GPUs on all four nodes stay idle throughout.]
Transport: Frontend (TCP) · Utilization: 1.5% · Congestion: 0% · PFC (802.1Qbb): inactive
What the network sees
  1. You run kubectl apply -f myjob.yaml. The spec says: nodes: 4, gpus: 4/node, rdma/hca: 4, container image.
  2. kube-apiserver validates the PyTorchJob and runs the Training Operator's admission webhook (checks GPU + RDMA quotas).
  3. The spec is written to etcd. The scheduler is notified of 4 pending pods.
  4. The scheduler does gang scheduling — all 4 pods placed atomically, or the job waits. No partial allocation.
  5. Topology-aware placement: it prefers nodes with the same topology.kubernetes.io/rack label to minimise cross-leaf traffic.
  6. kubelet on each chosen node receives the pod spec and pulls the container image (nvcr.io/nvidia/pytorch:24.04-py3, ~10 GB).
  7. Device plugins mount /dev/nvidia0..3 and /dev/infiniband/uverbs0..3 into the container namespace.
  8. Containers start. Pods running.

Key takeaway: the compute fabric saw zero RDMA traffic this entire state. Everything was TCP control plane.
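As a concrete anchor for step 1, here is a minimal sketch of what myjob.yaml might look like, assuming the Kubeflow Training Operator's PyTorchJob CRD. The worker-only layout and every field value are illustrative (real jobs often also declare a Master replica); only nodes: 4, 4 GPUs/node, rdma/hca: 4, and the image come from the text above.

```yaml
# Hypothetical myjob.yaml — a sketch, not a tested manifest.
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: myjob
spec:
  pytorchReplicaSpecs:
    Worker:
      replicas: 4                 # nodes: 4
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: nvcr.io/nvidia/pytorch:24.04-py3
              resources:
                limits:
                  nvidia.com/gpu: 4   # gpus: 4/node
                  rdma/hca: 4         # one HCA per GPU rail
```

The two resource limits are what the admission webhook in step 2 checks quotas against, and what the device plugins in step 7 translate into /dev/nvidia* and /dev/infiniband/uverbs* mounts.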

The five states at a glance — which network is active

| # | State | Active network(s) | Util |
|---|---|---|---|
| 1 | Submit → Schedule → Pods running | Frontend only | ~1.5% |
| 2 | NCCL bootstrap | Frontend (TCP rendezvous) → Backend (QPs come up) | ~5% |
| 3 | Data ingest + Forward pass | Storage Network + NVLink intra-server | ~28% |
| 4 | Backward + AllReduce | Backend Network | ~98% |
| 5 | Optimizer + Checkpoint + Loop | Storage Network (write back) | ~42% |

Notice: the Backend Network — the expensive 400 G RoCEv2 monster — only does real work during State 4, sitting dark roughly 70% of the time. Meanwhile the other three networks see almost none of a job's collective traffic at all. That's why AI fabric design is all about the Backend, and specifically about State 4.
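State 4's dominance is easy to sanity-check with back-of-envelope arithmetic. The sketch below uses purely illustrative numbers (16 GPUs, a 7B-parameter model with 2-byte gradients, one dedicated 400 G rail per GPU — none of these figures come from the page itself), plus the standard ring-AllReduce result that each participant moves 2·(N−1)/N of the buffer.

```python
# Back-of-envelope: per-GPU traffic for one ring AllReduce over gradients.
GPUS = 16                    # 4 nodes x 4 GPUs (matches the example job)
PARAM_BYTES = 7e9 * 2        # assumed: 7B parameters, 2-byte (bf16) gradients

# Ring AllReduce: each GPU sends/receives 2*(N-1)/N of the buffer.
per_gpu_bytes = 2 * (GPUS - 1) / GPUS * PARAM_BYTES

NIC_GBPS = 400               # assumed: one 400G rail per GPU, ideal line rate
ideal_seconds = per_gpu_bytes * 8 / (NIC_GBPS * 1e9)

print(f"{per_gpu_bytes / 1e9:.1f} GB per GPU per AllReduce")
print(f"{ideal_seconds * 1e3:.0f} ms at 400G line rate")
```

Even at perfect line rate that is hundreds of milliseconds of near-100% load on every rail simultaneously — exactly the ~98% spike the table shows for State 4, and nothing the other states come close to.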

Three things a network engineer should walk away with

  1. The Backend Network only matters during AllReduce (State 4). Everything before it is setup; everything after is local or storage. Design and troubleshoot the Backend with State 4 as the north star.
  2. AllReduce is synchronised, not random. All 16 GPUs post RDMA WRITEs in lockstep. That's the traffic shape that breaks ECMP and motivates rail-optimised topology + adaptive routing. See Hash Polarization.
  3. One iteration ≈ hundreds of milliseconds. Training does ≈1 million of them. A 1 ms tail delay per iteration compounds into hours of wasted GPU time. Tail latency is the design constraint, not average throughput.
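The compounding claim in point 3 checks out arithmetically. A quick sketch using the text's own round numbers (1 ms tail per iteration, ~1 million iterations, and the example job's 16 GPUs stalling together):

```python
# Tail-latency compounding: small per-iteration delay, large aggregate cost.
tail_ms = 1.0                # assumed extra tail delay per iteration
iterations = 1_000_000       # typical full-training iteration count from the text
gpus = 16                    # all GPUs block on the same collective

wall_clock_s = tail_ms / 1000 * iterations   # extra wall-clock time
gpu_hours = wall_clock_s * gpus / 3600       # wasted GPU-hours

print(f"{wall_clock_s / 60:.0f} extra minutes of wall clock")
print(f"{gpu_hours:.1f} wasted GPU-hours")
```

Because AllReduce is a barrier, one slow flow delays all 16 GPUs, which is why the waste is counted in GPU-hours rather than just wall clock — and why p99 flow-completion time, not average throughput, is the metric to engineer for.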