Life of an AI Job in Fabric
When you type google.com into a browser, you can recite the lifecycle from memory — DNS, TCP handshake, TLS, HTTP request, response, render. This page does the same for an AI training job. Five clean states, every event in order, every protocol named.
First: four physical networks in a production AI cluster
A real AI cluster has four physically separate networks, each with its own leaves and spines. They share the same datacenter floor but never share switches:
| Network | Speed | Who's on it | Carries |
|---|---|---|---|
| 🟣 Backend Network | 400 G RoCEv2 | GPU servers (4–8 NICs each, one per GPU rail) | Collective communication — AllReduce, AllGather. Lossless via PFC + ECN. |
| 🟢 Frontend Network | 100 G / 25 G | X86 head nodes, k8s control plane | kubectl, API server, etcd, image pulls, monitoring, NCCL TCP rendezvous, SSH |
| 🔵 Storage Network | 100 G | X86 head nodes + storage servers | NVMe-oF batches in, checkpoints out — its own dedicated fabric |
| ⚫ Out-of-Band (OOB) | 1 G | All switches + servers (BMC/IPMI ports) | Switch management, hardware health, firmware updates — totally separate from data path |
GPU servers sit on the Backend only (for collectives). X86 head nodes sit on both Frontend + Storage (control plane and data access). Storage servers sit on the Storage Network only. Every switch and server is also reachable on the OOB for management.
NVLink is not on this picture — it's a GPU-to-GPU bus inside each server chassis (silicon-level, ~900 GB/s per H100). It never crosses any external network.
Submit → Schedule → Pods running
From kubectl apply to four containers actually running. Pure control plane — Kubernetes API, etcd, scheduler, kubelet, container runtime — all on the Frontend Network. The Backend (RoCEv2) and Storage networks are completely idle.
1. You run `kubectl apply -f myjob.yaml`. The spec says: `nodes: 4`, `gpus: 4/node`, `rdma/hca: 4`, container image.
2. kube-apiserver validates the PyTorchJob and runs the Training Operator's admission webhook (checks GPU + RDMA quotas).
3. Spec is written to etcd. Scheduler is notified of 4 pending pods.
4. Scheduler does gang scheduling: all 4 pods are placed atomically, or the job waits. No partial allocation.
5. Topology-aware placement: prefers nodes with the same `topology.kubernetes.io/rack` label to minimise cross-leaf traffic.
6. kubelet on each chosen node receives the pod spec and pulls the container image (`nvcr.io/nvidia/pytorch:24.04-py3`, ~10 GB).
7. Device plugins mount `/dev/nvidia0..3` and `/dev/infiniband/uverbs0..3` into the container namespace (see the sketch after this list).
8. Containers start. Pods running.
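Step 7 is easy to verify from inside any of the four containers. A minimal sanity-check sketch, assuming the 4-GPU / 4-HCA layout above and the NGC PyTorch image; it is not part of the job itself:

```python
# Sanity check for the device-plugin mounts from step 7 (a sketch, not part
# of the training job): count CUDA devices and RDMA verbs devices visible
# inside the container namespace.
import glob

import torch

gpus = torch.cuda.device_count()                     # backed by /dev/nvidia0..3
hcas = sorted(glob.glob("/dev/infiniband/uverbs*"))  # uverbs0..3 if step 7 worked
print(f"GPUs visible: {gpus}, RDMA devices: {hcas}")
assert gpus == 4 and len(hcas) == 4, "device-plugin mounts incomplete"
```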
- Key takeaway: the Backend fabric saw zero RDMA traffic during this entire state. Everything was TCP control plane.
The five states at a glance — which network is active
| # | State | Active network(s) | Util (of active network) |
|---|---|---|---|
| 1 | Submit → Schedule → Pods running | Frontend only | ~1.5% |
| 2 | NCCL bootstrap | Frontend (TCP rendezvous) → Backend (QPs come up) | ~5% |
| 3 | Data ingest + Forward pass | Storage Network + NVLink intra-server | ~28% |
| 4 | Backward + AllReduce | Backend Network | ~98% |
| 5 | Optimizer + Checkpoint + Loop | Storage Network (write back) | ~42% |
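State 2 is worth one concrete illustration, since the table is all it gets on this page. From the application's side the bootstrap is just a few lines; a minimal sketch assuming a torchrun-style launch, where `head-node-0.frontend.local` is a hypothetical Frontend hostname:

```python
# State 2, NCCL bootstrap, as seen from PyTorch (sketch; torchrun normally
# sets RANK, WORLD_SIZE and the MASTER_* variables for you). The rendezvous
# itself is plain TCP on the Frontend; NCCL brings up RoCEv2 queue pairs on
# the Backend NICs when the first communicator is created.
import os

import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "head-node-0.frontend.local")  # hypothetical
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="nccl")  # blocks until all 16 ranks check in
print(f"rank {dist.get_rank()} of {dist.get_world_size()} bootstrapped")
```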
Notice: the Backend Network, the expensive 400 G RoCEv2 monster, only does real work during State 4. It's dark ~70% of the time, and during a job's collective phase the other three networks see almost no traffic. That's why AI fabric design is all about the Backend, and specifically about State 4.
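State 4 itself, the only work the Backend exists for, is a single call per gradient bucket. Continuing the sketch above, with an illustrative bucket size:

```python
# The State-4 traffic source: every rank posts the same collective at the
# same moment, producing the synchronized RDMA WRITE pattern discussed below.
import torch
import torch.distributed as dist

grad = torch.randn(64 * 1024 * 1024, device="cuda")  # one ~256 MB gradient bucket
dist.all_reduce(grad, op=dist.ReduceOp.SUM)          # the Backend fabric lights up
grad /= dist.get_world_size()                        # mean gradient across 16 GPUs
```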
Three things a network engineer should walk away with
- The Backend Network only matters during AllReduce (State 4). Everything before it is setup; everything after is local or storage. Design and troubleshoot the Backend with State 4 as the north star.
- AllReduce is synchronised, not random. All 16 GPUs post RDMA WRITEs in lockstep. That's the traffic shape that breaks ECMP and motivates rail-optimised topology + adaptive routing. See Hash Polarization.
- One iteration ≈ hundreds of milliseconds. Training does ≈1 million of them. A 1 ms tail delay per iteration compounds into hours of wasted GPU time (see the arithmetic after this list). Tail latency is the design constraint, not average throughput.
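The arithmetic behind that last point, using this page's 16-GPU job (a back-of-envelope sketch; real iteration counts and delays vary):

```python
# Why a 1 ms tail per iteration matters: over ~1M iterations it costs roughly
# 17 minutes of wall clock, i.e. ~4.4 GPU-hours on this 16-GPU job, and
# proportionally more on thousand-GPU clusters.
iterations = 1_000_000
tail_delay_s = 0.001                               # 1 ms of tail per iteration
gpus = 16

wall_clock_lost_s = iterations * tail_delay_s      # 1,000 s
gpu_hours_lost = wall_clock_lost_s * gpus / 3600   # ~4.4 GPU-hours
print(f"{wall_clock_lost_s / 60:.0f} min wall clock, {gpu_hours_lost:.1f} GPU-hours lost")
```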