20.1 Life of an AI Job in Fabric
What actually happens on the wire when an AI training job runs: five states, from kubectl submission through AllReduce to checkpoint, mapped to the four production networks (Backend, Frontend, Storage, OOB).
An end-to-end animated walk-through of a training job moving through the fabric: kubectl submission, gang scheduling, container launch, NCCL rendezvous, Queue Pair setup, memory registration, the forward and backward passes, ring AllReduce over RoCEv2, the optimizer step, checkpointing, and the iteration loop. This is the capstone: every concept from prior chapters, in motion.
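The ring AllReduce step named above can be sketched in a few lines. What follows is a minimal single-process simulation, not NCCL's implementation: it assumes each rank's gradient buffer is a plain Python list, that the buffer length divides evenly by the number of ranks, and that "sending" is just a list copy. The function name `ring_allreduce` and its `buffers` parameter are hypothetical, chosen only for illustration. Nothing here models Queue Pairs, memory registration, or RoCEv2 transport.

```python
def ring_allreduce(buffers):
    """Simulate a sum ring AllReduce across len(buffers) ranks.

    buffers[r] is rank r's local gradient vector. Returns one list per
    rank; after the call every rank holds the elementwise sum.
    """
    n = len(buffers)
    length = len(buffers[0])
    assert all(len(b) == length for b in buffers), "ranks must agree on size"
    assert length % n == 0, "sketch assumes length divisible by rank count"
    chunk = length // n
    data = [list(b) for b in buffers]  # copy so the inputs stay untouched

    def span(c):
        # Index range covered by chunk c.
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1, reduce-scatter: n-1 steps. At each step rank r sends one
    # chunk to its right-hand neighbor, which adds it into its own copy.
    for step in range(n - 1):
        # Snapshot every outgoing chunk first so all sends in a step
        # see the state from before that step, as on a real ring.
        outgoing = []
        for rank in range(n):
            c = (rank - step) % n
            outgoing.append((c, [data[rank][i] for i in span(c)]))
        for rank in range(n):
            dst = (rank + 1) % n
            c, payload = outgoing[rank]
            for offset, i in enumerate(span(c)):
                data[dst][i] += payload[offset]

    # Rank r now holds the fully reduced chunk (r + 1) % n.
    # Phase 2, all-gather: n-1 steps. Each rank forwards the finished
    # chunk it most recently completed or received; the receiver overwrites.
    for step in range(n - 1):
        outgoing = []
        for rank in range(n):
            c = (rank + 1 - step) % n
            outgoing.append((c, [data[rank][i] for i in span(c)]))
        for rank in range(n):
            dst = (rank + 1) % n
            c, payload = outgoing[rank]
            for offset, i in enumerate(span(c)):
                data[dst][i] = payload[offset]

    return data


if __name__ == "__main__":
    # Two ranks, two elements each: both end with the elementwise sum.
    print(ring_allreduce([[1, 2], [3, 4]]))  # [[4, 6], [4, 6]]
```

In this scheme each rank transmits 2(n-1) chunks of size length/n, so per-rank traffic stays close to twice the buffer size regardless of how many ranks join the ring, which is one reason the pattern suits the Backend network.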