Life of an AI Job in Fabric
What actually happens on the wire when an AI training job runs — five clear states from kubectl submit to AllReduce to checkpoint, mapped to the four production networks (Backend, Frontend, Storage, OOB).
An end-to-end animated walk-through of a training job moving through the fabric: kubectl submission, gang scheduling, container launch, NCCL rendezvous, Queue Pair setup, memory registration, forward and backward pass, ring AllReduce over RoCEv2, optimizer step, checkpoint, and the iteration loop. Twenty-four steps across six phases, all driven by the concepts in the prior sections.
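Of the steps listed above, ring AllReduce is the one that generates most of the backend traffic, so a small illustration may help. The sketch below simulates the two phases of a ring AllReduce (reduce-scatter, then all-gather) in a single Python process. It is not NCCL and sends nothing over RoCEv2; the function name `ring_allreduce` and the list-of-lists representation of per-rank gradient buffers are assumptions made for illustration only.

```python
def ring_allreduce(rank_buffers):
    """Simulate a sum ring AllReduce over n ranks (illustrative only).

    rank_buffers: list of n equal-length lists, one per rank.
    The buffer length must be divisible by n so it splits into n chunks.
    Returns a new list of n lists, each holding the elementwise sum.
    """
    n = len(rank_buffers)
    size = len(rank_buffers[0])
    if any(len(b) != size for b in rank_buffers) or size % n != 0:
        raise ValueError("buffers must be equal length and divisible by n")
    chunk = size // n
    bufs = [list(b) for b in rank_buffers]  # copy; inputs are not mutated

    def span(c):
        # Index range covered by chunk c.
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1: reduce-scatter. At each step every rank sends one chunk to
    # its right-hand neighbour, which adds it into its own copy.
    for step in range(n - 1):
        # Snapshot outgoing data first to model simultaneous sends.
        sends = []
        for r in range(n):
            c = (r - step) % n
            sends.append((c, [bufs[r][i] for i in span(c)]))
        for r in range(n):
            dst = (r + 1) % n
            c, data = sends[r]
            for i, v in zip(span(c), data):
                bufs[dst][i] += v
    # Rank r now holds the fully reduced chunk (r + 1) % n.

    # Phase 2: all-gather. Each rank forwards the reduced chunk it most
    # recently completed or received; the neighbour overwrites its copy.
    for step in range(n - 1):
        sends = []
        for r in range(n):
            c = (r + 1 - step) % n
            sends.append((c, [bufs[r][i] for i in span(c)]))
        for r in range(n):
            dst = (r + 1) % n
            c, data = sends[r]
            for i, v in zip(span(c), data):
                bufs[dst][i] = v
    return bufs


if __name__ == "__main__":
    grads = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
    print(ring_allreduce(grads))
```

In this simulation each rank sends one chunk per step for 2(n - 1) steps, so per-rank traffic stays close to twice the buffer size regardless of how many ranks join the ring.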