📦
[cite: L7]
NCCL in Container Images
Bake NCCL into Docker
No SR-IOV, simpler setup
↓
🧮
[cite: L12]
Training Code Calls NCCL
Backward pass triggers
all-reduce across GPUs
↓
[cite: L12.5]
NCCL Rank Discovery
Via Kubernetes env vars
RANK, WORLD_SIZE
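
As a rough illustration of the rank discovery step, the sketch below reads the two variables named above from the process environment. Only the names RANK and WORLD_SIZE come from the diagram. The helper name `read_rank_info` is hypothetical, and the sketch assumes the job launcher has already injected both values into the pod.

```python
import os


def read_rank_info(env=None):
    """Return (rank, world_size) parsed from environment variables.

    Hypothetical helper: assumes the launcher sets RANK and WORLD_SIZE
    as decimal integers in the container environment.
    """
    env = os.environ if env is None else env
    try:
        rank = int(env["RANK"])
        world_size = int(env["WORLD_SIZE"])
    except KeyError as missing:
        raise RuntimeError(f"missing environment variable: {missing}") from None
    # A rank must identify exactly one of the world_size processes.
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside [0, {world_size})")
    return rank, world_size


if __name__ == "__main__":
    # Example values stand in for what a pod spec would provide.
    print(read_rank_info({"RANK": "1", "WORLD_SIZE": "4"}))
```

Passing a plain dictionary instead of `os.environ` keeps the helper easy to test outside a cluster.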
↓
[cite: L13]
Initialize Comm Topology
Ring or Tree all-reduce
over physical NICs
↓
[cite: L15]
Training Job Completes
Pods tear down, GPUs
released, checkpoint saved
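
The ring all-reduce named in the topology step can be sketched as a small pure-Python simulation. This is not NCCL's implementation, and the function name `ring_all_reduce` is hypothetical. It assumes each of the n workers holds a buffer split into exactly n numeric chunks, and it models the two textbook phases: a reduce-scatter, then an all-gather, each taking n - 1 steps around the ring.

```python
def ring_all_reduce(buffers):
    """Simulate a sum all-reduce over a ring of len(buffers) workers.

    Illustrative only: each worker's buffer must contain exactly one
    chunk per worker, and each chunk is a single number here.
    """
    n = len(buffers)
    data = [list(b) for b in buffers]  # copy so inputs are not mutated
    if any(len(b) != n for b in data):
        raise ValueError("each buffer needs exactly one chunk per worker")

    # Phase 1, reduce-scatter: in each step worker i sends one chunk to
    # its right neighbour, which adds it to its own copy of that chunk.
    for step in range(n - 1):
        sends = [(i, (i - step) % n) for i in range(n)]
        values = [data[i][c] for i, c in sends]  # snapshot before writing
        for (i, c), value in zip(sends, values):
            data[(i + 1) % n][c] += value
    # Now worker i holds the fully reduced chunk (i + 1) % n.

    # Phase 2, all-gather: the finished chunks travel around the ring,
    # overwriting the partial sums held by each neighbour.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n) for i in range(n)]
        values = [data[i][c] for i, c in sends]
        for (i, c), value in zip(sends, values):
            data[(i + 1) % n][c] = value
    return data


if __name__ == "__main__":
    grads = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
    print(ring_all_reduce(grads))
```

In this scheme each worker sends 2(n - 1) chunks in total, so per-worker traffic stays close to twice the buffer size regardless of how many workers join the ring.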