DESIGN 01 / 05 — KUBERNETES + SR-IOV + RoCE (HIGH-LEVEL HPC ARCHITECTURE)
Best for: Large shared clusters, multi-tenant environments [cite: Overview]
4. ML/AI APPLICATIONS & TRAINING
[cite: L7, L12–L13, L15]
📦
[cite: L7]
NCCL in Container Images
Bake NCCL into the container image.
Start from the pytorch:24.01-py3 base image, which already ships NCCL.
Pod with two GPUs
🧮
[cite: L12]
Training Code Calls NCCL
Backward pass triggers
all-reduce across GPUs
🔍
[cite: L12.5]
NCCL Rank Discovery
The training framework reads the RANK, WORLD_SIZE,
and MASTER_ADDR env vars to bootstrap NCCL
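A minimal sketch of the rank-discovery step, assuming the three variables are injected into the pod environment as plain strings. The `RankInfo` type and `read_rank_info` helper are illustrative names, not part of NCCL or any framework.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class RankInfo:
    rank: int
    world_size: int
    master_addr: str


def read_rank_info(env=None):
    """Read and validate the rendezvous variables named in the step above."""
    env = os.environ if env is None else env
    missing = [k for k in ("RANK", "WORLD_SIZE", "MASTER_ADDR") if k not in env]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    rank = int(env["RANK"])
    world_size = int(env["WORLD_SIZE"])
    # Ranks are zero-based, so the valid range is [0, world_size).
    if not 0 <= rank < world_size:
        raise ValueError(f"RANK {rank} is outside [0, {world_size})")
    return RankInfo(rank, world_size, env["MASTER_ADDR"])
```

Passing a plain dict instead of `os.environ` makes the helper easy to exercise outside a pod, for example `read_rank_info({"RANK": "3", "WORLD_SIZE": "8", "MASTER_ADDR": "10.0.0.1"})`.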
🌐
[cite: L13]
Initialize Comm Topology
Ring or Tree all-reduce
across all GPU ranks
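To make the ring variant concrete, here is a single-process simulation of a ring all-reduce (sum), assuming each rank's buffer is a Python list whose length is divisible by the number of ranks. It only illustrates the data movement (reduce-scatter, then all-gather); it is not NCCL's implementation, and `ring_all_reduce` is a hypothetical name.

```python
def ring_all_reduce(buffers):
    """Sum-all-reduce equal-length lists, one list per simulated rank."""
    n = len(buffers)
    if n == 0:
        raise ValueError("need at least one rank")
    length = len(buffers[0])
    if any(len(b) != length for b in buffers) or length % n != 0:
        raise ValueError("buffers must share a length divisible by the rank count")
    chunk = length // n
    data = [list(b) for b in buffers]  # work on copies, leave inputs untouched

    def span(c):
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1, reduce-scatter: in each step every rank sends one chunk to its
    # right-hand neighbour, which adds it to its own copy of that chunk.
    for step in range(n - 1):
        sends = [(r, (r - step) % n) for r in range(n)]
        payloads = [data[r][span(c)] for r, c in sends]  # snapshot before updating
        for (r, c), payload in zip(sends, payloads):
            dst = (r + 1) % n
            data[dst][span(c)] = [a + b for a, b in zip(data[dst][span(c)], payload)]

    # Phase 2, all-gather: each rank now owns one fully reduced chunk and
    # forwards completed chunks around the ring until every rank has all of them.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n) for r in range(n)]
        payloads = [data[r][span(c)] for r, c in sends]
        for (r, c), payload in zip(sends, payloads):
            data[(r + 1) % n][span(c)] = payload
    return data
```

Each rank sends and receives 2 × (N − 1) chunks in total, which is why the per-rank traffic of a ring stays roughly constant as ranks are added.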
[cite: L15]
Training Job Completes
Pods tear down, GPUs released
Checkpoint → Shared Storage
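One way to write the final checkpoint safely is to write to a temporary file on the shared mount and rename it into place, so readers never see a partial file. This is a sketch under assumptions: the checkpoint is already serialized to bytes, and the directory and file naming scheme are hypothetical.

```python
import os
import tempfile


def save_checkpoint(payload: bytes, directory: str, step: int) -> str:
    """Write payload to <directory>/ckpt-<step>.bin via a temp file and rename."""
    os.makedirs(directory, exist_ok=True)
    final_path = os.path.join(directory, f"ckpt-{step:08d}.bin")
    # Create the temp file in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # push data to storage before publishing the name
        os.replace(tmp_path, final_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return final_path
```

Rename semantics on network filesystems can differ from a local disk, so treat the atomicity here as something to verify on the actual NFS or Lustre mount.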
3. KUBERNETES ORCHESTRATION
[cite: L3.5, L4, L9, L10]
☸️
[cite: L4]
Install Kubernetes
kubeadm init
Master + Worker nodes
📡
[cite: L3.5]
GPU Resource Tracking
NVIDIA Device Plugin
DaemonSet on every node
☸️
[cite: L4]
Cluster Control Plane
(Master Node)
☸ Kubelet (one per worker node)
📋
[cite: L9]
Create Pod Specs
GPU requests + SR-IOV VF
annotations + NCCL env vars
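The pod spec described above can be sketched as a Python dict that serializes to a manifest. `nvidia.com/gpu` is the resource name exposed by the NVIDIA device plugin and `k8s.v1.cni.cncf.io/networks` is the Multus annotation key; the network name `sriov-roce-net`, the VF resource name `example.com/sriov_vf`, and the NCCL values are placeholders that depend on how the cluster is configured.

```python
import json


def build_training_pod(name, image, gpus, rank, world_size, master_addr,
                       network="sriov-roce-net",           # hypothetical network attachment name
                       vf_resource="example.com/sriov_vf",  # hypothetical SR-IOV resource name
                       nccl_env=None):
    """Return a Pod manifest with GPU requests, an SR-IOV VF, and NCCL env vars."""
    env = {
        "RANK": str(rank),
        "WORLD_SIZE": str(world_size),
        "MASTER_ADDR": master_addr,
    }
    # Placeholder NCCL setting; real values depend on the node's interface names.
    env.update(nccl_env or {"NCCL_SOCKET_IFNAME": "eth0"})
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": name,
            # Multus reads this annotation to attach the secondary (VF) interface.
            "annotations": {"k8s.v1.cni.cncf.io/networks": network},
        },
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": "trainer",
                "image": image,
                "env": [{"name": k, "value": v} for k, v in env.items()],
                "resources": {"limits": {"nvidia.com/gpu": gpus, vf_resource: 1}},
            }],
        },
    }


if __name__ == "__main__":
    pod = build_training_pod("trainer-0", "pytorch:24.01-py3", 8, 0, 2, "10.0.0.1")
    print(json.dumps(pod, indent=2))
```

The printed JSON can be fed to `kubectl apply -f -`, since Kubernetes accepts JSON manifests as well as YAML.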
📌
[cite: L10]
Scheduler Places Pods
Assigns each pod to a node based on
GPU availability data
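As a toy illustration of placement by GPU availability, the function below picks the node with the fewest free GPUs that still satisfies the request (a best-fit rule). This is an assumption chosen for the example, not the kube-scheduler's actual filtering and scoring logic; `pick_node` and the node names are hypothetical.

```python
def pick_node(free_gpus, requested):
    """Return the best-fit node name for a GPU request, or None if nothing fits.

    free_gpus maps node name -> number of unallocated GPUs on that node.
    """
    if requested < 0:
        raise ValueError("requested GPU count must be non-negative")
    # Filter: keep only nodes that can satisfy the request.
    candidates = [(free, name) for name, free in free_gpus.items() if free >= requested]
    if not candidates:
        return None
    # Score: prefer the tightest fit; ties break alphabetically by node name.
    return min(candidates)[1]
```

For example, `pick_node({"node-a": 8, "node-b": 2, "node-c": 4}, 4)` selects `node-c`, leaving the fully free node available for a larger job.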
2. NETWORKING LAYER (RoCE FABRIC)
[cite: L4.5, L5, L6, L11, L13.5, L14]
🔌
[cite: L4.5]
Multus CNI Plugin
Enables multiple network interfaces per pod.
Without Multus, a pod gets only one interface.
⚠ Critical: SR-IOV needs Multus!
🔀
[cite: L6]
Ethernet Switches
Leaf-Spine Topology
🖧
[cite: L11]
CNI Assigns SR-IOV VFs & IPs
Each pod gets dedicated
VF + IP for RDMA traffic
🔗
[cite: L13.5]
NCCL Creates RDMA Queue Pairs
Send queue + receive queue +
completion queue per GPU pair
⚡ DIRECT RDMA PATH (800 Gbps links)
1. INFRASTRUCTURE & BARE METAL
[cite: L1–L3, L8]
🐧
[cite: L1]
Install Linux OS
Ubuntu 22.04 LTS
Directly on hardware
NVIDIA B300 Servers
🎮
[cite: L3]
Install NVIDIA Drivers & CUDA
CUDA Toolkit 12.3
nvidia-smi verification
[cite: L2]
Enable SR-IOV on 800G NICs
Up to 16 VFs per NIC
sriov_numvfs = 16
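The `sriov_numvfs` setting is a sysfs attribute under the NIC's PCI device. A minimal sketch of setting it from Python, assuming root privileges, a driver with SR-IOV support, and a hypothetical interface name supplied by the caller; the `sysfs_root` parameter exists only so the function can be tried against a scratch directory.

```python
from pathlib import Path


def set_num_vfs(interface, num_vfs, sysfs_root="/sys/class/net"):
    """Set the number of SR-IOV virtual functions for a network interface."""
    device = Path(sysfs_root) / interface / "device"
    total = int((device / "sriov_totalvfs").read_text().strip())
    if not 0 <= num_vfs <= total:
        raise ValueError(f"{interface} supports at most {total} VFs, got {num_vfs}")
    target = device / "sriov_numvfs"
    current = int(target.read_text().strip())
    if current == num_vfs:
        return num_vfs
    # The kernel expects the VF count to be reset to 0 before a new
    # non-zero value is written.
    if current != 0:
        target.write_text("0")
    if num_vfs != 0:
        target.write_text(str(num_vfs))
    return num_vfs
```

A call such as `set_num_vfs("ens1f0", 16)` (interface name hypothetical) corresponds to the 16 VFs shown above. The value does not persist across reboots unless it is reapplied at boot.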
STORAGE NETWORK
[cite: L8]
🗄 Shared Storage
NFS / Lustre
Object Storage
Checkpoints
Datasets