DESIGN 05 / 05 - HYBRID: KUBERNETES + SLURM + RoCE (HIGH-LEVEL HPC ARCHITECTURE)
Best for: Large enterprises needing both cloud-native and traditional HPC on one cluster [cite: Overview]
4. ML/AI APPLICATIONS & TRAINING (DUAL)
[cite: L7, L10, L12, L15]
☸ KUBERNETES WORKLOADS
📦
[cite: L7]
Containerized Pods + NCCL
K8s pods with
GPU + RDMA access
→
🧮
[cite: L12]
NCCL all-reduce
K8s env vars for
rank discovery
⇄
■ SLURM WORKLOADS
💻
[cite: L10]
Native Batch Jobs + NCCL
Slurm SBATCH
System-wide NCCL
→
🧮
[cite: L12]
NCCL all-reduce
SLURM_PROCID
rank discovery
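A minimal sketch of how one training entry point could discover its rank under either scheduler. SLURM_PROCID and SLURM_NTASKS are set by Slurm for each task; the Kubernetes variable names RANK and WORLD_SIZE are an assumption here (the diagram only says "K8s env vars"), as is the helper name discover_rank.

```python
import os
from typing import Mapping, Tuple


def discover_rank(env: Mapping[str, str] = os.environ) -> Tuple[int, int, str]:
    """Return (rank, world_size, source) for the scheduler that launched this process."""
    # Slurm exports SLURM_PROCID (global task rank) and SLURM_NTASKS (task count).
    if "SLURM_PROCID" in env and "SLURM_NTASKS" in env:
        return int(env["SLURM_PROCID"]), int(env["SLURM_NTASKS"]), "slurm"
    # Assumption: the pod spec or a training operator injects RANK and WORLD_SIZE.
    if "RANK" in env and "WORLD_SIZE" in env:
        return int(env["RANK"]), int(env["WORLD_SIZE"]), "kubernetes"
    # Neither scheduler detected: behave as a single-process run.
    return 0, 1, "single"
```

The resulting rank and world size would then be passed to whatever framework initializes the NCCL communicator, so the same training script can run from a pod or from an sbatch job.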
→
✅
[cite: L15]
Both Save to Shared Storage
Same Lustre/NFS
Checkpoints unified
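Because both schedulers write to the same Lustre/NFS mount, a job on one side may read a checkpoint while a job on the other side is still writing it. The sketch below shows one common way to reduce that risk: write to a temporary file in the same directory, then rename it into place. The file naming scheme and the helper names are hypothetical, and rename atomicity on a given NFS or Lustre setup should be verified rather than assumed.

```python
import os
import tempfile
from typing import Optional


def save_checkpoint(shared_dir: str, step: int, payload: bytes) -> str:
    """Write a checkpoint so that readers do not observe a partially written file."""
    os.makedirs(shared_dir, exist_ok=True)
    final_path = os.path.join(shared_dir, f"ckpt_step{step:08d}.bin")
    # The temp file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=shared_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # push data to storage before publishing the name
        os.replace(tmp_path, final_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return final_path


def latest_checkpoint(shared_dir: str) -> Optional[str]:
    """Return the highest-step checkpoint path, or None if there is none."""
    names = sorted(
        n for n in os.listdir(shared_dir)
        if n.startswith("ckpt_step") and n.endswith(".bin")
    )
    return os.path.join(shared_dir, names[-1]) if names else None
```

Zero-padding the step number keeps lexical order equal to numeric order, so a Slurm job can resume from a checkpoint written by a Kubernetes pod (or the reverse) by calling latest_checkpoint on the shared directory.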
⬇
3. DUAL ORCHESTRATION (K8s + SLURM)
[cite: L6, L7, L8.5]
☸ KUBERNETES (K8s Nodes)
☸️
[cite: L6]
K8s Master + kubelets
NVIDIA device plugin
Multus + SR-IOV
kubectl apply -f job.yaml
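As an illustration of what the submitted job might contain, the sketch below builds a Kubernetes Job manifest as a Python dict. The nvidia.com/gpu resource is the one advertised by the NVIDIA device plugin, and k8s.v1.cni.cncf.io/networks is the Multus annotation for attaching a secondary network. The network name, the SR-IOV resource name, the image, and the function name are placeholders that depend on the cluster configuration.

```python
import json


def build_job_manifest(name: str, image: str, gpus: int,
                       rdma_network: str, rdma_resource: str) -> dict:
    """Build a batch/v1 Job that requests GPUs and one SR-IOV VF for RDMA."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "metadata": {
                    # Multus attaches the named secondary (SR-IOV) network to the pod.
                    "annotations": {"k8s.v1.cni.cncf.io/networks": rdma_network},
                },
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "trainer",
                        "image": image,
                        "resources": {
                            # rdma_resource is cluster-specific (hypothetical here).
                            "limits": {"nvidia.com/gpu": gpus, rdma_resource: 1},
                        },
                    }],
                },
            },
        },
    }


if __name__ == "__main__":
    manifest = build_job_manifest(
        name="nccl-train",
        image="registry.example.com/train:latest",  # placeholder image
        gpus=8,
        rdma_network="sriov-rdma-net",              # placeholder network name
        rdma_resource="example.com/sriov_rdma",     # placeholder resource name
    )
    print(json.dumps(manifest, indent=2))
```

kubectl accepts JSON as well as YAML, so the printed output can be saved to a file and applied the same way as job.yaml.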
⚖️
Volcano / Kueue
Resource Coordinator
Prevents GPU conflicts
Fair-share scheduling
[cite: L8.5]
■ SLURM (Slurm Nodes)
🖥️
[cite: L7]
slurmctld + slurmd
GRES GPU tracking
Batch job scheduler
sbatch train_job.sh
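For the Slurm side, a script like train_job.sh could be generated as sketched below. The #SBATCH options shown (--job-name, --nodes, --ntasks-per-node, --gres) are standard sbatch options; the one-task-per-GPU layout, the training command, and the function name are assumptions for illustration.

```python
def render_sbatch(job_name: str, nodes: int, gpus_per_node: int, command: str) -> str:
    """Render a batch script that requests GPUs through GRES and launches one task per GPU."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks-per-node={gpus_per_node}",  # assumption: one rank per GPU
        f"#SBATCH --gres=gpu:{gpus_per_node}",         # tracked by Slurm GRES
        "",
        f"srun {command}",  # srun sets SLURM_PROCID for each launched task
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical job: 4 nodes with 8 GPUs each.
    print(render_sbatch("nccl-train", nodes=4, gpus_per_node=8,
                        command="python train.py"), end="")
```

Writing the output to train_job.sh and running sbatch train_job.sh would submit it to slurmctld.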
⬇
2. SHARED NETWORKING LAYER (RoCE FABRIC)
[cite: L5, L6, L13, L14]
⚙️
[cite: L5]
Shared RoCE Fabric
Same switches for
BOTH K8s & Slurm
→
🔀
[cite: L6]
Ethernet Switches
Leaf-Spine
PFC + ECN + QoS
⟶
🔀
[cite: L5]
QoS Traffic Separation
K8s RDMA traffic class
Slurm RDMA traffic class
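One way to put the two schedulers into separate traffic classes is to give each a different DSCP value and pass the matching traffic class to NCCL through the NCCL_IB_TC environment variable. The sketch below shows only the arithmetic: DSCP occupies the upper six bits of the 8-bit traffic class field, so the value is the DSCP shifted left by two. The DSCP values 26 and 34 are assumptions for illustration; the real values have to match the QoS policy configured on the switches.

```python
# Assumed DSCP markings per scheduler; these are NOT taken from the design.
ASSUMED_DSCP = {"kubernetes": 26, "slurm": 34}


def dscp_to_traffic_class(dscp: int) -> int:
    """Place a 6-bit DSCP value in the upper bits of the 8-bit traffic class field."""
    if not 0 <= dscp <= 63:
        raise ValueError(f"DSCP must be in 0..63, got {dscp}")
    return dscp << 2  # the lower two bits are left for ECN


def nccl_qos_env(scheduler: str) -> dict:
    """Return the environment variable a job would export to mark its RDMA traffic."""
    return {"NCCL_IB_TC": str(dscp_to_traffic_class(ASSUMED_DSCP[scheduler]))}


if __name__ == "__main__":
    for name in ASSUMED_DSCP:
        print(name, nccl_qos_env(name))
```

With these assumed values, Kubernetes pods would export NCCL_IB_TC=104 and Slurm jobs NCCL_IB_TC=136, and the switches would map each marking to its own queue.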
⟶
🔗
[cite: L13]
RDMA Queue Pairs
K8s: via SR-IOV VFs
Slurm: via Physical NICs
⟶
800 Gbps
RDMA
⚡ SHARED RDMA FABRIC
⬇
1. INFRASTRUCTURE & BARE METAL
[cite: L1-L4, L8]
🐧
[cite: L1]
Install Linux OS
All B300 servers
Bare metal foundation
→
B300 Servers
โ†’
🎮
[cite: L2]
Install NVIDIA Drivers & CUDA
System-wide CUDA
All nodes
→
🔀
[cite: L3]
Partition GPUs
K8s nodes: SR-IOV VFs
Slurm nodes: Physical NICs
OR use NVIDIA MIG
→
🔌
[cite: L4]
Enable SR-IOV on K8s Nodes
16 VFs per NIC
Slurm nodes use physical NICs
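On Linux, the number of SR-IOV virtual functions is typically set by writing to the sriov_numvfs file under the NIC's sysfs device directory. The sketch below applies the 16 VFs per NIC figure from this step. The function name and the sysfs_root parameter (useful for dry runs against a fake directory) are hypothetical, and on a real K8s node this needs root privileges and SR-IOV enabled in firmware.

```python
from pathlib import Path


def set_num_vfs(iface: str, num_vfs: int = 16,
                sysfs_root: str = "/sys/class/net") -> int:
    """Set the SR-IOV VF count for one NIC via sysfs and return the requested count."""
    device = Path(sysfs_root) / iface / "device"
    total = int((device / "sriov_totalvfs").read_text().strip())
    if not 0 <= num_vfs <= total:
        raise ValueError(f"{iface} supports at most {total} VFs, requested {num_vfs}")
    numvfs_file = device / "sriov_numvfs"
    current = int(numvfs_file.read_text().strip())
    # The kernel refuses to change a non-zero VF count directly, so reset to 0 first.
    if current not in (0, num_vfs):
        numvfs_file.write_text("0")
    numvfs_file.write_text(str(num_vfs))
    return num_vfs
```

For example, set_num_vfs("eth2") would request 16 VFs on a NIC named eth2 (a placeholder interface name). Slurm nodes skip this step because they use the physical NICs directly.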
STORAGE NETWORK
[cite: L8]
🗂 Shared Storage
NFS / Lustre
Checkpoints & Datasets
DESIGN 05 / 05 · HPC RoCE Design Patterns