DESIGN 02 / 05 · BARE METAL + SLURM + RoCE (HIGH-LEVEL HPC ARCHITECTURE)
Best for: Dedicated research clusters, traditional HPC workloads [cite: Overview]
4. ML/AI APPLICATIONS & TRAINING
[cite: L7, L12-L13, L15]
💻
[cite: L7]
Load Training Code
System-wide NCCL library
No containers; runs on the native OS
→
Native Training Code
GPU
GPU
→
🧮
[cite: L12]
Training Code Calls NCCL
Backward pass triggers
NCCL all-reduce
→
🔍
[cite: L12.5]
NCCL Rank Discovery
Via SLURM_PROCID
and SLURM_NTASKS
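The rank discovery step can be sketched as follows. This is a minimal illustration, not the diagram author's code: NCCL itself receives the rank and world size from the calling program, so the sketch shows how a launcher or training script might derive them from the Slurm task environment. `SLURM_PROCID` and `SLURM_NTASKS` are the variables named above; `SLURM_LOCALID` and the helper name `slurm_rank_info` are additions made here for illustration.

```python
import os


def slurm_rank_info(env=None):
    """Return (rank, world_size, local_rank) from Slurm's per-task variables."""
    env = os.environ if env is None else env
    rank = int(env["SLURM_PROCID"])        # global task index, 0-based
    world_size = int(env["SLURM_NTASKS"])  # total number of tasks in the job step
    # Assumption: SLURM_LOCALID (node-local task index) is not named in the
    # diagram; it is commonly used to choose which GPU on the node a task drives.
    local_rank = int(env.get("SLURM_LOCALID", 0))
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside world size {world_size}")
    return rank, world_size, local_rank


if __name__ == "__main__":
    # Hypothetical values for task 9 of a 16-task job (2 nodes x 8 GPUs).
    demo_env = {"SLURM_PROCID": "9", "SLURM_NTASKS": "16", "SLURM_LOCALID": "1"}
    print(slurm_rank_info(demo_env))
```

Passing the environment as a parameter keeps the function testable outside a Slurm allocation.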
→
🌐
[cite: L13]
Initialize Comm Topology
Ring or Tree all-reduce
topology across nodes
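To make the ring variant concrete, the sketch below simulates a ring all-reduce in plain Python, with lists standing in for per-rank gradient buffers. It illustrates the general algorithm (a reduce-scatter phase followed by an all-gather phase) and is not NCCL's implementation; the function name `ring_all_reduce` is hypothetical.

```python
def ring_all_reduce(buffers):
    """Simulate a sum all-reduce over a ring; returns one result list per rank."""
    n = len(buffers)
    size = len(buffers[0])
    if size % n != 0:
        raise ValueError("buffer length must be divisible by the number of ranks")
    chunk = size // n
    data = [list(b) for b in buffers]  # work on copies

    def span(c):
        c %= n  # chunk indices wrap around the ring
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1, reduce-scatter: in step s, rank r sends chunk (r - s) to its
    # right-hand neighbour, which adds it into its own copy of that chunk.
    for step in range(n - 1):
        sends = [(r, [data[r][i] for i in span(r - step)]) for r in range(n)]
        for r, payload in sends:
            dst = (r + 1) % n
            for i, v in zip(span(r - step), payload):
                data[dst][i] += v

    # Phase 2, all-gather: rank r now owns the finished chunk (r + 1) and the
    # finished chunks are passed around the ring, overwriting stale copies.
    for step in range(n - 1):
        sends = [(r, [data[r][i] for i in span(r + 1 - step)]) for r in range(n)]
        for r, payload in sends:
            dst = (r + 1) % n
            for i, v in zip(span(r + 1 - step), payload):
                data[dst][i] = v
    return data


if __name__ == "__main__":
    grads = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
    print(ring_all_reduce(grads))
```

Each rank sends only one chunk per step to a single neighbour, which is why the ring pattern keeps every link busy with equal traffic.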
→
✅
[cite: L15]
Training Job Completes
Checkpoint saved, code exits,
resources freed
⬇
3. SLURM JOB SCHEDULER
[cite: L6, L6.5, L9, L10]
📦
[cite: L6]
Install Slurm
slurmctld + slurmd
via packages or source
→
📊
[cite: L6.5]
Configure Slurm GRES
GPU resource tracking
Gres=gpu:a100:8
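A GRES entry of this form packs three fields into one string: the resource name, an optional site-chosen type label, and a per-node count. The small parser below spells that out. It is a sketch that handles only the `name:count` and `name:type:count` forms, and the names `Gres` and `parse_gres` are hypothetical.

```python
from typing import NamedTuple, Optional


class Gres(NamedTuple):
    name: str                # resource name, e.g. "gpu"
    gpu_type: Optional[str]  # site-defined type label, if present
    count: int               # number of devices per node


def parse_gres(spec: str) -> Gres:
    """Parse 'Gres=name[:type]:count' (the 'Gres=' prefix is optional)."""
    value = spec.split("=", 1)[1] if "=" in spec else spec
    parts = value.split(":")
    if len(parts) == 3:
        name, gpu_type, count = parts
    elif len(parts) == 2:
        name, count = parts
        gpu_type = None
    else:
        raise ValueError(f"unsupported GRES spec: {spec!r}")
    return Gres(name, gpu_type, int(count))


if __name__ == "__main__":
    print(parse_gres("Gres=gpu:a100:8"))
```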
→
🖥️
[cite: L6]
Central Controller
(slurmctld Master)
slurmd
slurmd
→
📋
[cite: L9]
Create SBATCH Job Scripts
Resource requirements
GPU count, nodes, time
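As an illustration of what such a job script carries, the sketch below renders one from Python. The `#SBATCH` options used (`--job-name`, `--nodes`, `--ntasks-per-node`, `--gres`, `--time`, `--exclusive`) are standard sbatch options; the helper name `render_sbatch`, the job name, and the `train.py` command are assumptions made for the example, and launching one task per GPU is a common convention rather than something the diagram specifies.

```python
def render_sbatch(job_name, nodes, gpus_per_node, time_limit, command):
    """Return the text of a minimal sbatch script for a multi-node GPU job."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        # Assumption: one task (process) per GPU.
        f"#SBATCH --ntasks-per-node={gpus_per_node}",
        f"#SBATCH --gres=gpu:{gpus_per_node}",
        f"#SBATCH --time={time_limit}",
        "#SBATCH --exclusive",
        "",
        # srun starts one copy of the command per task across the allocation.
        f"srun {command}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical job: 2 nodes x 8 GPUs for 4 hours.
    print(render_sbatch("train", 2, 8, "04:00:00", "python train.py"), end="")
```

The rendered text can be written to a file and submitted with `sbatch`.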
→
📌
[cite: L10]
Scheduler Allocates Nodes
Exclusive GPU + node
allocation per job
⬇
2. NETWORKING LAYER (RoCE FABRIC)
[cite: L5, L6, L11, L13.5, L14]
🖧
[cite: L5]
Configure Physical NIC IPs
800G NICs with IPs
No SR-IOV needed
→
🔀
[cite: L6]
Ethernet Switches
Leaf-Spine Topology
PFC + ECN + QoS
⟶
🔀
[cite: L11]
Allocate Physical Paths
Direct paths between
allocated nodes & GPUs
⟶
🔗
[cite: L13.5]
NCCL Creates RDMA Queue Pairs
Direct between
physical NICs
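NCCL reaches RoCE NICs through its InfiniBand verbs transport, so the `NCCL_IB_*` environment variables are what steer queue pair creation toward the physical NICs. The sketch below assembles such an environment. The variable names are real NCCL settings, but every value passed in the demo (the `mlx5_0` device name, GID index 3, the `eth0` interface) is a site-specific assumption that must be checked on the actual hosts; `nccl_roce_env` is a hypothetical helper.

```python
def nccl_roce_env(hca, gid_index, bootstrap_if):
    """Build environment variables that point NCCL at a RoCE-capable NIC."""
    return {
        "NCCL_IB_DISABLE": "0",               # keep the verbs (RDMA) transport enabled
        "NCCL_IB_HCA": hca,                   # RDMA device(s) NCCL may use
        "NCCL_IB_GID_INDEX": str(gid_index),  # GID table entry; selects the RoCE address type
        "NCCL_SOCKET_IFNAME": bootstrap_if,   # interface for NCCL's TCP bootstrap traffic
        "NCCL_DEBUG": "INFO",                 # log which transport NCCL actually selects
    }


if __name__ == "__main__":
    # Hypothetical values; confirm device names and GID index per host.
    for key, value in nccl_roce_env("mlx5_0", 3, "eth0").items():
        print(f"export {key}={value}")
```

With `NCCL_DEBUG=INFO`, the job log shows whether NCCL selected the RDMA path or fell back to TCP sockets.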
⟶
800 Gbps
800 Gbps
RDMA
⚡ DIRECT RDMA PATH
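A rough way to relate the 800 Gbps link rate to training is the bandwidth term of a ring all-reduce: each rank sends about 2(N-1)/N times the payload size. The estimate below is a back-of-the-envelope sketch that ignores latency, protocol overhead, and any faster intra-node paths; the 10 GB payload and 16 ranks in the demo are hypothetical figures.

```python
def ring_allreduce_seconds(payload_bytes, ranks, link_gbps=800.0):
    """Lower-bound time for a ring all-reduce limited only by link bandwidth."""
    if ranks < 2:
        return 0.0  # a single rank has nothing to exchange
    link_bytes_per_s = link_gbps * 1e9 / 8  # Gbit/s to bytes/s
    # Reduce-scatter plus all-gather each move (N-1)/N of the payload per rank.
    bytes_sent_per_rank = 2 * (ranks - 1) / ranks * payload_bytes
    return bytes_sent_per_rank / link_bytes_per_s


if __name__ == "__main__":
    # Hypothetical: 10 GB of gradients across 16 ranks at the 800 Gbps line rate.
    print(f"{ring_allreduce_seconds(10e9, 16):.4f} s")
```

For these inputs the bound comes to roughly 0.19 s per all-reduce; measured times will be higher.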
⬇
1. INFRASTRUCTURE & BARE METAL
[cite: L1-L3, L8]
🐧
[cite: L1]
Install Linux OS
Ubuntu 22.04 LTS
Directly on hardware
→
B300 Servers
NVIDIA B300
→
🎮
[cite: L3]
Install NVIDIA Drivers & CUDA
System-wide CUDA
All 8 GPUs per node
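After the driver install, a quick per-node check is to count the devices that `nvidia-smi -L` lists. The sketch below separates parsing from invocation so that the parser can be exercised without a GPU; the sample listing in the demo (product name and UUIDs) is made up for illustration, and the helper names are hypothetical.

```python
import re
import subprocess

# `nvidia-smi -L` prints one line per GPU, beginning with "GPU <index>: ".
GPU_LINE = re.compile(r"^GPU \d+: ", re.MULTILINE)


def count_gpus(listing: str) -> int:
    """Count GPU entries in the text output of `nvidia-smi -L`."""
    return len(GPU_LINE.findall(listing))


def node_has_expected_gpus(expected: int = 8) -> bool:
    """Run nvidia-smi on this node and compare against the expected count."""
    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=True
    )
    return count_gpus(result.stdout) == expected


if __name__ == "__main__":
    # Made-up sample output; real names and UUIDs will differ.
    sample = (
        "GPU 0: NVIDIA B300 (UUID: GPU-00000000-aaaa)\n"
        "GPU 1: NVIDIA B300 (UUID: GPU-00000000-bbbb)\n"
    )
    print(count_gpus(sample))
```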
→
🔌
[cite: L2]
Enable Physical NICs via BIOS
Single large physical NIC
No virtual functions
STORAGE NETWORK
[cite: L8]
🗂 Shared Storage
NFS / Lustre
Object Storage
Checkpoints & Datasets
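Because checkpoints land on shared storage that other nodes may read, one common precaution is to write to a temporary file and rename it into place, so a reader never sees a half-written checkpoint. The sketch below shows that pattern with the standard library only. It assumes the temporary file and the final path sit on the same filesystem; rename behaviour on a particular NFS or Lustre deployment should be verified, and `save_checkpoint_atomic` is a hypothetical name.

```python
import os
import tempfile


def save_checkpoint_atomic(payload: bytes, final_path: str) -> None:
    """Write payload to final_path via a temp file and an atomic rename."""
    directory = os.path.dirname(os.path.abspath(final_path))
    # Create the temp file next to the target so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())  # push the data to storage before renaming
        os.replace(tmp_path, final_path)  # readers see the old or the new file
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)  # do not leave partial files behind
        raise


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as shared_dir:  # stand-in for the shared mount
        target = os.path.join(shared_dir, "step_000100.ckpt")
        save_checkpoint_atomic(b"model-state-bytes", target)
        print(os.path.getsize(target))
```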
DESIGN 02 / 05 · BARE METAL + SLURM + RoCE · HPC RoCE Design Patterns