DESIGN 04 / 05 - BARE METAL + MPI + RoCE (HIGH-LEVEL HPC ARCHITECTURE)
Best for: HPC researchers, scientific computing, legacy MPI workloads [cite: Overview]
4. ML/AI APPLICATIONS & TRAINING
[cite: L9, L12-L14, L15]
🐍
[cite: L9]
MPI Training Code
PyTorch MPI backend
or pure mpi4py
→
Native MPI Process
GPU
GPU
→
🔄
[cite: L14]
MPI_Allreduce
Gradient sync across
all nodes via RDMA
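The gradient sync step can be illustrated without a cluster. The sketch below is a pure-Python reference for what an allreduce with a sum operator, followed by a divide by world size, produces: every rank ends up holding the same averaged gradient. The function name `allreduce_mean` and the list-of-lists layout are illustrative assumptions. In a real job the equivalent call would be something like mpi4py's `comm.Allreduce(sendbuf, recvbuf, op=MPI.SUM)`, with the data moving over RDMA rather than through Python lists.

```python
def allreduce_mean(per_rank_grads):
    """Reference semantics of MPI_Allreduce(SUM) followed by a divide.

    per_rank_grads: one gradient vector (list of floats) per rank.
    Returns the buffers as each rank would see them after the call.
    """
    world_size = len(per_rank_grads)
    length = len(per_rank_grads[0])
    if any(len(g) != length for g in per_rank_grads):
        raise ValueError("all ranks must contribute equal-length buffers")
    # Element-wise sum across ranks (the "reduce" half).
    summed = [sum(g[i] for g in per_rank_grads) for i in range(length)]
    # Average, then give every rank an identical copy (the "all" half).
    mean = [s / world_size for s in summed]
    return [list(mean) for _ in range(world_size)]


if __name__ == "__main__":
    # Four hypothetical ranks, two gradient elements each.
    grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    print(allreduce_mean(grads)[0])
```

Because every rank receives the same averaged buffer, each process can apply an identical optimizer step and the model replicas stay in sync.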
→
🔍
[cite: L12]
MPI Rank Discovery
COMM_WORLD.Get_rank()
MPI assigns 0 to N-1
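Besides `COMM_WORLD.Get_rank()`, Open MPI's `mpirun` also exports the rank to each process through environment variables, which can be handy for picking a GPU before MPI is initialized. A minimal sketch, assuming Open MPI (other MPI implementations use different variable names); the helper name `rank_info` is hypothetical.

```python
import os


def rank_info(environ=None):
    """Read rank, world size, and local rank as exported by Open MPI's mpirun.

    Falls back to a single-process layout (rank 0 of 1) when the
    variables are absent, e.g. when the script runs without mpirun.
    """
    env = os.environ if environ is None else environ
    rank = int(env.get("OMPI_COMM_WORLD_RANK", 0))
    size = int(env.get("OMPI_COMM_WORLD_SIZE", 1))
    local_rank = int(env.get("OMPI_COMM_WORLD_LOCAL_RANK", 0))
    if not 0 <= rank < size:
        raise ValueError(f"rank {rank} outside 0..{size - 1}")
    return rank, size, local_rank


if __name__ == "__main__":
    rank, size, local_rank = rank_info()
    # With one process per GPU, the local rank can double as the GPU index.
    print(f"rank {rank}/{size} -> local GPU {local_rank}")
```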
→
⚡
[cite: L13]
CUDA-Aware MPI Transfer
GPU memory direct
No CPU copy needed
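Direct GPU-memory transfers only work if the MPI library was built with CUDA support, and a distribution package may not be. For Open MPI this can be checked with `ompi_info --parsable --all | grep mpi_built_with_cuda_support:value`. The sketch below parses that output; the helper name `built_with_cuda_support` is hypothetical, and the sample line in `__main__` is an example rather than captured output.

```python
def built_with_cuda_support(ompi_info_text):
    """Interpret `ompi_info --parsable --all` output.

    Returns True or False when the build parameter is reported,
    or None when it does not appear in the text at all.
    """
    key = "mpi_built_with_cuda_support:value:"
    for line in ompi_info_text.splitlines():
        if key in line:
            # The value is the last colon-separated field on the line.
            return line.rsplit(":", 1)[1].strip().lower() == "true"
    return None


if __name__ == "__main__":
    sample = "mca:mpi:base:param:mpi_built_with_cuda_support:value:true"
    print(built_with_cuda_support(sample))
```

If the check reports False or None, GPU buffers would have to be staged through host memory, which defeats the "no CPU copy" path.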
→
✅
[cite: L15]
Training Completes
All MPI processes exit
Manual cleanup + save
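Since nothing orchestrates the job, the final save is the script's responsibility. One common pattern is to let only rank 0 write, and to write through a temporary file so an interrupted save never leaves a truncated checkpoint. A minimal sketch under those assumptions; `save_checkpoint` is a hypothetical helper and JSON stands in for whatever serializer the training code actually uses.

```python
import json
import os
import tempfile


def save_checkpoint(state, path, rank):
    """Write `state` as JSON from rank 0 only; other ranks do nothing.

    The data goes to a temporary file in the target directory and is
    then renamed over `path`, so readers never see a partial file.
    """
    if rank != 0:
        return False
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(state, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)  # rename within one filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        target = os.path.join(workdir, "final.json")
        print(save_checkpoint({"epoch": 3}, target, rank=0))
```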
⬇
3. MPI JOB LAUNCHER
[cite: L4, L10, L11, L12, L13]
🔑
[cite: L4]
SSH Passwordless Setup
mpirun uses SSH to
launch on remote nodes
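Before the first launch it helps to confirm that every node accepts key-based logins, because `mpirun` cannot answer a password prompt. The sketch below builds an `ssh` command that fails fast instead of prompting, using the standard OpenSSH options `BatchMode` and `ConnectTimeout`. The helper name `ssh_check_command` is hypothetical.

```python
import shlex


def ssh_check_command(host, timeout_s=5):
    """Build an ssh command that fails instead of prompting for a password."""
    return [
        "ssh",
        "-o", "BatchMode=yes",                # never fall back to a prompt
        "-o", f"ConnectTimeout={timeout_s}",  # do not hang on dead nodes
        host,
        "true",                               # trivial remote command
    ]


if __name__ == "__main__":
    for host in ["node01", "node02"]:
        print(shlex.join(ssh_check_command(host)))
```

A non-zero exit status from any of these commands points to a node where the key setup still needs attention.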
→
📋
[cite: L10]
Create MPI Hostfile
node01 slots=8
node02 slots=8
(one line per node; -np 32 needs 4 nodes)
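The hostfile is simple enough to generate and sanity-check from a node list. A minimal sketch, assuming the Open MPI `<host> slots=<n>` format shown above; both helper names are hypothetical. `total_slots` gives the largest `-np` value the hostfile supports without oversubscription.

```python
def make_hostfile(nodes, slots_per_node=8):
    """Render hostfile text: one `<host> slots=<n>` line per node."""
    if slots_per_node < 1:
        raise ValueError("slots_per_node must be positive")
    return "".join(f"{node} slots={slots_per_node}\n" for node in nodes)


def total_slots(hostfile_text):
    """Sum the slots= values across all non-comment lines."""
    total = 0
    for raw in hostfile_text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = dict(f.split("=", 1) for f in line.split()[1:] if "=" in f)
        if "slots" not in fields:
            raise ValueError(f"no slots= entry on line: {raw!r}")
        total += int(fields["slots"])
    return total


if __name__ == "__main__":
    text = make_hostfile(["node01", "node02"])
    print(text, end="")
    print("total slots:", total_slots(text))
```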
→
🚀
[cite: L11]
mpirun Launch Command
mpirun -np 32
--hostfile hostfile
--mca pml ucx
python train.py
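When the launch is scripted, building the command as an argument list avoids quoting mistakes. The sketch below reproduces the command above and optionally forwards environment variables with Open MPI's `-x` flag. The helper name `mpirun_command` and its defaults are illustrative assumptions.

```python
import shlex


def mpirun_command(np, hostfile="hostfile", script="train.py", env=None):
    """Assemble an Open MPI launch command as an argument list."""
    if np < 1:
        raise ValueError("np must be at least 1")
    cmd = ["mpirun", "-np", str(np), "--hostfile", hostfile,
           "--mca", "pml", "ucx"]
    for name, value in sorted((env or {}).items()):
        cmd += ["-x", f"{name}={value}"]  # -x exports a variable to every rank
    cmd += ["python", script]
    return cmd


if __name__ == "__main__":
    print(shlex.join(mpirun_command(32)))
```

The resulting list can be passed directly to `subprocess.run`, which sidesteps shell quoting entirely.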
→
🌐
[cite: L12]
MPI COMM_WORLD Init
Each process gets rank
World size = total GPUs
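The relationship between rank, node, and GPU can be written down as plain arithmetic. This sketch assumes one process per GPU and that ranks fill all slots on one node before moving to the next. That placement is an assumption about the mapping policy, not something the launcher guarantees in every configuration. Both function names are hypothetical.

```python
def world_size(num_nodes, gpus_per_node=8):
    """Total number of MPI processes when each GPU gets one process."""
    return num_nodes * gpus_per_node


def placement(rank, gpus_per_node=8):
    """Map a global rank to (node index, local GPU index)."""
    if rank < 0:
        raise ValueError("rank must be non-negative")
    node_index, local_gpu = divmod(rank, gpus_per_node)
    return node_index, local_gpu


if __name__ == "__main__":
    size = world_size(num_nodes=4)
    print("world size:", size)
    for rank in (0, 7, 8, size - 1):
        print(rank, placement(rank))
```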
→
📡
[cite: L13]
UCX Transport Selection
Auto-selects RDMA
over RoCE for large msgs
⬇
2. NETWORKING LAYER (RoCE FABRIC)
[cite: L6, L7, L13, L14]
⚙️
[cite: L7]
Install UCX Library
Unified Communication X
Enables RDMA for MPI
→
🔀
[cite: L6]
Ethernet Switches
Leaf-Spine
PFC + ECN + QoS
⟶
🖧
[cite: L5]
Physical NIC Paths
IPs + /etc/hosts entries
for all node hostnames
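The `/etc/hosts` entries can be generated from a starting address and the node list, which keeps every node's file identical. A minimal sketch using the standard `ipaddress` module; the helper name `hosts_entries` and the `10.0.0.11` starting address are hypothetical, and consecutive addressing is an assumption about the IP plan.

```python
import ipaddress


def hosts_entries(first_ip, hostnames):
    """Return /etc/hosts lines, assigning consecutive IPs to hostnames."""
    start = ipaddress.ip_address(first_ip)
    return "".join(
        f"{start + offset}\t{name}\n" for offset, name in enumerate(hostnames)
    )


if __name__ == "__main__":
    print(hosts_entries("10.0.0.11", ["node01", "node02"]), end="")
```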
⟶
🔗
[cite: L13]
UCX Selects RDMA via RoCE
UCX_TLS=rc,ud,sm,self,cuda_copy,cuda_ipc
Physical 800G NICs
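UCX is steered through environment variables, so the transport choice can be kept in one place and passed to the launcher. The sketch below assembles `UCX_TLS` and, optionally, `UCX_NET_DEVICES`. The helper name `ucx_env`, the device name `mlx5_0:1`, and the `KNOWN_TRANSPORTS` set (deliberately not exhaustive) are illustrative assumptions. The example list adds `self` and the CUDA transports on top of `rc,ud,sm`, since UCX generally needs those for loopback and for GPU buffers.

```python
# Common UCX transport names; not an exhaustive list.
KNOWN_TRANSPORTS = {
    "rc", "ud", "dc", "sm", "self", "tcp",
    "cuda_copy", "cuda_ipc", "gdr_copy",
}


def ucx_env(transports, net_devices=None):
    """Build the UCX environment variables for a job."""
    unknown = [t for t in transports if t not in KNOWN_TRANSPORTS]
    if unknown:
        raise ValueError(f"unrecognized transport(s): {unknown}")
    env = {"UCX_TLS": ",".join(transports)}
    if net_devices:
        # Restrict UCX to specific NIC ports, e.g. the RoCE-facing ones.
        env["UCX_NET_DEVICES"] = ",".join(net_devices)
    return env


if __name__ == "__main__":
    env = ucx_env(
        ["rc", "ud", "sm", "self", "cuda_copy", "cuda_ipc"],
        net_devices=["mlx5_0:1"],  # hypothetical device name
    )
    for name, value in env.items():
        print(f"{name}={value}")
```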
⟶
800 Gbps RDMA
⚡ MPI + UCX + RDMA PATH
⬇
1. INFRASTRUCTURE & BARE METAL
[cite: L1-L5, L8]
🐧
[cite: L1]
Install Linux OS
Ubuntu / RHEL
Bare metal, no containers
→
B300 Servers
→
🎮
[cite: L2]
Install NVIDIA Drivers & CUDA
System-wide CUDA
8 GPUs per server
→
📦
[cite: L3]
Install OpenMPI / IntelMPI
apt install openmpi-bin
mpirun --version
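Because every node must run the same MPI build, the output of `mpirun --version` is worth checking programmatically. A minimal sketch, assuming Open MPI's usual first line of the form `mpirun (Open MPI) X.Y.Z`; the helper name is hypothetical and the version string in `__main__` is only an example.

```python
import re


def parse_open_mpi_version(text):
    """Extract (major, minor, patch) from `mpirun --version` output.

    Returns None when the text does not look like Open MPI output,
    for instance when a different MPI implementation is installed.
    """
    match = re.search(r"\(Open MPI\)\s+(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


if __name__ == "__main__":
    print(parse_open_mpi_version("mpirun (Open MPI) 4.1.2"))
```

Collecting this tuple from each node over SSH and comparing the results is a quick way to catch a node that was missed during installation.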
→
🔌
[cite: L5]
Configure Physical NICs
800G NICs + IPs
No SR-IOV, no VFs
STORAGE NETWORK
[cite: L8]
🗂 Shared Storage
NFS / Lustre
Checkpoints & Datasets
DESIGN 04 / 05 · HPC RoCE Design Patterns