DESIGN 04 / 05

Bare Metal + MPI + RoCE

Lowest Level · Maximum Control · Parallel Computing

Overview: The most bare-bones approach. No Kubernetes, no Slurm, no NCCL: just MPI (Message Passing Interface) running directly on bare metal servers, with RoCE (RDMA over Converged Ethernet) for fast networking. MPI is the long-established HPC message-passing standard and predates GPU clusters. Users launch jobs manually with mpirun across nodes. This gives maximum control but requires deep MPI expertise.

Best for: HPC researchers, scientific computing, teams already using MPI for non-GPU workloads, legacy HPC migrations.
Trade-offs: No automatic scheduling, manual job management, requires MPI expertise, collectives less GPU-optimized than NCCL's.
Best For
✦ HPC researchers and scientists
✦ Scientific computing workloads
✦ Teams already using MPI for non-GPU work
✦ Legacy HPC migrations to GPU
✦ Maximum hardware control requirements
Trade-Offs
⚠ No automatic scheduling — fully manual
⚠ Requires deep MPI expertise
⚠ Collectives less GPU-optimized than NCCL's
⚠ No job queuing or fair-share
⚠ Hardest to manage at scale
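
To make "fully manual" concrete, here is a minimal sketch of what a hand-rolled launch might involve: writing a hostfile and assembling an mpirun command line. It assumes Open MPI (its "hostname slots=N" hostfile format and its -np and --hostfile options); other MPI implementations use different hostfile syntax and flags. The node names, the hosts.txt path, the ./train binary, and the 8-ranks-per-node figure are hypothetical placeholders. The script only prints the command and does not run it.

```python
import shlex


def build_hostfile(hosts, slots_per_host):
    """Return Open MPI hostfile text: one 'hostname slots=N' line per node."""
    return "".join(f"{host} slots={slots_per_host}\n" for host in hosts)


def build_mpirun_command(hostfile_path, total_ranks, program, program_args=()):
    """Return the argv list for an Open MPI launch. Nothing is executed here."""
    return [
        "mpirun",
        "-np", str(total_ranks),        # total number of MPI ranks to start
        "--hostfile", hostfile_path,    # which nodes to use, and slots on each
        program,
        *program_args,
    ]


# Hypothetical two-node cluster; assumes one rank per GPU and 8 GPUs per node.
hosts = ["node01", "node02"]
slots = 8

hostfile_text = build_hostfile(hosts, slots)
cmd = build_mpirun_command("hosts.txt", len(hosts) * slots, "./train")

print(hostfile_text, end="")
print(shlex.join(cmd))
```

With no scheduler in the loop, someone has to keep the host list current, check that the nodes are free, and rerun the command after a failure. Selecting the RoCE transport is specific to the MPI implementation and the fabric, so it is left out of this sketch.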