DESIGN 02 / 05 — Bare Metal + Slurm + RoCE

Overview

A traditional HPC approach with no containers or Kubernetes. Slurm schedules jobs directly on bare-metal servers, and NCCL and CUDA are installed system-wide. Jobs use the physical NICs directly, with no SR-IOV virtual functions. RoCE (RDMA over Converged Ethernet) provides the RDMA fabric, which is expected to be configured as lossless. Setup is simpler, but resource sharing is less flexible.
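
As a rough illustration of how a job might be launched in this design, the sketch below renders a Slurm batch script that exports a few NCCL settings for an RDMA fabric and starts one rank per GPU with srun. The function name render_sbatch, the job name, the device names mlx5_0 and mlx5_1, the GID index 3, and the training command are all assumptions made for the example; the right values depend on the actual NICs and fabric configuration.

```python
def render_sbatch(nodes, gpus_per_node, rdma_devices, gid_index, train_cmd):
    """Return the text of a Slurm batch script for a multi-node NCCL job.

    All argument values are supplied by the caller; nothing here is
    discovered from the system.
    """
    lines = [
        "#!/bin/bash",
        "#SBATCH --job-name=nccl-train",               # hypothetical job name
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks-per-node={gpus_per_node}",  # one rank per GPU
        f"#SBATCH --gres=gpu:{gpus_per_node}",
        "#SBATCH --exclusive",                         # whole nodes, no sharing
        "",
        # Point NCCL at the physical RDMA devices (no SR-IOV virtual functions).
        f"export NCCL_IB_HCA={','.join(rdma_devices)}",
        # GID index is site-specific; the value passed in is an assumption.
        f"export NCCL_IB_GID_INDEX={gid_index}",
        "export NCCL_DEBUG=INFO",
        "",
        f"srun {train_cmd}",
    ]
    return "\n".join(lines) + "\n"


script = render_sbatch(
    nodes=2,
    gpus_per_node=8,
    rdma_devices=["mlx5_0", "mlx5_1"],  # hypothetical device names
    gid_index=3,                        # assumed value, check your fabric
    train_cmd="python train.py",        # hypothetical training entry point
)
print(script)
```

The generated text could be saved to a file and submitted with sbatch. Because CUDA and NCCL are installed system-wide, the script needs no container image or runtime setup.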

Best For

✦ Dedicated research clusters
✦ Academic HPC centers
✦ Teams preferring traditional HPC workflows
✦ Single-tenant dedicated clusters
✦ Teams with existing Slurm expertise

Trade-Offs

⚠ Less multi-tenancy support
⚠ Manual resource management
⚠ Harder to share GPUs between users
⚠ No Kubernetes-style automatic pod scheduling
⚠ All software must be installed system-wide
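
One practical consequence of system-wide installs and manual management is that software versions can drift between nodes. The sketch below shows one possible consistency check: given a per-node inventory of component versions, it reports nodes that differ from the majority. The function name find_version_drift, the node names, and the version strings are hypothetical sample data, and how the inventory is collected (for example by a site script run on each node) is left as an assumption.

```python
from collections import Counter


def find_version_drift(node_versions):
    """Report nodes whose component versions differ from the majority.

    node_versions maps node name -> {component: version string}.
    Returns {node: {component: (found, expected)}}; empty if all agree.
    """
    components = sorted({c for v in node_versions.values() for c in v})
    drift = {}
    for comp in components:
        # Treat the most common version of each component as the expected one.
        counts = Counter(v.get(comp) for v in node_versions.values())
        expected = counts.most_common(1)[0][0]
        for node, versions in node_versions.items():
            found = versions.get(comp)
            if found != expected:
                drift.setdefault(node, {})[comp] = (found, expected)
    return drift


# Hypothetical inventory; real values would come from the nodes themselves.
inventory = {
    "node01": {"cuda": "12.4", "nccl": "2.21.5"},
    "node02": {"cuda": "12.4", "nccl": "2.21.5"},
    "node03": {"cuda": "12.2", "nccl": "2.21.5"},
}
drift = find_version_drift(inventory)
print(drift)
```

Using the majority version as the reference is a simplification; a site with a pinned software baseline would more likely compare against that baseline instead.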