
Design 2 — Bare Metal + Slurm + RoCE

The traditional HPC blueprint. No containers, no Kubernetes — Slurm submits jobs directly to bare-metal nodes. Highest performance ceiling and the simplest mental model, at the cost of multi-tenancy and dynamic scheduling.

Best for: Performance-sensitive single-tenant workloads. Research labs, weather, physics, training runs that pin the whole cluster.

Trade-offs: No isolation between users. No container portability. Manual environment management.

After this page, you'll be able to
  1. Place the bare-metal + Slurm + RoCE pattern: when running without containers or Kubernetes is the right call, and why national labs run it for performance-sensitive single-tenant jobs.
  2. Name the trade-offs you're accepting — no inter-user isolation, no container portability, and manual environment management in exchange for the highest performance ceiling and simplest mental model.
  3. Walk the 15-layer build — from bare-metal nodes up through RoCE and Slurm job submission, using the interactive architecture and build-step diagrams.

Architecture

Build steps — the 15 layers

When to pick this design

Pick this when:

  • You're a classic HPC shop — research lab, weather, physics — where one big synchronous job pins the whole cluster and raw performance is the goal.
  • You want the simplest mental model and the highest performance ceiling, with no container runtime or CNI between your code and the NIC.
  • A single team owns the hardware and manages the environment by hand (modules, MPI builds, drivers) without needing self-service.
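
To make the "one big job pins the whole cluster" case concrete, here is a minimal Python sketch that renders a Slurm batch script for such a run. The `WholeClusterJob` class, the `render_sbatch` helper, the job name, the node counts, and the `./simulate` command are all hypothetical; only the `#SBATCH` options (`--job-name`, `--nodes`, `--ntasks-per-node`, `--time`, `--exclusive`) and `srun` are standard Slurm. Treat it as an illustration of the shape of a bare-metal submission, not as a recommended configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WholeClusterJob:
    """Parameters for one synchronous job that takes every node (all values hypothetical)."""
    name: str
    nodes: int
    tasks_per_node: int
    time_limit: str  # Slurm time format, e.g. "24:00:00"
    command: str     # binary launched directly on bare metal, no container wrapper


def render_sbatch(job: WholeClusterJob) -> str:
    """Return the text of a batch script suitable for `sbatch`."""
    if job.nodes < 1 or job.tasks_per_node < 1:
        raise ValueError("nodes and tasks_per_node must be positive")
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job.name}",
        f"#SBATCH --nodes={job.nodes}",
        f"#SBATCH --ntasks-per-node={job.tasks_per_node}",
        f"#SBATCH --time={job.time_limit}",
        "#SBATCH --exclusive",  # do not share the allocated nodes with other jobs
        "",
        "# srun starts the tasks directly on the allocated nodes",
        f"srun {job.command}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical 64-node run with 8 tasks per node.
script = render_sbatch(
    WholeClusterJob("climate-run", 64, 8, "24:00:00", "./simulate --config run.toml")
)
print(script)
```

Nothing sits between `srun` and the binary here: no image pull, no container runtime, no CNI. That is the simplicity this design trades flexibility for.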

Avoid it when:

  • You need multi-tenancy or isolation between users. Slurm partitions decide which nodes a job may use, but bare metal gives you no container-level isolation. Reach for Design 1.
  • You want portable, reproducible environments — without containers, every node's software stack is yours to keep in sync manually.
  • You also run inference, dashboards, or CI that want cloud-native orchestration — that's the hybrid case (Design 5).
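
Keeping every node's software stack in sync by hand usually means checking for drift yourself. The sketch below assumes you have already collected a per-node manifest of component versions (how you gather it is up to you) and flags any node that disagrees with the majority. The `find_drift` function, the node names, the component names, and the version strings are all hypothetical.

```python
from collections import Counter


def find_drift(manifests: dict[str, dict[str, str]]) -> dict[str, dict[str, str]]:
    """Return, per node, the components whose version differs from the majority version."""
    components = sorted({c for m in manifests.values() for c in m})
    drift: dict[str, dict[str, str]] = {}
    for component in components:
        # A node that does not report the component at all counts as "<missing>".
        versions = {node: m.get(component, "<missing>") for node, m in manifests.items()}
        majority, _ = Counter(versions.values()).most_common(1)[0]
        for node, version in versions.items():
            if version != majority:
                drift.setdefault(node, {})[component] = f"{version} (majority: {majority})"
    return drift


# Hypothetical manifests: node03 is running an older NIC driver.
manifests = {
    "node01": {"nic_driver": "5.8", "mpi": "4.1.6"},
    "node02": {"nic_driver": "5.8", "mpi": "4.1.6"},
    "node03": {"nic_driver": "5.7", "mpi": "4.1.6"},
}
drift = find_drift(manifests)
print(drift)  # {'node03': {'nic_driver': '5.7 (majority: 5.8)'}}
```

Majority voting is only one possible policy; a pinned "golden" manifest would work equally well and avoids ambiguity when versions are evenly split.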

💡 What you should remember

| # | Concept | Why it matters |
|---|---------|----------------|
| 1 | 🏔️ No containers, no Kubernetes: Slurm schedules straight onto bare metal | Stripping the orchestration layers gives the highest performance ceiling and the fewest things between your job and the RoCE NIC. |
| 2 | 👤 Single-tenant by design | Slurm queues jobs, but there's no container isolation; one big run pins the cluster, which is exactly the model national labs want. |
| 3 | 🔧 Environment management is manual | Drivers, MPI builds, and modules are yours to keep consistent across nodes; you trade portability for control. |
| 4 | 📐 Simplest stack to reason about | Fewer layers means fewer places to break; when peak performance matters more than flexibility, that simplicity is the feature. |

What's next