
Design 4 — Bare Metal + MPI + RoCE

The simplest possible setup. No scheduler, no orchestrator — mpirun directly across a known list of hosts. RoCE handles RDMA. Closer to a lab cluster than a production system, but useful as a baseline.
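As a rough illustration of "mpirun directly across a known list of hosts", the sketch below renders an Open MPI-style hostfile and assembles the launch command without running it. The host names, slot counts, the RoCE device name `mlx5_0:1`, and the binary `./my_mpi_app` are all hypothetical placeholders. The flags (`--hostfile`, `-np`, `--mca pml ucx`, `-x`) are Open MPI's; other MPI implementations spell them differently, and choosing UCX as the transport is an assumption, not something this design requires.

```python
import shlex

# Hypothetical hosts and slot counts; replace with your own machines.
HOSTS = {"node01": 8, "node02": 8}


def hostfile_text(hosts):
    """Render an Open MPI-style hostfile: one 'name slots=N' line per host."""
    return "".join(f"{name} slots={slots}\n" for name, slots in hosts.items())


def mpirun_argv(hostfile_path, hosts, app, env=None):
    """Build (but do not run) an mpirun command line for Open MPI."""
    total_ranks = sum(hosts.values())
    argv = ["mpirun", "--hostfile", hostfile_path, "-np", str(total_ranks)]
    # Assumption: UCX is the transport layer that drives the RoCE NICs.
    argv += ["--mca", "pml", "ucx"]
    for key, value in (env or {}).items():
        argv += ["-x", f"{key}={value}"]  # export the variable to every rank
    return argv + [app]


if __name__ == "__main__":
    # 'mlx5_0:1' is a placeholder device:port, not a value from this design.
    cmd = mpirun_argv(
        "hosts.txt", HOSTS, "./my_mpi_app", {"UCX_NET_DEVICES": "mlx5_0:1"}
    )
    print(hostfile_text(HOSTS), end="")
    print(shlex.join(cmd))
```

Everything a scheduler would normally decide (which hosts, how many ranks, which NIC) appears here as a literal you typed yourself, which is the whole point of this design.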

Best for: Small dedicated clusters, benchmark rigs, and any setup where the workload is one job at a time and you control the host list manually.

Trade-offs: No queueing, no fairness, no automated fault recovery. It breaks down as soon as a second user needs the cluster.

After this page, you'll be able to
  1. Walk the 15-layer stack for this design and name where it breaks — the leanest of the five, where the MPI launcher (mpirun over PMIx) replaces both a scheduler and an orchestrator and talks straight to RoCE.
  2. See what maximum control costs you — no queueing, no fairness, no fault recovery; you own the host list and every process placement by hand.
  3. Decide when minimum abstraction is the right call — pick it for research and tightly-coupled MPI codes where you want nothing between you and the wire, and avoid it the moment you need scheduling, isolation, or self-service.
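To make "every process placement by hand" concrete, here is a minimal sketch of the decision you take over from a scheduler: which host each rank lands on. The function `place_ranks` and its two policies are hypothetical names for illustration. They roughly mirror the block and round-robin layouts that Open MPI exposes through `--map-by slot` and `--map-by node`, but this is plain Python, not the launcher's own logic.

```python
def place_ranks(hosts, slots_per_host, n_ranks, policy="block"):
    """Return a list where index = MPI rank and value = host name.

    'block' fills each host before moving to the next one.
    'cyclic' deals ranks round-robin across the hosts.
    """
    if not hosts:
        raise ValueError("host list is empty")
    capacity = len(hosts) * slots_per_host
    if n_ranks > capacity:
        raise ValueError(f"{n_ranks} ranks do not fit in {capacity} slots")
    if policy == "block":
        return [hosts[rank // slots_per_host] for rank in range(n_ranks)]
    if policy == "cyclic":
        return [hosts[rank % len(hosts)] for rank in range(n_ranks)]
    raise ValueError(f"unknown policy: {policy}")


if __name__ == "__main__":
    hosts = ["node01", "node02"]  # hypothetical host names
    print(place_ranks(hosts, 2, 4, "block"))
    # ['node01', 'node01', 'node02', 'node02']
    print(place_ranks(hosts, 2, 4, "cyclic"))
    # ['node01', 'node02', 'node01', 'node02']
```

Block placement keeps neighbouring ranks on the same host, while cyclic placement spreads them across the network. With no scheduler in the stack, nothing makes that choice for you or checks that it suits the job.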

Architecture

Build steps — the 15 layers

When to pick this design

Pick this when:

  • You're doing research or running tightly-coupled MPI codes and want the lowest-level setup possible — mpirun straight across a known host list, nothing in between.
  • The cluster runs one job at a time and you control the host list manually, so a scheduler would just be overhead.
  • You want a clean benchmark baseline — it's what most NCCL and MPI tutorials assume underneath.

Avoid it when:

  • More than one user shares the cluster — there's no queueing, no fairness, no isolation, and it falls over the moment two jobs compete.
  • You need automated fault recovery or job restart. By default, one dead rank takes the whole mpirun job down with it.
  • You want self-service or declarative scheduling — that's Slurm (Design 2) or Kubernetes (Designs 1 and 3).
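The missing fault recovery can be sketched as follows: with no scheduler to requeue a failed job, any restart logic is a wrapper you write yourself. The helper `run_with_restarts` is a hypothetical name, and the example runs a stand-in Python command rather than a real `mpirun` invocation so that it works without MPI installed. Note that it relaunches the whole job from the start; resuming from a checkpoint would be the application's responsibility.

```python
import subprocess
import sys


def run_with_restarts(argv, max_attempts=3):
    """Run a launcher command, relaunching the whole job on any nonzero exit.

    Returns (attempts_used, last_exit_code).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    code = 0
    for attempt in range(1, max_attempts + 1):
        code = subprocess.run(argv).returncode
        if code == 0:
            return attempt, 0
        print(f"attempt {attempt} exited with {code}", file=sys.stderr)
    return max_attempts, code


if __name__ == "__main__":
    # Stand-in for an mpirun command line (assumption for illustration).
    fake_job = [sys.executable, "-c", "raise SystemExit(0)"]
    print(run_with_restarts(fake_job))
```

A wrapper like this only covers the simplest case. It does not detect which host failed, remove it from the host list, or arbitrate between users, which is the work Slurm or Kubernetes does in the other designs.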

💡 What you should remember

| # | Concept | Why it matters |
| --- | --- | --- |
| 1 | 🏁 mpirun is the entire control plane | The MPI launcher (over PMIx) replaces both scheduler and orchestrator: minimum abstraction, maximum control. |
| 2 | 📝 You own the host list by hand | No discovery, no scheduling. You place ranks yourself, which is fine for one job and unmanageable for many. |
| 3 | 🚫 No queueing, fairness, or fault recovery | A single user gets full control; a second user gets a collision. This is a lab baseline, not a shared production system. |
| 4 | 🔬 The reference baseline | It's what NCCL and MPI tutorials assume underneath. Understand this and the heavier designs are just orchestration layered on top. |

What's next