Design 4 — Bare Metal + MPI + RoCE
The simplest possible setup. No scheduler, no orchestrator — mpirun directly across a known list of hosts. RoCE handles RDMA. Closer to a lab cluster than a production system, but useful as a baseline.
Best for: Small dedicated clusters, benchmark rigs, anywhere the workload is one job at a time and you control the host list manually.
Trade-offs: No queueing, no fairness, no automated fault recovery. Falls over the moment you have more than one user.
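As a rough sketch of what "mpirun directly across a known list of hosts" looks like in practice, the Python below renders a hostfile and assembles the launch command. It assumes Open MPI conventions (`slots=` hostfile entries and the `--hostfile`, `-np`, and `-x` options). The host names, slot counts, file name, environment value, and `./allreduce_bench` binary are hypothetical placeholders, not this design's reference configuration.

```python
import shlex

# Hypothetical host list: (hostname, ranks to run on that host).
HOSTS = [("node01", 8), ("node02", 8)]


def render_hostfile(hosts):
    """Return hostfile text in Open MPI's 'name slots=N' form (assumed launcher)."""
    return "".join(f"{name} slots={slots}\n" for name, slots in hosts)


def build_mpirun_argv(hostfile_path, hosts, program, env=None):
    """Assemble an mpirun argv that starts one rank per declared slot."""
    total_ranks = sum(slots for _, slots in hosts)
    argv = ["mpirun", "--hostfile", hostfile_path, "-np", str(total_ranks)]
    for key, value in sorted((env or {}).items()):
        argv += ["-x", f"{key}={value}"]  # export a variable to every rank
    return argv + list(program)


hostfile_text = render_hostfile(HOSTS)
argv = build_mpirun_argv(
    "hosts.txt",
    HOSTS,
    ["./allreduce_bench"],        # hypothetical benchmark binary
    env={"NCCL_DEBUG": "WARN"},   # illustrative value only
)
command_line = shlex.join(argv)
print(hostfile_text, end="")
print(command_line)
```

Writing `hostfile_text` to `hosts.txt` and running the printed command is the whole launch step in this design; there is no other control plane to configure.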
After this page, you'll be able to
- Walk the 15-layer stack for this design and name where it breaks: it is the leanest of the five, where the MPI launcher (`mpirun` over PMIx) replaces both a scheduler and an orchestrator and talks straight to RoCE.
- See what maximum control costs you: no queueing, no fairness, no fault recovery; you own the host list and every process placement by hand.
- Decide when minimum abstraction is the right call: pick it for research and tightly-coupled MPI codes where you want nothing between you and the wire, and avoid it the moment you need scheduling, isolation, or self-service.
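To make "every process placement by hand" concrete, here is a minimal sketch of the bookkeeping a scheduler would otherwise do for you. It fills hosts in list order (a by-slot policy, assumed here for illustration; your launcher's default mapping may differ). `place_ranks`, the host names, and the slot counts are hypothetical.

```python
def place_ranks(hosts, n_ranks):
    """Block placement: fill each host's slots in list order.

    hosts   -- list of (hostname, slots) pairs, in the order you wrote them
    n_ranks -- total number of MPI ranks to place
    Returns a list where index i is the host that runs rank i.
    """
    capacity = sum(slots for _, slots in hosts)
    if n_ranks > capacity:
        # Nothing queues the job for later; the operator has to notice.
        raise ValueError(f"{n_ranks} ranks requested, only {capacity} slots listed")
    placement = []
    for name, slots in hosts:
        take = min(slots, n_ranks - len(placement))
        placement.extend([name] * take)
    return placement


# Hypothetical two-node cluster with 4 slots each.
hosts = [("node01", 4), ("node02", 4)]
placement = place_ranks(hosts, 6)
for rank, host in enumerate(placement):
    print(f"rank {rank} -> {host}")
```

With one job this table is easy to keep in your head. With several concurrent jobs, someone has to maintain it across all of them, which is the work a scheduler exists to do.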
Architecture
Build steps — the 15 layers
When to pick this design
Pick this when:
- You're doing research or running tightly-coupled MPI codes and want the lowest-level setup possible: `mpirun` straight across a known host list, nothing in between.
- The cluster runs one job at a time and you control the host list manually, so a scheduler would just be overhead.
- You want a clean benchmark baseline: it's what most NCCL and MPI tutorials assume underneath.
Avoid it when:
- More than one user shares the cluster: there's no queueing, no fairness, no isolation, and it falls over the moment two jobs compete.
- You need automated fault recovery or job restart: a dead rank takes the whole `mpirun` down with it.
- You want self-service or declarative scheduling: that's Slurm (Design 2) or Kubernetes (Designs 1 and 3).
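Without a scheduler, "job restart" means a wrapper you write yourself. The sketch below reruns an entire launch command when it exits nonzero, assuming (as is typical, but verify for your MPI) that `mpirun` returns a nonzero status once any rank dies. `run_with_retries` is a hypothetical helper, and the stand-in commands replace a real `mpirun` invocation so the example runs on any machine.

```python
import subprocess
import sys


def run_with_retries(argv, max_attempts=3):
    """Run a launch command, rerunning the whole job if it exits nonzero.

    Returns (exit_code, attempts_used). There is no per-rank recovery:
    each retry restarts every rank from the beginning.
    """
    for attempt in range(1, max_attempts + 1):
        code = subprocess.run(argv).returncode
        if code == 0:
            return 0, attempt
        print(f"attempt {attempt} failed with exit code {code}", file=sys.stderr)
    return code, max_attempts


# Stand-ins for an mpirun command so the sketch runs anywhere.
failing_job = [sys.executable, "-c", "import sys; sys.exit(1)"]
passing_job = [sys.executable, "-c", "pass"]

failed = run_with_retries(failing_job, max_attempts=2)
passed = run_with_retries(passing_job)
print(failed, passed)
```

The wrapper only restarts; it does not recover. A real job would also need application-level checkpoints to avoid redoing all of its work on every retry.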
💡 What you should remember
| # | | Concept | Why it matters |
|---|---|---|---|
| 1 | 🏁 | mpirun is the entire control plane | The MPI launcher (over PMIx) replaces both scheduler and orchestrator — minimum abstraction, maximum control. |
| 2 | 📝 | You own the host list by hand | No discovery, no scheduling — you place ranks yourself, which is fine for one job and unmanageable for many. |
| 3 | 🚫 | No queueing, fairness, or fault recovery | A single user gets full control; a second user gets a collision. This is a lab baseline, not a shared production system. |
| 4 | 🔬 | The reference baseline | It's what NCCL and MPI tutorials assume underneath — understand this and the heavier designs are just orchestration layered on top. |
What's next
- Design 1 — Kubernetes + SR-IOV + RoCE — the flexible multi-tenant production design.
- Design 2 — Bare Metal + Slurm + RoCE — add a scheduler without leaving bare metal.
- Design 3 — Kubernetes + Physical NIC + RoCE — K8s orchestration, no SR-IOV.
- Design 5 — Hybrid: K8s + Slurm + RoCE — real-world large operator.