Design 4 — Bare Metal + MPI + RoCE

The simplest possible setup. No scheduler, no orchestrator — mpirun directly across a known list of hosts. RoCE handles RDMA. Closer to a lab cluster than a production system, but useful as a baseline.

Best for: Small dedicated clusters, benchmark rigs, anywhere the workload is one job at a time and you control the host list manually.

Trade-offs: No queueing, no fairness, no automated fault recovery. Falls over the moment you have more than one user.
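
To make "mpirun directly across a known list of hosts" concrete, here is a minimal sketch that writes out a hostfile and assembles the launch command without running it. It assumes Open MPI-style flags (`--hostfile`, `-np`, `-x`); other MPI implementations spell these differently. The host names, the `hosts.txt` path, and `./my_mpi_app` are hypothetical placeholders, and `render_hostfile` / `build_mpirun_argv` are illustrative helpers, not part of any library.

```python
from typing import Dict, List, Optional


def render_hostfile(hosts: List[str], slots_per_host: int) -> str:
    """Return hostfile text: one 'hostname slots=N' line per host."""
    return "".join(f"{host} slots={slots_per_host}\n" for host in hosts)


def build_mpirun_argv(
    hostfile_path: str,
    total_ranks: int,
    program: List[str],
    env: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Assemble an mpirun command line as an argv list (nothing is executed)."""
    argv = ["mpirun", "--hostfile", hostfile_path, "-np", str(total_ranks)]
    # -x NAME=value exports an environment variable to every launched rank.
    for name, value in sorted((env or {}).items()):
        argv += ["-x", f"{name}={value}"]
    return argv + program


# Hypothetical two-node rig with 8 ranks per node.
hosts = ["node01", "node02"]
hostfile_text = render_hostfile(hosts, slots_per_host=8)
argv = build_mpirun_argv(
    "hosts.txt", 16, ["./my_mpi_app"], env={"NCCL_DEBUG": "INFO"}
)

print(hostfile_text, end="")
print(" ".join(argv))
```

The point of the sketch is how little there is: the host list is a text file you maintain yourself, and the whole "control plane" is one command line.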

After this page, you'll be able to

Walk the 15-layer stack for this design and name where it breaks. It is the leanest of the five: the MPI launcher (mpirun over PMIx) replaces both a scheduler and an orchestrator, and the ranks it starts talk straight to RoCE.
See what maximum control costs you — no queueing, no fairness, no fault recovery; you own the host list and every process placement by hand.
Decide when minimum abstraction is the right call — pick it for research and tightly-coupled MPI codes where you want nothing between you and the wire, and avoid it the moment you need scheduling, isolation, or self-service.

Architecture

Build steps — the 15 layers

When to pick this design

Pick this when:

You're doing research or running tightly-coupled MPI codes and want the lowest-level setup possible — mpirun straight across a known host list, nothing in between.
The cluster runs one job at a time and you control the host list manually, so a scheduler would just be overhead.
You want a clean benchmark baseline — it's what most NCCL and MPI tutorials assume underneath.

Avoid it when:

More than one user shares the cluster — there's no queueing, no fairness, no isolation, and it falls over the moment two jobs compete.
You need automated fault recovery or job restart — a dead rank takes the whole mpirun down with it.
You want self-service or declarative scheduling — that's Slurm (Design 2) or Kubernetes (Designs 1 and 3).
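
The "two jobs compete" failure mode can be sketched as a toy model. With no scheduler, each launch claims ranks on hosts independently and nothing arbitrates between them, so the only check is one you run yourself. The function below is a hypothetical illustration, not something any MPI launcher provides: it sums the ranks each launch asks for per host and reports where demand exceeds the slots you listed.

```python
from collections import Counter
from typing import Dict, List


def oversubscribed_hosts(
    slots: Dict[str, int], launches: List[Dict[str, int]]
) -> Dict[str, int]:
    """Return {host: ranks beyond its slots} for hosts claimed past capacity."""
    demand: Counter = Counter()
    for launch in launches:
        demand.update(launch)  # adds this launch's ranks per host
    return {
        host: demand[host] - slots.get(host, 0)
        for host in sorted(demand)
        if demand[host] > slots.get(host, 0)
    }


# Hypothetical cluster: two nodes, 8 slots each.
slots = {"node01": 8, "node02": 8}
first_user = {"node01": 8, "node02": 8}   # fills the cluster
second_user = {"node02": 8}               # launches anyway; nothing stops it

print(oversubscribed_hosts(slots, [first_user]))
print(oversubscribed_hosts(slots, [first_user, second_user]))
```

A scheduler would have queued the second launch. Here it simply lands on top of the first, which is the collision the list above warns about.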

💡 What you should remember

| # | | Concept | Why it matters |
|---|---|---|---|
| 1 | 🏁 | `mpirun` is the entire control plane | The MPI launcher (over PMIx) replaces both scheduler and orchestrator: minimum abstraction, maximum control. |
| 2 | 📝 | You own the host list by hand | No discovery, no scheduling. You place ranks yourself, which is fine for one job and unmanageable for many. |
| 3 | 🚫 | No queueing, fairness, or fault recovery | A single user gets full control; a second user gets a collision. This is a lab baseline, not a shared production system. |
| 4 | 🔬 | The reference baseline | It's what NCCL and MPI tutorials assume underneath. Understand this and the heavier designs are just orchestration layered on top. |
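
"You place ranks yourself" can be pictured with a small model of rank placement. The sketch below assumes a fill-first policy (use every slot on one host before moving to the next), chosen purely for illustration; real launchers offer several mapping policies. `place_ranks` and the host names are hypothetical.

```python
from typing import Dict, List


def place_ranks(
    hosts: List[str], slots_per_host: int, total_ranks: int
) -> Dict[int, str]:
    """Map rank -> host by filling each host's slots before the next host."""
    capacity = len(hosts) * slots_per_host
    if total_ranks > capacity:
        raise ValueError(
            f"{total_ranks} ranks requested but only {capacity} slots listed"
        )
    return {rank: hosts[rank // slots_per_host] for rank in range(total_ranks)}


# Hypothetical rig: two nodes with 4 slots each, 6 ranks to place.
placement = place_ranks(["node01", "node02"], slots_per_host=4, total_ranks=6)
for rank, host in placement.items():
    print(rank, host)
```

With one job this mapping is easy to reason about. With many jobs, keeping such mappings from overlapping is the work a scheduler would otherwise do for you.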

What's next

Design 1 — Kubernetes + SR-IOV + RoCE — the flexible multi-tenant production design.
Design 2 — Bare Metal + Slurm + RoCE — add a scheduler without leaving bare metal.
Design 3 — Kubernetes + Physical NIC + RoCE — K8s orchestration, no SR-IOV.
Design 5 — Hybrid: K8s + Slurm + RoCE — real-world large operator.
