Design 5 — Hybrid: Kubernetes + Slurm + RoCE

This is what large real-world operators end up running: Kubernetes for the elastic workloads (inference, monitoring, data pipelines, CI) and Slurm for the synchronous bulk training jobs that need bare-metal performance. Both schedulers sit on the same RoCE fabric.

Best for: Organizations running both training and inference, or migrating gradually from a Slurm legacy to a K8s future without rebuilding from scratch.

Trade-offs: Two control planes to operate. Resource partitioning between Kubernetes and Slurm is a permanent boundary-management problem.

After this page, you'll be able to

Walk the 15-layer stack for this design and name where it breaks: two schedulers share one RoCE fabric, so the new failure points sit at the Kubernetes/Slurm boundary, a boundary the single-scheduler designs never have.
See why you'd run two control planes at once — Slurm for the synchronous bulk training that wants bare-metal performance, Kubernetes for the elastic inference, monitoring, and CI alongside it.
Decide when the extra complexity is worth it — pick it only when you genuinely run both workload types, and avoid it when one scheduler already covers you, because the boundary management isn't free.

Architecture

Build steps — the 15 layers

When to pick this design

Pick this when:

You genuinely run both workload types — synchronous bulk training that wants bare-metal Slurm performance, and elastic inference, dashboards, and CI that want Kubernetes.
You're migrating gradually from a Slurm legacy toward a K8s future and need both to coexist without a rebuild from scratch.
You can afford to operate two control planes and own the resource boundary between them.
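
Owning the resource boundary means, at minimum, that every node belongs to exactly one scheduler at any given time. The sketch below shows one way to check that against a node inventory. It is illustrative only: the `Node` class, the `check_partition` function, and the node names are hypothetical and are not part of Kubernetes or Slurm.

```python
from dataclasses import dataclass

# Hypothetical inventory model; names and fields are assumptions for illustration.
VALID_SCHEDULERS = {"slurm", "kubernetes"}


@dataclass(frozen=True)
class Node:
    name: str
    scheduler: str  # expected to be "slurm" or "kubernetes"


def check_partition(nodes):
    """Return a list of problems; an empty list means the boundary is clean."""
    problems = []
    owners = {}  # node name -> first scheduler seen for that node
    for node in nodes:
        if node.scheduler not in VALID_SCHEDULERS:
            problems.append(f"{node.name}: unknown scheduler {node.scheduler!r}")
            continue
        previous = owners.setdefault(node.name, node.scheduler)
        if previous != node.scheduler:
            problems.append(
                f"{node.name}: claimed by both {previous} and {node.scheduler}"
            )
    return problems


inventory = [
    Node("gpu-01", "slurm"),
    Node("gpu-02", "slurm"),
    Node("gpu-03", "kubernetes"),
    Node("gpu-03", "slurm"),  # deliberate conflict: same node in both pools
]
print(check_partition(inventory))
```

A check like this could run whenever the inventory changes, so a node claimed by both schedulers is caught before either one places work on it.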

Avoid it when:

One scheduler already covers your workloads — running both just to have both buys you a second control plane and a permanent partitioning problem for nothing.
You don't have the operational headcount to keep two scheduling stacks healthy and the K8s/Slurm node boundary tuned.
Your workload is purely training (use Design 2) or purely cloud-native (use Design 1) — the hybrid only pays off when both are real.
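
The pick/avoid rules above can be condensed into a small decision helper. This is a minimal sketch of the guidance on this page, not a formal sizing tool: the function name, its parameters, and the fallback wording for the under-staffed case are assumptions made for illustration.

```python
def recommend_design(runs_training: bool,
                     runs_elastic: bool,
                     can_operate_two_control_planes: bool) -> str:
    """Map the workload mix to one of the designs named on this page."""
    if runs_training and runs_elastic:
        if can_operate_two_control_planes:
            return "Design 5 (hybrid: Kubernetes + Slurm + RoCE)"
        # Assumption: without the headcount, consolidating on one scheduler
        # is the safer choice; the page only says to avoid the hybrid here.
        return "Avoid the hybrid: consolidate on a single scheduler"
    if runs_training:
        return "Design 2 (Bare Metal + Slurm + RoCE)"
    if runs_elastic:
        return "Design 1 (Kubernetes + SR-IOV + RoCE)"
    return "No design indicated: no workload described"


print(recommend_design(True, True, True))
print(recommend_design(True, False, True))
print(recommend_design(False, True, False))
```

The ordering of the checks mirrors the text: the hybrid is only considered once both workload types are real, and the staffing question is asked before committing to two control planes.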

💡 What you should remember

#		Concept	Why it matters
1	⚖️	Two schedulers, one RoCE fabric	Slurm runs the bulk training, Kubernetes runs the elastic services — both share the same lossless Ethernet underneath.
2	🧱	The node boundary is a permanent problem	Partitioning hardware between K8s and Slurm is an ongoing tuning job, not a one-time setup — the boundary is where this design earns its complexity.
3	🔀	It's the real-world large-operator pattern	Organizations running both training and inference end up here, often while migrating from a Slurm legacy toward K8s.
4	💰	Complexity isn't free	Two control planes only pay off when you genuinely run both workload types — if one scheduler covers you, this is overbuilding.
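
To make "an ongoing tuning job" concrete, here is a minimal sketch of one possible rebalancing rule: move idle nodes toward whichever side has a backlog, capped per adjustment. The rule, the `propose_shift` function, and its parameters are all hypothetical. A real operator would also need to drain a node from one scheduler before handing it to the other, which this sketch does not model.

```python
def propose_shift(slurm_idle, slurm_backlog, k8s_idle, k8s_backlog, max_move=4):
    """Suggest moving idle nodes across the boundary.

    Counts are in nodes. Returns (source, destination, count), or None
    when no move is suggested. Assumed policy: only move nodes away from
    a side that is idle and has no backlog of its own.
    """
    if slurm_backlog > 0 and k8s_idle > 0 and k8s_backlog == 0:
        return ("kubernetes", "slurm", min(k8s_idle, slurm_backlog, max_move))
    if k8s_backlog > 0 and slurm_idle > 0 and slurm_backlog == 0:
        return ("slurm", "kubernetes", min(slurm_idle, k8s_backlog, max_move))
    return None  # both sides busy, or nothing waiting


# Slurm has six nodes' worth of queued training; Kubernetes has three idle nodes.
print(propose_shift(slurm_idle=0, slurm_backlog=6, k8s_idle=3, k8s_backlog=0))
```

When both sides have a backlog the function returns `None`, which reflects the point in the table: there is no automatic answer at the boundary, only a policy someone has to own.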

What's next

Design 1 — Kubernetes + SR-IOV + RoCE — the pure cloud-native multi-tenant design.
Design 2 — Bare Metal + Slurm + RoCE — the pure classic-HPC alternative.
Design 3 — Kubernetes + Physical NIC + RoCE — simpler K8s, no SR-IOV.
Design 4 — Bare Metal + MPI + RoCE — minimal lab setup.
