Design 5 — Hybrid: Kubernetes + Slurm + RoCE
This is what large real-world operators end up running: Kubernetes for the elastic workloads (inference, monitoring, data pipelines, CI) and Slurm for the synchronous bulk training jobs that need bare-metal performance, with both on the same RoCE fabric underneath.
Best for: Organizations running both training and inference, or migrating gradually from a Slurm legacy to a K8s future without rebuilding from scratch.
Trade-offs: Two control planes to operate. Resource partitioning between Kubernetes and Slurm is a permanent boundary-management problem.
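To make the boundary-management trade-off concrete, here is a minimal sketch that models the split as a mapping from each node to exactly one owning scheduler. Everything in it is an assumption for illustration: the `Owner` enum, the function names, and the node names are hypothetical, and it models bookkeeping only, not any real Kubernetes or Slurm API.

```python
from __future__ import annotations

from enum import Enum


class Owner(Enum):
    """Hypothetical label for which scheduler owns a node."""
    KUBERNETES = "kubernetes"
    SLURM = "slurm"


def validate_partition(all_nodes: set[str], assignment: dict[str, Owner]) -> None:
    """Raise ValueError unless every known node has exactly one owner.

    A dict holds one value per key, so a node cannot be owned twice;
    what remains to check is nodes left out and nodes that do not exist.
    """
    unassigned = all_nodes - set(assignment)
    unknown = set(assignment) - all_nodes
    if unassigned:
        raise ValueError(f"nodes with no scheduler: {sorted(unassigned)}")
    if unknown:
        raise ValueError(f"assignment names unknown nodes: {sorted(unknown)}")


def move_nodes(
    assignment: dict[str, Owner], nodes: list[str], new_owner: Owner
) -> dict[str, Owner]:
    """Return a new assignment with `nodes` handed to `new_owner`.

    A real handover would first drain each node from its current
    scheduler; that step is deliberately not modeled here.
    """
    missing = [n for n in nodes if n not in assignment]
    if missing:
        raise ValueError(f"cannot move unknown nodes: {missing}")
    updated = dict(assignment)  # copy so the caller's mapping is untouched
    for node in nodes:
        updated[node] = new_owner
    return updated


def pool_sizes(assignment: dict[str, Owner]) -> dict[Owner, int]:
    """Count how many nodes each scheduler currently owns."""
    sizes = {owner: 0 for owner in Owner}
    for owner in assignment.values():
        sizes[owner] += 1
    return sizes
```

Shifting the boundary then becomes an explicit, reviewable operation, for example `move_nodes(assignment, ["gpu-03"], Owner.SLURM)` before a large training run, followed by `validate_partition` to confirm no node was dropped along the way.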
After this page, you'll be able to
- Walk the 15-layer stack for this design and name where it breaks: two schedulers share one RoCE fabric, so the new failure points sit at the Kubernetes/Slurm boundary, which the single-scheduler designs never have.
- See why you'd run two control planes at once — Slurm for the synchronous bulk training that wants bare-metal performance, Kubernetes for the elastic inference, monitoring, and CI alongside it.
- Decide when the extra complexity is worth it — pick it only when you genuinely run both workload types, and avoid it when one scheduler already covers you, because the boundary management isn't free.
Architecture
Build steps — the 15 layers
When to pick this design
Pick this when:
- You genuinely run both workload types — synchronous bulk training that wants bare-metal Slurm performance, and elastic inference, dashboards, and CI that want Kubernetes.
- You're migrating gradually from a Slurm legacy toward a K8s future and need both to coexist without a rebuild from scratch.
- You can afford to operate two control planes and own the resource boundary between them.
Avoid it when:
- One scheduler already covers your workloads — running both just to have both buys you a second control plane and a permanent partitioning problem for nothing.
- You don't have the operational headcount to keep two scheduling stacks healthy and the K8s/Slurm node boundary tuned.
- Your workload is purely training (use Design 2) or purely cloud-native (use Design 1) — the hybrid only pays off when both are real.
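The two checklists above can be read as a small decision rule. The sketch below is one hedged way to encode it; the `WorkloadProfile` fields and the `recommend_design` function are hypothetical names introduced here, and the rule is only as precise as the bullets it summarizes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkloadProfile:
    """Hypothetical summary of what an organization runs."""
    runs_bulk_training: bool              # synchronous multi-node training jobs
    runs_elastic_services: bool           # inference, dashboards, CI
    migrating_slurm_to_k8s: bool          # gradual move from Slurm toward K8s
    can_operate_two_control_planes: bool  # headcount for both stacks and the boundary


def recommend_design(p: WorkloadProfile) -> str:
    """Map a profile to the design this page's checklists point toward."""
    needs_both = (
        p.runs_bulk_training and p.runs_elastic_services
    ) or p.migrating_slurm_to_k8s

    # The hybrid only pays off when both needs are real AND it can be staffed.
    if needs_both and p.can_operate_two_control_planes:
        return "Design 5: hybrid Kubernetes + Slurm + RoCE"
    if p.runs_bulk_training and not p.runs_elastic_services:
        return "Design 2: bare metal + Slurm + RoCE"
    if p.runs_elastic_services and not p.runs_bulk_training:
        return "Design 1: Kubernetes + SR-IOV + RoCE"
    # Both workload types are present but two control planes cannot be staffed.
    return "No clear fit: revisit staffing or workload mix before choosing"


if __name__ == "__main__":
    profile = WorkloadProfile(
        runs_bulk_training=True,
        runs_elastic_services=True,
        migrating_slurm_to_k8s=False,
        can_operate_two_control_planes=True,
    )
    print(recommend_design(profile))
```

The last branch is deliberate: when both workload types are real but the team cannot run two scheduling stacks, the checklists give no clean answer, so the sketch says so rather than forcing a choice.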
💡 What you should remember
| # | | Concept | Why it matters |
|---|---|---|---|
| 1 | ⚖️ | Two schedulers, one RoCE fabric | Slurm runs the bulk training, Kubernetes runs the elastic services — both share the same lossless Ethernet underneath. |
| 2 | 🧱 | The node boundary is a permanent problem | Partitioning hardware between K8s and Slurm is an ongoing tuning job, not a one-time setup — the boundary is where this design earns its complexity. |
| 3 | 🔀 | It's the real-world large-operator pattern | Organizations running both training and inference end up here, often while migrating from a Slurm legacy toward K8s. |
| 4 | 💰 | Complexity isn't free | Two control planes only pay off when you genuinely run both workload types — if one scheduler covers you, this is overbuilding. |
What's next
- Design 1 — Kubernetes + SR-IOV + RoCE — the pure cloud-native multi-tenant design.
- Design 2 — Bare Metal + Slurm + RoCE — the pure classic-HPC alternative.
- Design 3 — Kubernetes + Physical NIC + RoCE — simpler K8s, no SR-IOV.
- Design 4 — Bare Metal + MPI + RoCE — minimal lab setup.