
Design 5 — Hybrid: Kubernetes + Slurm + RoCE

This is what large real-world operators end up running: Kubernetes for the elastic workloads (inference, monitoring, data pipelines, CI) and Slurm for the synchronous bulk training jobs that need bare-metal performance, both on the same RoCE fabric.

Best for: Organizations running both training and inference, or migrating gradually from a Slurm legacy to a K8s future without rebuilding from scratch.

Trade-offs: Two control planes to operate. Resource partitioning between Kubernetes and Slurm is a permanent boundary-management problem.
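To make the boundary-management problem concrete, here is a minimal sketch of the invariant it comes down to: every node belongs to exactly one scheduler at a time. The `NodePools` class, its method names, and the node names are hypothetical illustrations, not part of Kubernetes or Slurm, and the sketch assumes a node has already been drained on the side it is leaving.

```python
from dataclasses import dataclass, field


@dataclass
class NodePools:
    """Hypothetical model of the K8s/Slurm node boundary (not a real API)."""

    k8s: set[str] = field(default_factory=set)
    slurm: set[str] = field(default_factory=set)

    def validate(self) -> None:
        # A node claimed by both schedulers would be double-booked.
        overlap = self.k8s & self.slurm
        if overlap:
            raise ValueError(f"nodes claimed by both schedulers: {sorted(overlap)}")

    def move(self, node: str, to: str) -> None:
        # Assumption: the node was drained on the source scheduler first.
        if to not in ("k8s", "slurm"):
            raise ValueError(f"unknown scheduler: {to!r}")
        src, dst = (self.k8s, self.slurm) if to == "slurm" else (self.slurm, self.k8s)
        if node not in src:
            raise KeyError(f"{node} is not owned by the source scheduler")
        src.remove(node)
        dst.add(node)


# Example node names are made up for illustration.
pools = NodePools(k8s={"gpu-01", "gpu-02"}, slurm={"gpu-03", "gpu-04"})
pools.move("gpu-02", to="slurm")  # shift capacity toward a large training run
pools.validate()
```

Moving a node is a two-sided operation, which is why the boundary needs ongoing tuning rather than a one-time setup: each move takes capacity from one scheduler to give it to the other.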

After this page, you'll be able to
  1. Walk the 15-layer stack for this design and name where it breaks: with two schedulers over one RoCE fabric, the new failure points sit at the Kubernetes/Slurm boundary, which the single-scheduler designs never have.
  2. See why you'd run two control planes at once — Slurm for the synchronous bulk training that wants bare-metal performance, Kubernetes for the elastic inference, monitoring, and CI alongside it.
  3. Decide when the extra complexity is worth it — pick it only when you genuinely run both workload types, and avoid it when one scheduler already covers you, because the boundary management isn't free.

Architecture

Build steps — the 15 layers

When to pick this design

Pick this when:

  • You genuinely run both workload types — synchronous bulk training that wants bare-metal Slurm performance, and elastic inference, dashboards, and CI that want Kubernetes.
  • You're migrating gradually from a Slurm legacy toward a K8s future and need both to coexist without a rebuild from scratch.
  • You can afford to operate two control planes and own the resource boundary between them.

Avoid it when:

  • One scheduler already covers your workloads — running both just to have both buys you a second control plane and a permanent partitioning problem for nothing.
  • You don't have the operational headcount to keep two scheduling stacks healthy and the K8s/Slurm node boundary tuned.
  • Your workload is purely training (use Design 2) or purely cloud-native (use Design 1) — the hybrid only pays off when both are real.
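The criteria above can be read as a small decision rule. The sketch below is one hedged way to encode it. The function name, its parameters, and the returned strings are hypothetical, and the fallback for "both workloads but no headcount" is an assumption, since the guidance only says to avoid the hybrid in that case.

```python
def pick_design(
    runs_bulk_training: bool,
    runs_elastic_services: bool,
    can_operate_two_control_planes: bool,
) -> str:
    """Hypothetical encoding of the pick/avoid criteria for the hybrid design."""
    if runs_bulk_training and runs_elastic_services:
        if can_operate_two_control_planes:
            return "Design 5 (hybrid Kubernetes + Slurm)"
        # Assumption: without the headcount, fall back to a single scheduler.
        return "single scheduler (hybrid needs more operational headcount)"
    if runs_bulk_training:
        return "Design 2"  # purely training
    if runs_elastic_services:
        return "Design 1"  # purely cloud-native
    return "no GPU scheduling design needed"


print(pick_design(True, True, True))
print(pick_design(True, False, True))
```

The hybrid branch is reached only when both workload types are real and the team can own two control planes. Every other path resolves to a single scheduler.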

💡 What you should remember

# | Concept | Why it matters
1 | ⚖️ Two schedulers, one RoCE fabric | Slurm runs the bulk training, Kubernetes runs the elastic services; both share the same lossless Ethernet underneath.
2 | 🧱 The node boundary is a permanent problem | Partitioning hardware between K8s and Slurm is an ongoing tuning job, not a one-time setup. The boundary is where this design earns its complexity.
3 | 🔀 It's the real-world large-operator pattern | Organizations running both training and inference end up here, often while migrating from a Slurm legacy toward K8s.
4 | 💰 Complexity isn't free | Two control planes only pay off when you genuinely run both workload types. If one scheduler covers you, this is overbuilding.
