
Design 5 — Hybrid: Kubernetes + Slurm + RoCE

This is what large real-world operators typically end up running. Kubernetes handles the elastic workloads: inference, monitoring, data pipelines, and CI. Slurm handles the synchronous bulk training jobs that need bare-metal performance. Both run on the same RoCE fabric underneath.

Best for: Organizations running both training and inference, or migrating gradually from a Slurm legacy to a K8s future without rebuilding from scratch.

Trade-offs: Two control planes to operate. Resource partitioning between Kubernetes and Slurm is a permanent boundary-management problem.
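
To make the boundary-management problem concrete, here is a minimal sketch of the kind of check an operator might run against a static node split. It assumes each node belongs to exactly one scheduler at a time. The `Node` type, the `check_partition` helper, and the node names and GPU counts are all hypothetical and for illustration only; the sketch does not call any Kubernetes or Slurm API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical inventory record: a host name and its GPU count."""
    name: str
    gpus: int


def check_partition(k8s_nodes, slurm_nodes):
    """Return (overlap, k8s_gpus, slurm_gpus) for two node pools.

    overlap lists node names claimed by both schedulers, which is the
    condition a static split is meant to rule out.
    """
    k8s_names = {n.name for n in k8s_nodes}
    slurm_names = {n.name for n in slurm_nodes}
    overlap = sorted(k8s_names & slurm_names)
    k8s_gpus = sum(n.gpus for n in k8s_nodes)
    slurm_gpus = sum(n.gpus for n in slurm_nodes)
    return overlap, k8s_gpus, slurm_gpus


if __name__ == "__main__":
    # Illustrative inventory only; gpu-02 is deliberately listed twice.
    k8s_pool = [Node("gpu-01", 8), Node("gpu-02", 8)]
    slurm_pool = [Node("gpu-02", 8), Node("gpu-03", 8)]
    overlap, k8s_gpus, slurm_gpus = check_partition(k8s_pool, slurm_pool)
    print("double-claimed nodes:", overlap)
    print("GPUs per side:", k8s_gpus, slurm_gpus)
```

A check like this only catches inventory drift. Moving a node across the boundary still means draining it on one scheduler before admitting it on the other, which is why the split needs ongoing management.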

Architecture

Build steps — the 15 layers