DESIGN 05 / 05 — Hybrid: Kubernetes + Slurm + RoCE

Overview: The most sophisticated of the five designs. Kubernetes and Slurm run on the same cluster and share the same RoCE fabric and GPUs. Kubernetes handles containerized AI/ML workloads and inference services; Slurm handles traditional HPC batch jobs and research workloads. A resource broker coordinates between the two so they never claim the same GPUs at the same time; Kubernetes-native batch schedulers such as Volcano or Kueue can provide the queueing and quota side of that role on the Kubernetes half. Used by Google, Meta, and large AI labs.

Best for: Large enterprises and research labs needing both cloud-native (Kubernetes) and traditional HPC (Slurm) workflows on one cluster.
Trade-offs: Highest complexity; requires careful GPU resource partitioning and a team skilled in operating both orchestrators.
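
The broker's job can be pictured as handing out exclusive node leases, so that a GPU node is visible to only one scheduler at a time. The sketch below is a toy model under that assumption, not how Volcano or Kueue work internally; `NodeBroker`, its methods, and the node names are all hypothetical. In a real cluster the lease change would presumably be enforced with each scheduler's own controls (for example draining the node in Slurm or cordoning it in Kubernetes).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

SCHEDULERS = ("kubernetes", "slurm")


@dataclass
class NodeBroker:
    """Toy broker (hypothetical): each GPU node is leased to at most one scheduler."""

    nodes: List[str]
    owner: Dict[str, Optional[str]] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Every node starts unleased.
        for node in self.nodes:
            self.owner.setdefault(node, None)

    def acquire(self, scheduler: str, count: int) -> List[str]:
        """Lease `count` free nodes to `scheduler`, all-or-nothing."""
        if scheduler not in SCHEDULERS:
            raise ValueError(f"unknown scheduler: {scheduler}")
        free = [n for n in self.nodes if self.owner[n] is None]
        if len(free) < count:
            return []  # not enough free nodes; grant nothing
        granted = free[:count]
        for node in granted:
            self.owner[node] = scheduler
        return granted

    def release(self, scheduler: str, nodes: List[str]) -> None:
        """Return nodes to the free pool; only the current owner may release."""
        for node in nodes:
            if self.owner.get(node) != scheduler:
                raise PermissionError(f"{scheduler} does not hold {node}")
        for node in nodes:
            self.owner[node] = None


broker = NodeBroker(["gpu01", "gpu02", "gpu03", "gpu04"])
k8s_nodes = broker.acquire("kubernetes", 3)   # inference services take three nodes
denied = broker.acquire("slurm", 2)           # only one node left, so nothing is granted
slurm_nodes = broker.acquire("slurm", 1)      # a smaller batch job fits
```

The all-or-nothing grant is a deliberate choice in this sketch: a multi-node training or MPI job that receives only part of its request would hold GPUs without being able to start.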

Best For

✦ Large enterprises and research labs
✦ Teams needing both cloud-native and HPC workflows
✦ Organizations with mixed workload types
✦ Used by Google, Meta, and large AI labs
✦ Maximum flexibility on one shared cluster

Trade-Offs

⚠ Highest setup and operational complexity
⚠ Requires careful GPU resource partitioning
⚠ Needs a skilled team to operate both schedulers
⚠ Resource broker adds another layer to run and monitor
⚠ Hardest to debug when things go wrong
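
The partitioning concern above can be checked mechanically before a plan is rolled out. The following is a minimal sketch, assuming the partition plan is written down as a mapping from scheduler to node to GPU indices; the `Plan` type, the `find_gpu_conflicts` helper, and the node names are hypothetical and not part of Kubernetes or Slurm.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Hypothetical plan format: scheduler name -> node name -> GPU indices it may use.
Plan = Dict[str, Dict[str, List[int]]]


def find_gpu_conflicts(plan: Plan) -> Dict[Tuple[str, int], Set[str]]:
    """Return every (node, gpu_index) claimed by more than one scheduler."""
    claims: Dict[Tuple[str, int], Set[str]] = defaultdict(set)
    for scheduler, nodes in plan.items():
        for node, gpus in nodes.items():
            for gpu in gpus:
                claims[(node, gpu)].add(scheduler)
    return {key: owners for key, owners in claims.items() if len(owners) > 1}


plan: Plan = {
    "kubernetes": {"gpu01": [0, 1, 2, 3], "gpu02": [0, 1]},
    "slurm": {"gpu02": [1, 2, 3], "gpu03": [0, 1, 2, 3]},
}

conflicts = find_gpu_conflicts(plan)
for (node, gpu), owners in sorted(conflicts.items()):
    print(f"{node} GPU {gpu} is claimed by: {', '.join(sorted(owners))}")
```

In this sample plan, GPU 1 on `gpu02` is assigned to both schedulers, which is the kind of overlap that is hard to trace once jobs start failing at runtime.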