Traditional HPC · High Performance · Manual Control
Overview
A traditional HPC approach with no containers or Kubernetes. Slurm schedules jobs directly on bare-metal servers, and NCCL and CUDA are installed system-wide. Jobs use the physical NICs directly, with no SR-IOV virtual functions. RoCE provides RDMA over an Ethernet fabric configured to be lossless. The setup is simpler, but resource sharing is less flexible.
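As a rough illustration of this workflow, the sketch below renders a Slurm batch script for a multi-node NCCL job. It is not a reference configuration. The job name, the `mlx5_0` RDMA device, the GID index of 3 (a common choice for RoCE v2), the one-task-per-GPU layout, and the `python train.py` command are all assumptions for illustration; check the real values on your own nodes before use.

```python
from dataclasses import dataclass


@dataclass
class JobSpec:
    """Parameters for a hypothetical multi-node NCCL training job."""
    name: str
    nodes: int
    gpus_per_node: int
    rdma_device: str = "mlx5_0"        # assumption: name of the physical RoCE NIC
    gid_index: int = 3                 # assumption: RoCE v2 GID index, verify per node
    command: str = "python train.py"   # hypothetical application entry point


def render_sbatch(spec: JobSpec) -> str:
    """Build the text of an sbatch script that runs one task per GPU."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={spec.name}",
        f"#SBATCH --nodes={spec.nodes}",
        f"#SBATCH --ntasks-per-node={spec.gpus_per_node}",
        f"#SBATCH --gres=gpu:{spec.gpus_per_node}",
        "#SBATCH --exclusive",  # whole nodes, matching the single-tenant model
        "",
        # Point NCCL at the physical NIC; there are no SR-IOV virtual functions.
        f"export NCCL_IB_HCA={spec.rdma_device}",
        f"export NCCL_IB_GID_INDEX={spec.gid_index}",
        "export NCCL_DEBUG=INFO",
        "",
        # srun starts the tasks on the allocated bare-metal nodes.
        f"srun {spec.command}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_sbatch(JobSpec(name="nccl-train", nodes=4, gpus_per_node=8)))
```

The rendered text can be saved to a file and submitted with `sbatch`. Because CUDA and NCCL are installed system-wide, the script needs no container image or module setup beyond the environment variables shown.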
Best For
✦ Dedicated research clusters
✦ Academic HPC centers
✦ Teams preferring traditional HPC workflows
✦ Single-tenant dedicated clusters
✦ Teams with existing Slurm expertise
Trade-Offs
⚠ Less multi-tenancy support
⚠ Manual resource management
⚠ Harder to share GPUs between users
⚠ No Kubernetes-style pod scheduling; jobs are submitted to Slurm explicitly
⚠ All software must be installed system-wide
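Since every node must carry the same system-wide software, a small preflight check can catch a node that is missing a piece before a job lands on it. The sketch below is one possible approach, not an established tool: the list of libraries and commands is an assumption, and `ctypes.util.find_library` only sees libraries known to the system linker, so a CUDA install in a non-standard path may be reported as missing.

```python
import ctypes.util
import shutil
from typing import Callable, Dict, List, Optional

Lookup = Callable[[str], Optional[str]]


def check_node_software(
    find_library: Lookup = ctypes.util.find_library,
    which: Lookup = shutil.which,
) -> Dict[str, bool]:
    """Report which expected system-wide components are visible on this node.

    The lookups are injectable so the check can be exercised without a GPU node.
    """
    return {
        "libcudart": find_library("cudart") is not None,   # CUDA runtime
        "libnccl": find_library("nccl") is not None,       # NCCL
        "nvidia-smi": which("nvidia-smi") is not None,     # NVIDIA driver tools
        "srun": which("srun") is not None,                 # Slurm client
    }


def missing(report: Dict[str, bool]) -> List[str]:
    """Return the names of components that were not found, sorted."""
    return sorted(name for name, found in report.items() if not found)


if __name__ == "__main__":
    absent = missing(check_node_software())
    print("all components found" if not absent else f"missing: {', '.join(absent)}")
```

Running this on each node, for example through `srun` across an allocation, gives a quick view of whether the software stack is consistent across the cluster.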