Design 1 — Kubernetes + SR-IOV + RoCE Architecture

DESIGN 01 / 05 — KUBERNETES + SR-IOV + RoCE (HIGH-LEVEL HPC ARCHITECTURE)

Best for: Large shared clusters, multi-tenant environments [cite: Overview]

4. ML/AI APPLICATIONS & TRAINING [cite: L7, L12–L13, L15]

📦

[cite: L7]

NCCL in Container Images

Bake NCCL into the container image,
e.g. on the pytorch:24.01-py3 base image

→

Pod

GPU

→

🧮

[cite: L12]

Training Code Calls NCCL

The backward pass triggers a
gradient all-reduce across GPUs

→

🔍

[cite: L12.5]

NCCL Rank Discovery

Reads the RANK, WORLD_SIZE, and
MASTER_ADDR environment variables
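
A minimal sketch of the discovery step, assuming a PyTorch-style launcher in which these variables are read by the process before any NCCL communicator is built. `RankInfo` and `read_rank_info` are hypothetical names, and `MASTER_PORT` with a fallback of 29500 is an assumption that does not appear in the diagram.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class RankInfo:
    rank: int
    world_size: int
    master_addr: str
    master_port: int


def read_rank_info(env=None) -> RankInfo:
    """Parse the rendezvous variables injected into each training pod."""
    env = os.environ if env is None else env
    missing = [key for key in ("RANK", "WORLD_SIZE", "MASTER_ADDR") if key not in env]
    if missing:
        raise RuntimeError(f"missing rendezvous variables: {missing}")
    info = RankInfo(
        rank=int(env["RANK"]),
        world_size=int(env["WORLD_SIZE"]),
        master_addr=env["MASTER_ADDR"],
        # MASTER_PORT is not in the diagram; 29500 is an assumed fallback.
        master_port=int(env.get("MASTER_PORT", "29500")),
    )
    if not 0 <= info.rank < info.world_size:
        raise ValueError(f"rank {info.rank} is outside world size {info.world_size}")
    return info
```

Failing fast on a missing or inconsistent variable tends to be easier to debug than a rendezvous that hangs until a timeout.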

→

🌐

[cite: L13]

Initialize Comm Topology

Ring or Tree all-reduce
across all GPU ranks
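
To make the ring variant concrete, here is a pure-Python simulation of a ring all-reduce (a reduce-scatter phase followed by an all-gather phase). It is an illustration of the algorithm only, not NCCL's implementation; `ring_all_reduce` is a hypothetical name, and each inner list stands in for one rank's gradient buffer.

```python
def ring_all_reduce(buffers):
    """Sum equal-length buffers across ranks using a simulated ring."""
    n = len(buffers)
    if n == 0:
        return []
    length = len(buffers[0])
    if any(len(b) != length for b in buffers) or length % n:
        raise ValueError("buffers must share a length divisible by the rank count")
    data = [list(b) for b in buffers]
    if n == 1:
        return data
    chunk = length // n

    def span(index):
        # Element positions covered by chunk `index` (wrapping around the ring).
        start = (index % n) * chunk
        return range(start, start + chunk)

    # Phase 1, reduce-scatter: after n-1 steps rank r owns the full sum of chunk r+1.
    for step in range(n - 1):
        # Snapshot outgoing chunks so every send in a step happens "simultaneously".
        outgoing = [[data[r][i] for i in span(r - step)] for r in range(n)]
        for r in range(n):
            dst = (r + 1) % n
            for i, value in zip(span(r - step), outgoing[r]):
                data[dst][i] += value

    # Phase 2, all-gather: circulate the finished chunks until every rank has all of them.
    for step in range(n - 1):
        outgoing = [[data[r][i] for i in span(r + 1 - step)] for r in range(n)]
        for r in range(n):
            dst = (r + 1) % n
            for i, value in zip(span(r + 1 - step), outgoing[r]):
                data[dst][i] = value
    return data
```

Each rank sends and receives only one chunk per step, which is why the ring keeps per-link traffic roughly constant as the number of ranks grows.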

→

✅

[cite: L15]

Training Job Completes

Pods tear down, GPUs released
Checkpoint → Shared Storage
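
One way to write the final checkpoint safely is to write to a temporary file and rename it into place, so readers never observe a half-written file. This is a sketch under assumptions: `save_checkpoint` and the `step-XXXXXXXX.ckpt` naming are hypothetical, and rename atomicity depends on the shared filesystem in use.

```python
import os
import tempfile
from pathlib import Path


def save_checkpoint(payload: bytes, directory, step: int) -> Path:
    """Write a checkpoint next to its final path, then rename it into place."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    final_path = directory / f"step-{step:08d}.ckpt"  # hypothetical naming scheme
    # Create the temp file in the same directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())  # push the bytes to storage before renaming
        os.replace(tmp_name, final_path)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return final_path
```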

⬇

3. KUBERNETES ORCHESTRATION [cite: L3.5, L4, L9, L10]

☸️

[cite: L4]

Install Kubernetes

kubeadm init on the master node,
kubeadm join on each worker node

→

📡

[cite: L3.5]

GPU Resource Tracking

NVIDIA Device Plugin
DaemonSet on every node

→

☸️

[cite: L4]

Cluster Control Plane
(Master Node)

☸ Kubelet

☸ Kubelet

→

📋

[cite: L9]

Create Pod Specs

GPU requests + SR-IOV VF
annotations + NCCL env vars
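
A hedged sketch of what such a pod spec might contain, built as a Python dict and serialized to JSON (which `kubectl apply -f` accepts). `nvidia.com/gpu` and the `k8s.v1.cni.cncf.io/networks` annotation are the standard names used by the NVIDIA device plugin and Multus; the network name `roce-sriov-net`, the VF resource name `example.com/sriov_rdma_vf`, the `NCCL_IB_GID_INDEX` value, and the `IPC_LOCK` capability are assumptions for illustration.

```python
import json


def build_training_pod(name, image, gpus, rank, world_size, master_addr,
                       network="roce-sriov-net",
                       vf_resource="example.com/sriov_rdma_vf"):
    """Return a Pod manifest requesting GPUs, one SR-IOV VF, and NCCL settings."""
    env = {
        "RANK": str(rank),
        "WORLD_SIZE": str(world_size),
        "MASTER_ADDR": master_addr,
        "NCCL_DEBUG": "INFO",
        "NCCL_IB_GID_INDEX": "3",  # assumption: the RoCE v2 GID index varies per host
    }
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": name,
            # Multus reads this annotation to attach the secondary network.
            "annotations": {"k8s.v1.cni.cncf.io/networks": network},
        },
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": "trainer",
                "image": image,
                "env": [{"name": k, "value": v} for k, v in env.items()],
                "resources": {"limits": {"nvidia.com/gpu": gpus, vf_resource: 1}},
                # Assumption: RDMA memory registration commonly needs IPC_LOCK.
                "securityContext": {"capabilities": {"add": ["IPC_LOCK"]}},
            }],
        },
    }


if __name__ == "__main__":
    pod = build_training_pod("train-0", "pytorch:24.01-py3", 8, 0, 16, "train-0")
    print(json.dumps(pod, indent=2))
```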

→

📌

[cite: L10]

Scheduler Places Pods

Selects a node for each pod based on
the GPU availability each node reports
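
The placement decision can be pictured as a filter followed by a score. The toy function below is only an illustration: `pick_node` is a hypothetical name, and the tightest-fit preference is an assumed policy, not kube-scheduler's default scoring.

```python
def pick_node(free_gpus, requested):
    """Return a node with enough free GPUs, preferring the tightest fit.

    free_gpus maps node name -> unallocated GPU count.
    Returns None when no node can host the pod, which leaves it pending.
    """
    # Filter: drop nodes that cannot satisfy the request.
    candidates = {node: free for node, free in free_gpus.items() if free >= requested}
    if not candidates:
        return None
    # Score: fewest leftover GPUs first, with the node name as a stable tie-break.
    return min(candidates, key=lambda node: (candidates[node], node))
```

Packing pods tightly keeps whole nodes free for jobs that need every GPU on a server, which is one reason a shared cluster might prefer it.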

⬇

2. NETWORKING LAYER (RoCE FABRIC) [cite: L4.5, L5, L6, L11, L13.5, L14]

🔌

[cite: L4.5]

Multus CNI Plugin

Enables multiple network interfaces per pod.
Without Multus, a pod gets only one interface.

⚠ Critical: attaching an SR-IOV VF as a
secondary pod interface requires Multus.

→

🔀

[cite: L6]

Ethernet Switches

Leaf-Spine Topology

⟶

🖧

[cite: L11]

CNI Assigns SR-IOV VFs & IPs

Each pod gets dedicated
VF + IP for RDMA traffic
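
As a sketch of how the VF and IP assignment is usually declared, the function below builds a Multus `NetworkAttachmentDefinition` for the SR-IOV CNI. The `k8s.cni.cncf.io/v1` API group and the `k8s.v1.cni.cncf.io/resourceName` annotation are the standard ones; the whereabouts IPAM plugin, the subnet, and all object names are assumptions chosen for illustration.

```python
import json


def build_sriov_network(name, resource_name, subnet, namespace="default"):
    """Return a NetworkAttachmentDefinition that hands each pod one VF and one IP."""
    cni_config = {
        "cniVersion": "0.3.1",
        "name": name,
        "type": "sriov",
        # Assumption: whereabouts is one IPAM option that allocates cluster-wide.
        "ipam": {"type": "whereabouts", "range": subnet},
    }
    return {
        "apiVersion": "k8s.cni.cncf.io/v1",
        "kind": "NetworkAttachmentDefinition",
        "metadata": {
            "name": name,
            "namespace": namespace,
            # Ties this network to the VF pool advertised as an extended resource.
            "annotations": {"k8s.v1.cni.cncf.io/resourceName": resource_name},
        },
        # Multus expects the CNI configuration as a JSON string.
        "spec": {"config": json.dumps(cni_config)},
    }


if __name__ == "__main__":
    nad = build_sriov_network("roce-sriov-net", "example.com/sriov_rdma_vf",
                              "192.168.100.0/24")
    print(json.dumps(nad, indent=2))
```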

⟶

🔗

[cite: L13.5]

NCCL Creates RDMA Queue Pairs

Send queue + receive queue (the queue pair)
+ completion queue per GPU-to-GPU connection

⟶

800 Gbps RDMA

⚡ DIRECT RDMA PATH
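
For a rough sense of what an 800 Gbps link means for training, the sketch below applies the standard bandwidth term for a ring all-reduce, in which each rank transfers 2(N-1)/N times the message size. It is a back-of-the-envelope estimate that assumes one 800 Gbps link per rank and ignores latency and protocol overhead; `ring_all_reduce_seconds` is a hypothetical name.

```python
def ring_all_reduce_seconds(message_bytes, ranks, link_gbps=800.0):
    """Estimate the bandwidth-bound time of a ring all-reduce, in seconds."""
    if ranks < 2:
        return 0.0
    bytes_per_second = link_gbps * 1e9 / 8  # 800 Gbps is 100 GB/s
    # Reduce-scatter and all-gather each move (N-1)/N of the buffer per rank.
    traffic_bytes = 2 * (ranks - 1) / ranks * message_bytes
    return traffic_bytes / bytes_per_second


if __name__ == "__main__":
    # 10 GB of gradients across 8 ranks under the assumptions above.
    print(f"{ring_all_reduce_seconds(10e9, 8):.3f} s")
```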

⬇

1. INFRASTRUCTURE & BARE METAL [cite: L1–L3, L8]

🐧

[cite: L1]

Install Linux OS

Ubuntu 22.04 LTS
Directly on hardware

→

NVIDIA B300 Servers

→

🎮

[cite: L3]

Install NVIDIA Drivers & CUDA

CUDA Toolkit 12.3
nvidia-smi verification
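
The verification step can be scripted. This sketch shells out to `nvidia-smi` with its CSV query flags and parses the rows; `parse_gpu_rows` and `list_gpus` are hypothetical helper names, and the choice of queried fields is an assumption.

```python
import shutil
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=index,name,driver_version",
         "--format=csv,noheader"]


def parse_gpu_rows(output):
    """Turn 'index, name, driver' CSV lines into a list of dicts."""
    rows = []
    for line in output.strip().splitlines():
        index, name, driver = [field.strip() for field in line.split(",")]
        rows.append({"index": int(index), "name": name, "driver_version": driver})
    return rows


def list_gpus():
    """Return the GPUs the driver reports, or an empty list if nvidia-smi is absent."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_rows(result.stdout)


if __name__ == "__main__":
    gpus = list_gpus()
    print(f"{len(gpus)} GPU(s) visible")
    for gpu in gpus:
        print(gpu)
```

An empty list on a GPU server usually points at a missing or failed driver install, which is worth catching before Kubernetes is layered on top.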

→

⚡

[cite: L2]

Enable SR-IOV on 800G NICs

Up to 16 VFs per NIC
sriov_numvfs = 16
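
On Linux the VF count is set by writing to the NIC's `sriov_numvfs` sysfs file, bounded by `sriov_totalvfs`. The helper below is a sketch under assumptions: `enable_vfs` is a hypothetical name, the interface name `ens1f0` is made up, root privileges are required on a real host, and the setting does not persist across reboots without further configuration.

```python
from pathlib import Path


def enable_vfs(interface, count=16, sysfs_root="/sys/class/net"):
    """Set the number of SR-IOV virtual functions on a physical NIC."""
    device = Path(sysfs_root) / interface / "device"
    total = int((device / "sriov_totalvfs").read_text().strip())
    if not 0 <= count <= total:
        raise ValueError(f"{interface} supports at most {total} VFs, asked for {count}")
    numvfs = device / "sriov_numvfs"
    current = int(numvfs.read_text().strip())
    if current == count:
        return current
    if current != 0:
        # The kernel rejects changing a nonzero VF count directly; reset to 0 first.
        numvfs.write_text("0")
    if count != 0:
        numvfs.write_text(str(count))
    return count


if __name__ == "__main__":
    # "ens1f0" is a placeholder; substitute the 800G NIC's interface name.
    print(enable_vfs("ens1f0", 16))
```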

STORAGE NETWORK

[cite: L8]

🗄 Shared Storage

NFS / Lustre
Object Storage
Checkpoints
Datasets