
Operators and Helm

You've installed Kubernetes. You've installed Multus and the SR-IOV CNI. But you don't yet have a working AI cluster, because the GPU drivers, the OFED driver, the device plugins, topology discovery, and the SR-IOV inventory aren't set up.

That's the job of Operators. This page covers what they are, which ones you'll install, and how to deploy them with Helm.


What an Operator is

A Kubernetes Operator is a controller that manages a complex software stack as if it were a single k8s resource.

Without an Operator, installing the NVIDIA driver on every node looks like:

  1. SSH into every node
  2. dpkg -i nvidia-driver-*.deb
  3. Reboot
  4. Hope nothing breaks
  5. Repeat in 6 months when you upgrade

With the NVIDIA GPU Operator, it looks like:

  1. helm install gpu-operator nvidia/gpu-operator
  2. The Operator deploys a DaemonSet to every node
  3. The DaemonSet installs the driver, the container runtime hook, the Device Plugin, DCGM exporter
  4. The Operator monitors them; if a node's driver gets out of sync, it fixes it
  5. Upgrading is a one-line Helm change

The Operator pattern is declarative — you describe what state you want, the Operator makes it true and keeps it true.

Internally, an Operator is:

  • A Custom Resource Definition (CRD) that defines a new k8s resource type (e.g., ClusterPolicy)
  • A Controller (a pod that runs continuously) that watches that resource and acts on it
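
You can see this pattern directly on a live cluster. Once the GPU Operator (installed below) is running, its CRD and its custom resources are visible through plain kubectl:

# The CRD shows up alongside the built-in API types
kubectl get crds | grep nvidia.com

# ...and instances of the new type are queryable like any built-in kind
kubectl get clusterpolicies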

Operators are how complex software (databases, ML pipelines, drivers) gets packaged for k8s. There are hundreds of them; AI clusters use a specific handful.


The Operators you'll meet on an AI cluster

Operator | What it manages | When to install
NVIDIA GPU Operator | NVIDIA driver, container toolkit, Device Plugin, DCGM exporter, Node Feature Discovery | Every cluster with GPUs
NVIDIA Network Operator | Mellanox OFED driver, RDMA shared device plugin, IB-K8s, kernel modules | Every cluster with RDMA NICs
SR-IOV Network Operator | VF inventory, NetworkAttachmentDefinitions, SR-IOV CNI | Every cluster doing SR-IOV (= every AI cluster)
Node Feature Discovery (NFD) | Labels nodes with hardware features (CPU model, NUMA topology, NIC count) | Always; comes with GPU Operator
Volcano | Batch job scheduler with gang scheduling | If you're running multi-node training
Prometheus Operator | Telemetry (metrics) | For monitoring
Grafana Operator | Dashboards | For visualizing telemetry

The first three are the load-bearing ones for AI clusters. Without them, you'd be managing drivers and CNI configs by hand on hundreds of nodes — error-prone, slow, and a source of constant production debugging.


Helm — the package manager

Helm is the package manager for k8s. A Helm chart is a templatized bundle of k8s YAML — Operators, Deployments, Services, ConfigMaps, all parameterized.
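
Before installing anything, you can render a chart locally to see exactly which k8s objects it would create; helm template does this without touching the cluster (this assumes the nvidia repo added in the next section):

# Render the chart's full YAML locally, without installing
helm template gpu-operator nvidia/gpu-operator -n gpu-operator | less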

Installing Helm

# (uses a keyring file; apt-key is deprecated on current Debian/Ubuntu)
curl -fsSL https://baltocdn.com/helm/signing.asc | gpg --dearmor | sudo tee /usr/share/keyrings/helm.gpg > /dev/null
echo "deb [signed-by=/usr/share/keyrings/helm.gpg] https://baltocdn.com/helm/stable/debian/ all main" | sudo tee /etc/apt/sources.list.d/helm-stable-debian.list
sudo apt update && sudo apt install helm

(Or brew install helm on macOS / a release binary on RHEL.)

Using Helm

# Add a repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install a chart with default values
helm install gpu-operator nvidia/gpu-operator -n gpu-operator --create-namespace

# Install with custom values (production)
helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --version v24.9.1 \
  -f my-values.yaml

# List what's installed
helm list -A

# Upgrade
helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator --version v24.10.0

# Uninstall
helm uninstall gpu-operator -n gpu-operator

The my-values.yaml file overrides chart defaults — driver version, image registry, feature toggles, etc. Every chart publishes its values.yaml reference so you can see what's configurable.
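
To see what those defaults actually are, dump them with helm show values; that output is the full menu of what my-values.yaml can override:

# Dump the chart's default values; anything here can be overridden with -f
helm show values nvidia/gpu-operator > gpu-operator-defaults.yaml
less gpu-operator-defaults.yaml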


A concrete walkthrough: installing the GPU Operator

Step by step:

# 1. Add the NVIDIA Helm repo
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# 2. Create a values file with what you want to customize
cat > gpu-operator-values.yaml <<EOF
driver:
  enabled: true
  version: "550.90.07"
toolkit:
  enabled: true
devicePlugin:
  enabled: true
dcgmExporter:
  enabled: true
nfd:
  enabled: true
mig:
  strategy: none
EOF

# 3. Install
helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  -f gpu-operator-values.yaml

# 4. Wait for everything to come up
kubectl get pods -n gpu-operator -w

Within 5-10 minutes you should see a long list of pods all in Running state:

gpu-operator-77c44... Running
nvidia-driver-daemonset-... Running (one per node with GPUs)
nvidia-container-toolkit-daemonset-... Running (one per node)
nvidia-device-plugin-daemonset-... Running (one per node)
nvidia-dcgm-exporter-... Running (one per node)
gpu-feature-discovery-... Running (one per node)
node-feature-discovery-... Running

If any are CrashLoopBackOff, that's your debug target — kubectl logs <pod> will tell you why.

After the Operator finishes, every node has:

  • NVIDIA driver loaded (installed by the DaemonSet)
  • Container runtime configured to pass through GPUs
  • The nvidia.com/gpu resource registered, so pods can request GPUs

Verify with:

kubectl describe node <node-name> | grep nvidia
# Should show:
# nvidia.com/gpu: 8

8 GPUs available for scheduling.
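
A quick end-to-end check is to run nvidia-smi inside a pod that requests one GPU. A minimal smoke-test sketch (the CUDA image tag is just an example; any recent CUDA base image works):

# gpu-smoke-test.yaml: request one GPU and print nvidia-smi output
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04   # example tag
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1

kubectl apply -f gpu-smoke-test.yaml
kubectl logs gpu-smoke-test   # should print the familiar nvidia-smi table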


The Network Operator + SR-IOV Operator

Similar pattern. The Network Operator installs OFED + the RDMA device plugin:

helm install network-operator nvidia/network-operator \
  -n nvidia-network-operator --create-namespace \
  --set deployCR=true

Then a NicClusterPolicy CR tells it what to install:

apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
  namespace: nvidia-network-operator
spec:
  ofedDriver:
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: 24.10-0.5.5.0
  rdmaSharedDevicePlugin:
    config: |
      {"configList": [{"resourceName": "rdma_shared_device_a", "rdmaHcaMax": 64, "selectors": {"vendors": ["15b3"]}}]}

The SR-IOV Operator (often installed as part of the Network Operator) is configured with SriovNetworkNodePolicy CRs:

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: rail-0
spec:
  resourceName: rail0_vf
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  numVfs: 8
  nicSelector:
    pfNames: ["enp1s0"]
  isRdma: true

This says: on every SR-IOV-capable node, create 8 VFs from enp1s0 and advertise them as the rail0_vf resource so pods can request them. Repeat with one policy per rail: rail1_vf, and so on. A pod consumes a VF as in the sketch below.
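
To consume a VF, a pod requests the resource and attaches the generated network. A hedged sketch, assuming an SriovNetwork named rail-0-net was created from this policy (that CR is what generates the NetworkAttachmentDefinition) and the operator's default openshift.io/ resource prefix:

apiVersion: v1
kind: Pod
metadata:
  name: rdma-worker
  annotations:
    k8s.v1.cni.cncf.io/networks: rail-0-net   # net-attach-def generated by SriovNetwork
spec:
  containers:
  - name: app
    image: my-training-image:latest           # hypothetical image
    resources:
      requests:
        openshift.io/rail0_vf: "1"            # default resourcePrefix; yours may differ
      limits:
        openshift.io/rail0_vf: "1"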


Reading Operator logs

When something's wrong, Operators tell you in their pod logs.

# What's running
kubectl get pods -n gpu-operator
kubectl get pods -n nvidia-network-operator

# Look at the main Operator pod's logs
kubectl logs -n gpu-operator gpu-operator-77c44...

# Look at a DaemonSet pod's logs (one per node)
kubectl logs -n gpu-operator nvidia-driver-daemonset-abc123

# Watch in real time
kubectl logs -f -n gpu-operator <pod-name>

# Describe the CR (status section is gold)
kubectl describe clusterpolicy cluster-policy   # ClusterPolicy is cluster-scoped; no -n needed

The CR's status section usually has structured info on what the Operator is currently doing — drivers installing, DaemonSets waiting, etc.
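
The exact status fields differ per Operator, but a jsonpath one-liner against the CR is handy in scripts; for the GPU Operator's ClusterPolicy it looks something like:

# Reports "ready" once every component the Operator manages is up
kubectl get clusterpolicy cluster-policy -o jsonpath='{.status.state}'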


Common Operator failures

Symptom | Likely cause
Operator pod in CrashLoopBackOff | Helm values invalid or unsupported k8s version
DaemonSet pod Pending on one node | Node lacks a required label or NFD hasn't run yet
Driver DaemonSet CrashLoopBackOff | Kernel version not supported by the bundled driver; update either kernel or driver
nvidia.com/gpu resource doesn't appear | Device Plugin DaemonSet not running, or driver didn't load
rdma_shared_device_a resource missing | Network Operator's RDMA shared device plugin not running
SR-IOV VFs not appearing | SriovNetworkNodePolicy doesn't match the node, or numVfs exceeds firmware limit

The pattern: when a resource you expect isn't on the node, find the Operator that should've created it and read its logs.
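
In practice that triage is three commands: find the unhealthy pod, read its events, read its logs. A sketch against the GPU Operator namespace:

# 1. Anything not Running, and on which node?
kubectl get pods -n gpu-operator -o wide | grep -v Running

# 2. Why is it stuck? The Events section at the bottom is usually the answer
kubectl describe pod -n gpu-operator <stuck-pod>

# 3. If it started and then crashed, read the previous container's logs
kubectl logs -n gpu-operator <stuck-pod> --previous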


What you should remember

  • Operators manage complex stacks declaratively. Install via Helm; configure via Custom Resources.
  • AI clusters need three load-bearing Operators: GPU Operator, Network Operator, SR-IOV Operator.
  • Helm is the package manager. helm install, helm upgrade, helm list — standard ops.
  • Helm values.yaml is how you customize a chart. Don't hand-edit deployed YAML.
  • kubectl describe <CR> is gold for debugging Operators — the status section tells you what's happening.
  • DaemonSet pods are one-per-node. If one's broken on a specific node, that node has a problem.

You're done with the K8s section. Head to Cluster Build Guide to see all of this come together in a build, or revisit Linux for Network Engineers for the host side that the Operators rely on.