
Operators and Helm

You've installed Kubernetes. You've installed Multus and the SR-IOV CNI. But you don't yet have a working AI cluster, because the GPU drivers, the OFED driver, the device plugins, topology discovery, and the SR-IOV inventory aren't set up.

That's the job of Operators. This page covers what they are, which ones you'll install, and how to deploy them with Helm.


What an Operator is

A Kubernetes Operator is a controller that manages a complex software stack as if it were a single k8s resource.

Without an Operator, installing the NVIDIA driver on every node looks like:

  1. SSH into every node
  2. dpkg -i nvidia-driver-*.deb
  3. Reboot
  4. Hope nothing breaks
  5. Repeat in 6 months when you upgrade

With the NVIDIA GPU Operator, it looks like:

  1. helm install gpu-operator nvidia/gpu-operator
  2. The Operator deploys a DaemonSet to every node
  3. The DaemonSet installs the driver, the container runtime hook, the Device Plugin, DCGM exporter
  4. The Operator monitors them; if a node's driver gets out of sync, it fixes it
  5. Upgrading is a one-line Helm change

The Operator pattern is declarative — you describe what state you want, the Operator makes it true and keeps it true.

Internally, an Operator is:

  • A Custom Resource Definition (CRD) that defines a new k8s resource type (e.g., ClusterPolicy)
  • A Controller (a pod that runs continuously) that watches that resource and acts on it
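
You can see this pattern directly on a live cluster. Once the GPU Operator (installed below) is running, its CRD and its custom resources are visible through plain kubectl:

# The CRD shows up alongside the built-in API types
kubectl get crds | grep nvidia.com

# ...and instances of the new type are queryable like any built-in kind
kubectl get clusterpolicies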

Operators are how complex software (databases, ML pipelines, drivers) gets packaged for k8s. There are hundreds of them; AI clusters use a specific handful.


The Operators you'll meet on an AI cluster

Operator | What it manages | When to install
NVIDIA GPU Operator | NVIDIA driver, container toolkit, Device Plugin, DCGM exporter, Node Feature Discovery | Every cluster with GPUs
NVIDIA Network Operator | Mellanox OFED driver, RDMA shared device plugin, IB-K8s, kernel modules | Every cluster with RDMA NICs
SR-IOV Network Operator | VF inventory, NetworkAttachmentDefinitions, SR-IOV CNI | Every cluster doing SR-IOV (= every AI cluster)
Node Feature Discovery (NFD) | Labels nodes with hardware features (CPU model, NUMA topology, NIC count) | Always; comes with GPU Operator
Volcano | Batch job scheduler with gang scheduling | If you're running multi-node training
Prometheus Operator | Telemetry (metrics) | For monitoring
Grafana Operator | Dashboards | For visualizing telemetry

The first three are the load-bearing ones for AI clusters. Without them, you'd be managing drivers and CNI configs by hand on hundreds of nodes — error-prone, slow, and a source of constant production debugging.


Helm — the package manager

Helm is the package manager for k8s. A Helm chart is a templatized bundle of k8s YAML — Operators, Deployments, Services, ConfigMaps, all parameterized.
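
Before installing anything, you can render a chart locally to see exactly which k8s objects it would create; helm template does this without touching the cluster (this assumes the nvidia repo added in the next section):

# Render the chart's full YAML locally, without installing
helm template gpu-operator nvidia/gpu-operator -n gpu-operator | less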

Installing Helm

# (uses a keyring file; apt-key is deprecated on current Debian/Ubuntu)
curl -fsSL https://baltocdn.com/helm/signing.asc | gpg --dearmor | sudo tee /usr/share/keyrings/helm.gpg > /dev/null
echo "deb [signed-by=/usr/share/keyrings/helm.gpg] https://baltocdn.com/helm/stable/debian/ all main" | sudo tee /etc/apt/sources.list.d/helm-stable-debian.list
sudo apt update && sudo apt install helm

(Or brew install helm on macOS / a release binary on RHEL.)

Using Helm

# Add a repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install a chart with default values
helm install gpu-operator nvidia/gpu-operator -n gpu-operator --create-namespace

# Install with custom values (production)
helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --version v24.9.1 \
  -f my-values.yaml

# List what's installed
helm list -A

# Upgrade
helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator --version v24.10.0

# Uninstall
helm uninstall gpu-operator -n gpu-operator

The my-values.yaml file overrides chart defaults — driver version, image registry, feature toggles, etc. Every chart publishes its values.yaml reference so you can see what's configurable.
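
To see what those defaults actually are, dump them with helm show values; that output is the full menu of what my-values.yaml can override:

# Dump the chart's default values; anything here can be overridden with -f
helm show values nvidia/gpu-operator > gpu-operator-defaults.yaml
less gpu-operator-defaults.yaml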


A concrete walkthrough: installing the GPU Operator

Step by step:

# 1. Add the NVIDIA Helm repo
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# 2. Create a values file with what you want to customize
cat > gpu-operator-values.yaml <<EOF
driver:
  enabled: true
  version: "550.90.07"
toolkit:
  enabled: true
devicePlugin:
  enabled: true
dcgmExporter:
  enabled: true
nfd:
  enabled: true
mig:
  strategy: none
EOF

# 3. Install
helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  -f gpu-operator-values.yaml

# 4. Wait for everything to come up
kubectl get pods -n gpu-operator -w

Within 5-10 minutes you should see a long list of pods all in Running state:

gpu-operator-77c44... Running
nvidia-driver-daemonset-... Running (one per node with GPUs)
nvidia-container-toolkit-daemonset-... Running (one per node)
nvidia-device-plugin-daemonset-... Running (one per node)
nvidia-dcgm-exporter-... Running (one per node)
gpu-feature-discovery-... Running (one per node)
node-feature-discovery-... Running

If any are CrashLoopBackOff, that's your debug target — kubectl logs <pod> will tell you why.

After the Operator finishes, every node has:

  • NVIDIA driver loaded (installed by the DaemonSet)
  • Container runtime configured to pass through GPUs
  • The nvidia.com/gpu resource registered, so pods can request GPUs

Verify with:

kubectl describe node <node-name> | grep nvidia
# Should show:
# nvidia.com/gpu: 8

8 GPUs available for scheduling.
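
A quick end-to-end check is to run nvidia-smi inside a pod that requests one GPU. A minimal smoke-test sketch (the CUDA image tag is just an example; any recent CUDA base image works):

# gpu-smoke-test.yaml: request one GPU and print nvidia-smi output
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04   # example tag
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1

kubectl apply -f gpu-smoke-test.yaml
kubectl logs gpu-smoke-test   # should print the familiar nvidia-smi table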


The Network Operator + SR-IOV Operator

Similar pattern. The Network Operator installs OFED + the RDMA device plugin:

helm install network-operator nvidia/network-operator \
  -n nvidia-network-operator --create-namespace \
  --set deployCR=true

Then a NicClusterPolicy CR tells it what to install:

apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
  namespace: nvidia-network-operator
spec:
  ofedDriver:
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: 24.10-0.5.5.0
  rdmaSharedDevicePlugin:
    config: |
      {"configList": [{"resourceName": "rdma_shared_device_a", "rdmaHcaMax": 64, "selectors": {"vendors": ["15b3"]}}]}

The SR-IOV Operator (often installed as part of the Network Operator) is configured with SriovNetworkNodePolicy CRs:

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: rail-0
spec:
  resourceName: rail0_vf
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  numVfs: 8
  nicSelector:
    pfNames: ["enp1s0"]
  isRdma: true

This says: on every SR-IOV-capable node, create 8 VFs from enp1s0 and advertise them as the rail0_vf resource so pods can request them. Repeat with one policy per rail: rail1_vf, and so on. A pod consumes a VF as in the sketch below.
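
To consume a VF, a pod requests the resource and attaches the generated network. A hedged sketch, assuming an SriovNetwork named rail-0-net was created from this policy (that CR is what generates the NetworkAttachmentDefinition) and the operator's default openshift.io/ resource prefix:

apiVersion: v1
kind: Pod
metadata:
  name: rdma-worker
  annotations:
    k8s.v1.cni.cncf.io/networks: rail-0-net   # net-attach-def generated by SriovNetwork
spec:
  containers:
  - name: app
    image: my-training-image:latest           # hypothetical image
    resources:
      requests:
        openshift.io/rail0_vf: "1"            # default resourcePrefix; yours may differ
      limits:
        openshift.io/rail0_vf: "1"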


Reading Operator logs

When something's wrong, Operators tell you in their pod logs.

# What's running
kubectl get pods -n gpu-operator
kubectl get pods -n nvidia-network-operator

# Look at the main Operator pod's logs
kubectl logs -n gpu-operator gpu-operator-77c44...

# Look at a DaemonSet pod's logs (one per node)
kubectl logs -n gpu-operator nvidia-driver-daemonset-abc123

# Watch in real time
kubectl logs -f -n gpu-operator <pod-name>

# Describe the CR (status section is gold)
kubectl describe clusterpolicy cluster-policy   # ClusterPolicy is cluster-scoped; no -n needed

The CR's status section usually has structured info on what the Operator is currently doing — drivers installing, DaemonSets waiting, etc.
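
The exact status fields differ per Operator, but a jsonpath one-liner against the CR is handy in scripts; for the GPU Operator's ClusterPolicy it looks something like:

# Reports "ready" once every component the Operator manages is up
kubectl get clusterpolicy cluster-policy -o jsonpath='{.status.state}'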


Common Operator failures

Symptom | Likely cause
Operator pod in CrashLoopBackOff | Helm values invalid or unsupported k8s version
DaemonSet pod Pending on one node | Node lacks a required label or NFD hasn't run yet
Driver DaemonSet CrashLoopBackOff | Kernel version not supported by the bundled driver; update either kernel or driver
nvidia.com/gpu resource doesn't appear | Device Plugin DaemonSet not running, or driver didn't load
rdma_shared_device_a resource missing | Network Operator's RDMA shared device plugin not running
SR-IOV VFs not appearing | SriovNetworkNodePolicy doesn't match the node, or numVfs exceeds firmware limit

The pattern: when a resource you expect isn't on the node, find the Operator that should've created it and read its logs.
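
In practice that triage is three commands: find the unhealthy pod, read its events, read its logs. A sketch against the GPU Operator namespace:

# 1. Anything not Running, and on which node?
kubectl get pods -n gpu-operator -o wide | grep -v Running

# 2. Why is it stuck? The Events section at the bottom is usually the answer
kubectl describe pod -n gpu-operator <stuck-pod>

# 3. If it started and then crashed, read the previous container's logs
kubectl logs -n gpu-operator <stuck-pod> --previous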


What you should remember

  • Operators manage complex stacks declaratively. Install via Helm; configure via Custom Resources.
  • AI clusters need three load-bearing Operators: GPU Operator, Network Operator, SR-IOV Operator.
  • Helm is the package manager. helm install, helm upgrade, helm list — standard ops.
  • Helm values.yaml is how you customize a chart. Don't hand-edit deployed YAML.
  • kubectl describe <CR> is gold for debugging Operators — the status section tells you what's happening.
  • DaemonSet pods are one-per-node. If one's broken on a specific node, that node has a problem.

You're done with the K8s section. Head to Cluster Build Guide to see all of this come together in a build, or revisit Linux for Network Engineers for the host side that the Operators rely on.