What Is Kubernetes?
You've spent your career on one abstraction: physical or virtual hosts running services, managed by people. Kubernetes is a different abstraction: the cluster runs the services, and the people describe what they want.
This page is an orientation for someone who knows networking cold but has never touched k8s.
What Kubernetes actually does
Three sentences:
- You write a declaration ("I want 8 replicas of my training app, each needing 8 GPUs and 8 RDMA NICs") in a YAML file.
- Kubernetes finds servers that match those requirements, starts the containers there, gives them networking, monitors them, restarts them if they crash, scales them up or down on demand.
- If a server dies, k8s reschedules its work somewhere else automatically.
That's it. The rest is implementation detail.
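For concreteness, the declaration in the first bullet might look something like the sketch below: a Deployment asking for 8 replicas, each with 8 GPUs and 8 RDMA NICs. The image name and the `rdma/hca` resource name are illustrative; the actual RDMA resource name depends on which device plugin the cluster runs.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: training-app
spec:
  replicas: 8                  # "I want 8 copies of my training app"
  selector:
    matchLabels:
      app: training-app
  template:
    metadata:
      labels:
        app: training-app
    spec:
      containers:
      - name: trainer
        image: registry.example.com/training-app:latest   # illustrative image
        resources:
          limits:
            nvidia.com/gpu: 8   # 8 GPUs per replica (NVIDIA device plugin)
            rdma/hca: 8         # 8 RDMA NICs per replica (name depends on the RDMA device plugin)
```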
It's an orchestrator — the same role as VMware vCenter or OpenStack Nova, but for containers instead of VMs, and with a much more declarative ("describe what you want") model and self-healing built in.
Why AI clusters use it
You could run an AI training job on bare metal. Many do. Why use k8s on top?
| Reason | What this gets you |
|---|---|
| Multi-tenant | Multiple teams / jobs share one cluster, isolated from each other |
| Resource scheduling | The cluster picks which servers have free GPU + NIC capacity |
| Failure healing | If one server dies mid-job, k8s reschedules its pods elsewhere |
| Declarative ops | One YAML describes "8 nodes, 64 GPUs, 8 NICs each, this image" — and you can re-create exactly that next month |
| Rolling upgrades | Update the driver / image / config one node at a time without downtime |
| The ecosystem | Volcano, Kueue, Ray, Kubeflow — schedulers and frameworks built for AI on k8s |
For a small cluster (under 32 GPUs) or single-team setup, bare metal is fine. Above that, k8s starts paying off.
The core vocabulary, translated
| K8s term | What it is | Closest network analog |
|---|---|---|
| Pod | One or more containers scheduled together as a unit | A "service instance" — the smallest deployable thing |
| Container | A process tree with its own filesystem and resources | A process, isolated by Linux namespaces and cgroups |
| Node | A physical or virtual server in the cluster | A switch or router in your fabric |
| Cluster | A set of nodes coordinated by k8s | A "fabric" of switches |
| Namespace | A logical grouping of resources (and a security boundary) | A VRF — kind of, but for everything, not just routes |
| Deployment | A spec that says "run N pods of this kind, keep them healthy" | The equivalent of "I always want 4 BGP route reflectors running" |
| Service | A stable virtual IP that load-balances to a set of pods | An anycast VIP behind a load balancer |
| Ingress | External HTTP traffic into the cluster | A north-south load balancer |
| DaemonSet | One copy of a pod runs on every node | Like a config that gets pushed to every switch |
| kubelet | The k8s agent running on every node | The agent that lets the control plane talk to the device |
| API server | The central control plane (etcd-backed) | The fabric controller / SDN controller |
| CNI | The plugin that wires pods into networking | Like the OVS / DPDK driver that wires VMs into the underlay |
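To make the mapping concrete, these are the everyday kubectl commands that list those objects (the namespace and node names below are placeholders):

```bash
# The nodes: the "switches" in your fabric
kubectl get nodes -o wide

# The pods: the running "service instances", across all namespaces
kubectl get pods --all-namespaces

# Deployments, Services and DaemonSets in one team's namespace (placeholder name)
kubectl get deployments,services,daemonsets -n ml-team

# What a node advertises, including extended resources like nvidia.com/gpu
kubectl describe node gpu-node-01
```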
A worked walkthrough — what happens when you submit a job
You write a YAML like:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: train-llama-3
spec:
  parallelism: 32
  template:
    spec:
      restartPolicy: Never        # required for a Job's pod template
      containers:
      - name: trainer
        image: nvcr.io/nvidia/pytorch:24.10-py3
        command: ["torchrun", "..."]
        resources:
          limits:
            nvidia.com/gpu: 8
```
And you apply it: `kubectl apply -f job.yaml`.
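In practice you submit it and watch the pods come up with something like this (the `job-name` label is added by the Job controller; the pod name at the end is a placeholder):

```bash
kubectl apply -f job.yaml

# Watch the 32 pods get scheduled and start
kubectl get pods -l job-name=train-llama-3 -o wide -w

# If a pod is stuck in Pending, ask why the scheduler couldn't place it
kubectl describe pod train-llama-3-<pod-suffix>
```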
What happens:
- The YAML hits the API server, which validates it and stores it in etcd.
- The scheduler sees a new Job that wants 32 pods, each requiring 8 GPUs. It finds 32 nodes that have 8 free GPUs and assigns one pod to each.
- On each node, kubelet sees its assigned pod, asks the container runtime (containerd) to start the container.
- The CNI plugin (or chain, via Multus) wires the pod into the network — `eth0` from the default CNI, `net1`..`net8` from the SR-IOV CNI for the RDMA rails (there's a sketch of the annotation that requests this after the walkthrough).
- The container starts. Inside, `torchrun` finds its peers via the bootstrap network, opens RDMA QPs, and runs the training.
- kubelet continually reports back to the API server. If a pod dies, the scheduler picks a new node.
The whole thing is declarative. You said "I want 32 pods of this kind"; k8s makes it happen and keeps it happening.
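For the CNI step, the extra interfaces are requested per pod through a Multus annotation. A minimal sketch, assuming the cluster already defines SR-IOV NetworkAttachmentDefinitions named rdma-rail-1 through rdma-rail-8 (the attachment and resource names are illustrative; they come from how the SR-IOV CNI and device plugin were configured):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: train-llama-3-worker-0
  annotations:
    # Multus attaches net1..net8 from these definitions; eth0 still comes from the default CNI
    k8s.v1.cni.cncf.io/networks: rdma-rail-1,rdma-rail-2,rdma-rail-3,rdma-rail-4,rdma-rail-5,rdma-rail-6,rdma-rail-7,rdma-rail-8
spec:
  containers:
  - name: trainer
    image: nvcr.io/nvidia/pytorch:24.10-py3
    resources:
      limits:
        nvidia.com/gpu: 8
        nvidia.com/rdma_rail: 8   # one VF per rail; resource name is set by the SR-IOV device plugin config (some clusters use one pool per rail instead)
```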
The bits a network engineer notices
Three things stand out coming from networking:
1. Self-healing is the default
If a node dies, its pods get rescheduled. You don't ssh in and fix it; the cluster does. Coming from "manually configured switches" this feels like magic the first few times. It's also why k8s gets used at scale — fewer human-induced outages.
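One way to watch this on a test cluster is to drain a node, which evicts its pods roughly the way a failure would (the node name is a placeholder):

```bash
# Evict everything from a node, as if it had died
kubectl drain gpu-node-07 --ignore-daemonsets --delete-emptydir-data

# Watch the evicted pods get rescheduled onto other nodes
kubectl get pods -o wide -w

# Put the node back into service afterwards
kubectl uncordon gpu-node-07
```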
2. Configuration is data
There's no `configure terminal`. Everything is YAML. Everything goes through the API server. Everything's versioned (in git, ideally). When a config change goes wrong, you `kubectl rollout undo`. Compared to telnetting into switches and trying to remember which config you typed, it's a huge leap.
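For example, a bad image change on a Deployment can be rolled back in one command (the deployment, container, and image names are placeholders):

```bash
# Push a new image, watch the rollout, roll back if it misbehaves
kubectl set image deployment/inference-gateway server=registry.example.com/gateway:v2
kubectl rollout status deployment/inference-gateway
kubectl rollout history deployment/inference-gateway
kubectl rollout undo deployment/inference-gateway
```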
3. The control plane is something you can lose
k8s has its own control plane (API server, etcd, scheduler, controller-manager). If that fails, no new pods can be scheduled — but existing pods keep running. For AI training jobs that run for weeks, this is OK; the control plane only matters for changes. Similar to a BGP control-plane failure where the FIB stays programmed and convergence is slow — the data plane keeps forwarding.
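On a kubeadm-style cluster the control-plane components run as static pods, so you can check on them like any other workload (the label and endpoint below assume kubeadm defaults):

```bash
# The control-plane pods themselves (kubeadm labels them tier=control-plane)
kubectl get pods -n kube-system -l tier=control-plane -o wide

# Ask the API server directly whether it considers itself ready
kubectl get --raw '/readyz?verbose'
```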
Where AI clusters break the default
K8s was designed for stateless web workloads. AI training is different:
- Pods need direct hardware access (RDMA NICs, GPUs) — solved by the Device Plugin pattern (the `nvidia.com/gpu` resource) and the SR-IOV CNI for VFs.
- Pods need multiple network interfaces — solved by Multus (default k8s gives exactly one).
- Jobs are batch, not stateless — solved by Volcano, Kueue (k8s native), or just bare `Job` resources.
- GPUs are expensive; scheduling matters — solved by GPU-aware schedulers and topology hints.
- NCCL bootstrap needs all pods up before any can start — solved by gang scheduling (Volcano); see the sketch below.
The good news: every one of these has a solution. The bad news: each is a separate component you have to install and tune. That's why AI clusters look more complex than a stock k8s cluster.
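As a sketch of the gang-scheduling piece, here is the Job from the walkthrough re-expressed as a Volcano Job, assuming Volcano is installed in the cluster (the image and names carry over from the example above):

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: train-llama-3
spec:
  schedulerName: volcano
  minAvailable: 32            # gang scheduling: start nothing until all 32 pods can be placed
  tasks:
  - name: worker
    replicas: 32
    template:
      spec:
        restartPolicy: OnFailure
        containers:
        - name: trainer
          image: nvcr.io/nvidia/pytorch:24.10-py3
          resources:
            limits:
              nvidia.com/gpu: 8
```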
What you should remember
- K8s is an orchestrator — describe what you want, the cluster makes it happen.
- Pod = smallest deployable unit: one or more containers sharing a network namespace, scheduled together.
- The control plane is API server + etcd + scheduler. Lose them, no new schedules — but existing pods keep running.
- AI workloads need extras: Multus for multiple NICs, SR-IOV CNI for VF passthrough, GPU/Network Operators for the driver chain, gang scheduling for batch jobs.
- Coming from networking: think of k8s as the fabric controller. Pods are the data plane. Nodes are the switches. CNI is the data-plane driver.
Next: K8s Networking 101 → — how pods get IPs, how services work, what CNI actually does, and where the multi-NIC gap (and Multus) comes from.