What Is Kubernetes?
You've spent your career on one abstraction: physical or virtual hosts running services, managed by people. Kubernetes is a different abstraction: the cluster runs the services, and people describe what they want.
This page is the orientation for someone who knows networking cold but has never touched k8s.
- Explain what k8s does in three sentences: you write a declarative YAML, the scheduler places containers on matching nodes, and work on dead nodes gets rescheduled automatically.
- Translate the core vocabulary: map `Pod`, `Node`, `Deployment`, `Service`, `Namespace`, `CNI`, and the API server / etcd / scheduler control plane onto fabric, VRF, and SDN-controller analogs you already know.
- Trace a job submission end-to-end: from `kubectl apply` through API server, scheduler, kubelet, container runtime, and the CNI chain (Multus) wiring `eth0` plus `net1..net8` RDMA rails.
- Name why AI clusters break stock k8s: direct hardware access via Device Plugins (`nvidia.com/gpu`), multiple NICs via Multus, batch gang scheduling via Volcano/Kueue.
What Kubernetes actually does
Three sentences:
- You write a declaration ("I want 8 replicas of my training app, each needing 8 GPUs and 8 RDMA NICs") in a YAML file.
- Kubernetes finds servers that match those requirements, starts the containers there, gives them networking, monitors them, restarts them if they crash, and scales them up or down on demand.
- If a server dies, k8s reschedules its work somewhere else automatically.
That's it. The rest is implementation detail.
It's an orchestrator: the same role as VMware vCenter or OpenStack Nova, but for containers instead of VMs, with a far more declarative model ("describe what you want") and self-healing built in.
Why AI clusters use it
You could run an AI training job on bare metal. Many do. Why use k8s on top?
| Reason | What this gets you |
|---|---|
| Multi-tenant | Multiple teams / jobs share one cluster, isolated from each other |
| Resource scheduling | The cluster picks which servers have free GPU + NIC capacity |
| Failure healing | If one server dies mid-job, k8s reschedules its pods elsewhere |
| Declarative ops | One YAML describes "8 nodes, 64 GPUs, 8 NICs each, this image" — and you can re-create exactly that next month |
| Rolling upgrades | Update the driver / image / config one node at a time without downtime |
| The ecosystem | Volcano, Kueue, Ray, KubeFlow — schedulers and frameworks built for AI on k8s |
For a small cluster (under 32 GPUs) or single-team setup, bare metal is fine. Above that, k8s starts paying off.
The core vocabulary, translated
| K8s term | What it is | Closest network analog |
|---|---|---|
| Pod | One or more containers scheduled together as a unit | A "service instance" — the smallest deployable thing |
| Container | A process tree with its own filesystem and resources | A process, isolated by Linux namespaces and cgroups |
| Node | A physical or virtual server in the cluster | A switch or router in your fabric |
| Cluster | A set of nodes coordinated by k8s | A "fabric" of switches |
| Namespace | A logical grouping of resources, and the scope for access control and quotas | A VRF, kind of, but for everything, not just routes |
| Deployment | A spec that says "run N pods of this kind, keep them healthy" | The equivalent of "I always want 4 BGP route reflectors running" |
| Service | A stable virtual IP that load-balances to a set of pods | An anycast VIP behind a load balancer |
| Ingress | External HTTP traffic into the cluster | A north-south load balancer |
| DaemonSet | One copy of a pod runs on every node | Like a config that gets pushed to every switch |
| kubelet | The k8s agent running on every node | The agent that lets the control plane talk to the device |
| API server | The central control plane (etcd-backed) | The fabric controller / SDN controller |
| CNI | The plugin that wires pods into networking | Like the OVS / DPDK driver that wires VMs into the underlay |
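To make a few rows of the table concrete, here is a minimal sketch of how a Namespace, a Deployment, and a Service might fit together in one manifest. The names `team-a` and `demo-app`, the image, and the port are placeholders chosen for illustration; nothing about them is required by Kubernetes.

```yaml
# Namespace: the logical grouping that everything below lives in.
apiVersion: v1
kind: Namespace
metadata:
  name: team-a                      # hypothetical name
---
# Deployment: "run 4 pods of this kind, keep them healthy".
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app                    # hypothetical name
  namespace: team-a
spec:
  replicas: 4
  selector:
    matchLabels:
      app: demo-app                 # must match the pod template labels
  template:
    metadata:
      labels:
        app: demo-app
    spec:
      containers:
        - name: app
          image: registry.example.com/demo-app:1.0   # placeholder image
          ports:
            - containerPort: 8080
---
# Service: a stable virtual IP that load-balances to the pods selected by label.
apiVersion: v1
kind: Service
metadata:
  name: demo-app
  namespace: team-a
spec:
  selector:
    app: demo-app
  ports:
    - port: 80                      # port on the virtual IP
      targetPort: 8080              # port on each pod
```

The Service finds its pods by label, not by name or IP, which is why pods can come and go without the virtual IP changing.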
A worked walkthrough — what happens when you submit a job
You write a YAML like:
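    apiVersion: batch/v1
    kind: Job
    metadata:
      name: train-llama-3
    spec:
      parallelism: 32                  # run up to 32 pods at once
      template:
        spec:
          restartPolicy: Never         # a Job's pods must use Never or OnFailure
          containers:
            - name: trainer            # a container name is required; "trainer" is a placeholder
              image: nvcr.io/nvidia/pytorch:24.10-py3
              command: ["torchrun", "..."]
              resources:
                limits:
                  nvidia.com/gpu: 8    # 8 GPUs per pod, exposed by the device plugin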
Then you apply it with `kubectl apply -f job.yaml`.
What happens:
- The YAML hits the API server, which validates it and stores it in etcd.
- The Job controller sees the new Job and creates 32 pods, each requesting 8 GPUs. The scheduler finds 32 nodes that each have 8 free GPUs and assigns one pod to each.
- On each node, kubelet sees its assigned pod and asks the container runtime (containerd) to start the container.
- The CNI plugin (or chain, via Multus) wires the pod into the network: `eth0` from the default CNI, `net1..net8` from the SR-IOV CNI for the RDMA rails.
- The container starts. Inside, `torchrun` finds its peers via the bootstrap network, opens RDMA QPs, and runs the training.
- kubelet continually reports back to the API server. If a pod dies, the Job controller creates a replacement and the scheduler picks a node for it.
The whole thing is declarative. You said "I want 32 pods of this kind"; k8s makes it happen and keeps it happening.
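The multi-NIC step in the walkthrough can be sketched as follows. This is a minimal illustration, assuming Multus plus the SR-IOV CNI and an SR-IOV device plugin are already installed. The network name `rdma-rail-1`, the resource name `example.com/rdma_rail1`, the pod name, and the subnet are all hypothetical. Only one rail is shown for brevity; an eight-rail pod would list eight attachments and get `net1` through `net8`.

```yaml
# One secondary network, served by the SR-IOV CNI.
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: rdma-rail-1                          # hypothetical name
  annotations:
    # Ties this network to a device-plugin resource (resource name is an assumption).
    k8s.v1.cni.cncf.io/resourceName: example.com/rdma_rail1
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "name": "rdma-rail-1",
      "type": "sriov",
      "ipam": { "type": "host-local", "subnet": "192.0.2.0/24" }
    }
---
apiVersion: v1
kind: Pod
metadata:
  name: train-worker-0                       # hypothetical name
  annotations:
    # Multus reads this and attaches the listed networks as net1, net2, ...
    k8s.v1.cni.cncf.io/networks: rdma-rail-1
spec:
  restartPolicy: Never
  containers:
    - name: trainer
      image: nvcr.io/nvidia/pytorch:24.10-py3
      resources:
        limits:
          nvidia.com/gpu: 8
          example.com/rdma_rail1: 1          # one VF from the rail-1 pool
```

`eth0` still comes from the cluster's default CNI; the annotation only adds interfaces on top of it.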
The bits a network engineer notices
Three things stand out coming from networking:
1. Self-healing is the default
If a node dies, its pods get rescheduled. You don't ssh in and fix it; the cluster does. Coming from "manually configured switches" this feels like magic the first few times. It's also why k8s gets used at scale — fewer human-induced outages.
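How quickly pods leave a dead node is tunable, much like a hold timer. As a sketch: Kubernetes normally gives each pod tolerations for the `not-ready` and `unreachable` node taints with a 300-second timeout, and a pod spec can override that. The 60-second value, the `demo-app` name, and the image below are arbitrary choices for illustration.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app                             # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
    spec:
      tolerations:
        # Stay on a NotReady or unreachable node for at most 60s (default: 300s).
        # After eviction, the Deployment creates a replacement pod elsewhere.
        - key: node.kubernetes.io/not-ready
          operator: Exists
          effect: NoExecute
          tolerationSeconds: 60
        - key: node.kubernetes.io/unreachable
          operator: Exists
          effect: NoExecute
          tolerationSeconds: 60
      containers:
        - name: app
          image: registry.example.com/demo-app:1.0   # placeholder image
```

The timer only starts once the control plane has marked the node unhealthy, so total failover time is detection time plus this value.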
2. Configuration is data
There's no `configure terminal`. Everything is YAML. Everything goes through the API server. Everything's versioned (in git, ideally). When a Deployment change goes wrong, you run `kubectl rollout undo`. Compared to telnetting into switches and trying to remember which config you typed, it's a huge leap.
3. The control plane is something you can lose
k8s has its own control plane (API server, etcd, scheduler, controller-manager). If that fails, no new pods can be scheduled, but existing pods keep running. For AI training jobs that run for weeks, this is usually OK: the control plane matters only for changes, which includes replacing pods that fail while it is down. It is similar to a BGP control-plane failure while the FIB stays programmed: the data plane keeps forwarding.
Where AI clusters break the default
K8s was designed for stateless web workloads. AI training is different:
- Pods need direct hardware access (RDMA NICs, GPUs): solved by the Device Plugin pattern (`nvidia.com/gpu` resource) and SR-IOV CNI for VFs.
- Pods need multiple network interfaces: solved by Multus (default k8s gives exactly one).
- Jobs are batch, not stateless: solved by Volcano, Kueue (k8s native), or just bare `Job` resources.
- GPUs are expensive, so scheduling matters: solved by GPU-aware schedulers and topology hints.
- NCCL bootstrap needs all pods up before any can start: solved by gang scheduling (Volcano).
The good news: every one of these has a solution. The bad news: each is a separate component you have to install and tune. That's why AI clusters look more complex than a stock k8s cluster.
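As an illustration of the gang-scheduling item, here is a minimal sketch of the same training job expressed as a Volcano Job. It assumes Volcano is installed in the cluster; the field names follow Volcano's `batch.volcano.sh/v1alpha1` Job API as I understand it, so check them against the version you run. The job and task names are hypothetical.

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: train-llama-3-gang                   # hypothetical name
spec:
  schedulerName: volcano                     # hand placement to Volcano, not the default scheduler
  minAvailable: 32                           # gang constraint: start only if all 32 pods can be placed
  tasks:
    - name: worker                           # hypothetical task name
      replicas: 32
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: trainer
              image: nvcr.io/nvidia/pytorch:24.10-py3
              command: ["torchrun", "..."]
              resources:
                limits:
                  nvidia.com/gpu: 8
```

With `minAvailable` equal to the replica count, a cluster that can fit only 31 pods starts none of them, which avoids 31 nodes idling on GPUs while NCCL waits for the last peer.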
💡 What you should remember
| # | | Concept | Why it matters |
|---|---|---|---|
| 1 | 🧠 | K8s is an orchestrator. | Describe what you want, and the cluster makes it happen. |
| 2 | 📦 | Pod = smallest deployable unit. | One or more containers sharing a network namespace, scheduled together. |
| 3 | 🎛️ | The control plane is API server + etcd + scheduler + controller-manager. | Lose them and nothing new gets scheduled, but existing pods keep running. |
| 4 | 🧩 | AI workloads need extras: | Multus for multiple NICs, SR-IOV CNI for VF passthrough, GPU/Network Operators for the driver chain, gang scheduling for batch jobs. |
| 5 | 🌐 | Coming from networking: | think of k8s as the fabric controller. Pods are the data plane. Nodes are the switches. CNI is the dataplane driver. |
Next: K8s Networking 101 → — how pods get IPs, how services work, what CNI actually does, and where the multi-NIC gap (and Multus) comes from.