
What Is Kubernetes?

You've spent your career on one abstraction: physical or virtual hosts running services managed by people. Kubernetes is a different one: the cluster runs the services, and the people describe what they want.

This page is the orientation for someone who knows networking cold but has never touched k8s.


What Kubernetes actually does

Three sentences:

  1. You write a declaration ("I want 8 replicas of my training app, each needing 8 GPUs and 8 RDMA NICs") in a YAML file.
  2. Kubernetes finds servers that match those requirements, starts the containers there, gives them networking, monitors them, restarts them if they crash, scales them up or down on demand.
  3. If a server dies, k8s reschedules its work somewhere else automatically.

That's it. The rest is implementation detail.

It's an orchestrator — the same role as VMware vCenter or OpenStack Nova, but for containers instead of VMs, with a much more declarative ("describe what you want") model and with self-healing built in.


Why AI clusters use it

You could run an AI training job on bare metal. Many do. Why use k8s on top?

Reason | What this gets you
------ | ------------------
Multi-tenant | Multiple teams / jobs share one cluster, isolated from each other
Resource scheduling | The cluster picks which servers have free GPU + NIC capacity
Failure healing | If one server dies mid-job, k8s reschedules its pods elsewhere
Declarative ops | One YAML describes "8 nodes, 64 GPUs, 8 NICs each, this image" — and you can re-create exactly that next month
Rolling upgrades | Update the driver / image / config one node at a time without downtime
The ecosystem | Volcano, Kueue, Ray, Kubeflow — schedulers and frameworks built for AI on k8s

For a small cluster (under 32 GPUs) or single-team setup, bare metal is fine. Above that, k8s starts paying off.


The core vocabulary, translated

K8s term | What it is | Closest network analog
-------- | ---------- | ----------------------
Pod | One or more containers scheduled together as a unit | A "service instance" — the smallest deployable thing
Container | A process tree with its own filesystem and resources | A process, isolated by Linux namespaces and cgroups
Node | A physical or virtual server in the cluster | A switch or router in your fabric
Cluster | A set of nodes coordinated by k8s | A "fabric" of switches
Namespace | A logical grouping of resources (and a security boundary) | A VRF — kind of, but for everything, not just routes
Deployment | A spec that says "run N pods of this kind, keep them healthy" | The equivalent of "I always want 4 BGP route reflectors running"
Service | A stable virtual IP that load-balances to a set of pods | An anycast VIP behind a load balancer
Ingress | External HTTP traffic into the cluster | A north-south load balancer
DaemonSet | One copy of a pod runs on every node | Like a config that gets pushed to every switch
kubelet | The k8s agent running on every node | The agent that lets the control plane talk to the device
API server | The central control plane (etcd-backed) | The fabric controller / SDN controller
CNI | The plugin that wires pods into networking | Like the OVS / DPDK driver that wires VMs into the underlay
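To make the Deployment and Service rows concrete, here is a minimal sketch (the name rr-bgp, the image, and the port are illustrative, not from any real config): a Deployment that keeps 4 route-reflector-style pods running, and a Service that puts one stable virtual IP in front of them.

```yaml
# Deployment: "I always want 4 of these pods running."
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rr-bgp
spec:
  replicas: 4
  selector:
    matchLabels:
      app: rr-bgp
  template:
    metadata:
      labels:
        app: rr-bgp
    spec:
      containers:
        - name: bgp
          image: frrouting/frr:latest   # illustrative image
          ports:
            - containerPort: 179
---
# Service: one stable VIP load-balancing across whichever 4 pods
# currently match the app: rr-bgp label.
apiVersion: v1
kind: Service
metadata:
  name: rr-bgp
spec:
  selector:
    app: rr-bgp
  ports:
    - port: 179
      targetPort: 179
```

Note that the Service tracks pods by label, not by address — when a pod is rescheduled and gets a new IP, the VIP keeps working without any reconfiguration.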

A worked walkthrough — what happens when you submit a job

You write a YAML like:

apiVersion: batch/v1
kind: Job
metadata:
  name: train-llama-3
spec:
  parallelism: 32
  template:
    spec:
      restartPolicy: Never   # required for Job pod templates
      containers:
        - name: trainer
          image: nvcr.io/nvidia/pytorch:24.10-py3
          command: ["torchrun", "..."]
          resources:
            limits:
              nvidia.com/gpu: 8

And you apply it: kubectl apply -f job.yaml.

What happens:

  1. The YAML hits the API server which validates it and stores it in etcd.
  2. The scheduler sees a new Job that wants 32 pods, each requiring 8 GPUs. It finds 32 nodes that have 8 free GPUs and assigns one pod to each.
  3. On each node, kubelet sees its assigned pod, asks the container runtime (containerd) to start the container.
  4. The CNI plugin (or chain, via Multus) wires the pod into the network — eth0 from the default CNI, net1..net8 from the SR-IOV CNI for the RDMA rails.
  5. The container starts. Inside, torchrun finds its peers via the bootstrap network, opens RDMA QPs, runs the training.
  6. kubelet continually reports status back to the API server. If a pod dies, the Job controller creates a replacement and the scheduler places it on another node.

The whole thing is declarative. You said "I want 32 pods of this kind"; k8s makes it happen and keeps it happening.
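Step 4 above can be sketched as well. Assuming Multus and the SR-IOV CNI are installed, each extra interface comes from a NetworkAttachmentDefinition that the pod references by annotation. The names rdma-rail1 and nvidia.com/rdma_rail1, and the IPAM range, are hypothetical placeholders:

```yaml
# One attachment per RDMA rail; the SR-IOV device plugin exposes the VF pool
# named in the resourceName annotation.
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: rdma-rail1
  annotations:
    k8s.v1.cni.cncf.io/resourceName: nvidia.com/rdma_rail1   # hypothetical VF pool
spec:
  config: '{ "cniVersion": "0.3.1", "type": "sriov",
             "ipam": { "type": "whereabouts", "range": "192.168.1.0/24" } }'
---
# The pod requests the extra interface by annotation; Multus adds net1
# alongside the default eth0 from the primary CNI.
apiVersion: v1
kind: Pod
metadata:
  name: trainer-0
  annotations:
    k8s.v1.cni.cncf.io/networks: rdma-rail1
spec:
  containers:
    - name: trainer
      image: nvcr.io/nvidia/pytorch:24.10-py3
      resources:
        limits:
          nvidia.com/gpu: 8
          nvidia.com/rdma_rail1: 1   # hypothetical VF resource, one per rail
```

A real 8-rail node would repeat this pattern: eight attachments, eight VF resources, and net1..net8 inside the pod.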


The bits a network engineer notices

Three things stand out coming from networking:

1. Self-healing is the default

If a node dies, its pods get rescheduled. You don't ssh in and fix it; the cluster does. Coming from "manually configured switches," this feels like magic the first few times. It's also why k8s gets used at scale — fewer human-induced outages.

2. Configuration is data

There's no configure terminal. Everything is YAML. Everything goes through the API server. Everything's versioned (in git, ideally). When a config change goes wrong, you run kubectl rollout undo. Compared to telnet'ing into switches and remembering which commands you typed, it's a huge leap.

3. The control plane is something you can lose

k8s has its own control plane (API server, etcd, scheduler, controller-manager). If that fails, no new pods can be scheduled — but existing pods keep running. For AI training jobs that run for weeks, this is OK; the control plane only matters for changes. Similar to a BGP control-plane failure with the FIB still programmed: no new routes converge, but the data plane keeps forwarding.


Where AI clusters break the default

K8s was designed for stateless web workloads. AI training is different:

  • Pods need direct hardware access (RDMA NICs, GPUs) — solved by the Device Plugin pattern (nvidia.com/gpu resource) and SR-IOV CNI for VFs.
  • Pods need multiple network interfaces — solved by Multus (stock k8s gives each pod exactly one).
  • Jobs are batch, not stateless — solved by Volcano, Kueue (k8s native), or just bare Job resources.
  • GPUs are expensive; scheduling matters — solved by GPU-aware schedulers and topology hints.
  • NCCL bootstrap needs all pods up before any can start — solved by gang scheduling (Volcano).

The good news: every one of these has a solution. The bad news: each is a separate component you have to install and tune. That's why AI clusters look more complex than a stock k8s cluster.
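Gang scheduling, the last bullet above, can be sketched with Volcano's Job resource (batch.volcano.sh/v1alpha1; the image and sizes are illustrative): minAvailable tells the scheduler not to start any pod until all 32 can be placed, so NCCL bootstrap never deadlocks waiting on pods that will never fit.

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: train-llama-3
spec:
  schedulerName: volcano
  minAvailable: 32          # gang: all-or-nothing placement of the 32 workers
  tasks:
    - name: worker
      replicas: 32
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: trainer
              image: nvcr.io/nvidia/pytorch:24.10-py3
              resources:
                limits:
                  nvidia.com/gpu: 8
```

Without gang scheduling, a stock Job could start 30 of 32 pods, hold their GPUs, and wait forever on the last two — the all-or-nothing semantics are what make batch training safe on a shared cluster.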


What you should remember

  • K8s is an orchestrator — describe what you want, the cluster makes it happen.
  • Pod = smallest deployable unit. One or more containers sharing a network namespace, scheduled together.
  • The control plane is API server + etcd + scheduler. Lose them, no new schedules — but existing pods keep running.
  • AI workloads need extras: Multus for multiple NICs, SR-IOV CNI for VF passthrough, GPU/Network Operators for the driver chain, gang scheduling for batch jobs.
  • Coming from networking: think of k8s as the fabric controller. Pods are the data plane. Nodes are the switches. CNI is the dataplane driver.

Next: K8s Networking 101 → — how pods get IPs, how services work, what CNI actually does, and where the multi-NIC gap (and Multus) comes from.