What Is Kubernetes?

You've spent your career on one abstraction: physical or virtual hosts running services, managed by people. Kubernetes is a different abstraction: the cluster runs the services, and people describe what they want.

This page is the orientation for someone who knows networking cold but has never touched k8s.

What Kubernetes actually does

Three sentences:

You write a declaration ("I want 8 replicas of my training app, each needing 8 GPUs and 8 RDMA NICs") in a YAML file.
Kubernetes finds servers that match those requirements, starts the containers there, gives them networking, monitors them, restarts them if they crash, and scales them up or down on demand.
If a server dies, k8s reschedules its work somewhere else automatically.

That's it. The rest is implementation detail.

It's an orchestrator: the same role as VMware vCenter or OpenStack Nova, but for containers instead of VMs, with a far more declarative ("describe what you want") model and self-healing built in.
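The "describe what you want" idea can be sketched as a loop that compares desired state to actual state and closes the gap. This is a toy model, not Kubernetes code: ToyCluster, its fields, and the replica names are all made up for illustration.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class ToyCluster:
    """Toy stand-in for a cluster: a desired count and a set of running replicas."""
    desired_replicas: int
    running: set = field(default_factory=set)
    _ids: count = field(default_factory=count, repr=False)

    def reconcile(self) -> None:
        """One pass of the control loop: make actual state match desired state."""
        while len(self.running) < self.desired_replicas:
            self.running.add(f"replica-{next(self._ids)}")  # start a missing replica
        while len(self.running) > self.desired_replicas:
            self.running.pop()  # stop a surplus replica


cluster = ToyCluster(desired_replicas=8)
cluster.reconcile()
print(len(cluster.running))  # 8

# A replica crashes; the next pass replaces it without anyone asking.
cluster.running.discard("replica-3")
cluster.reconcile()
print(len(cluster.running))  # 8

# Scaling is just editing the declaration.
cluster.desired_replicas = 4
cluster.reconcile()
print(len(cluster.running))  # 4
```

Nobody issues a "start" or "stop" command here. The only input is the desired count, and the loop does the rest.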

Why AI clusters use it

You could run an AI training job on bare metal. Many do. Why use k8s on top?

Reason	What this gets you
Multi-tenant	Multiple teams / jobs share one cluster, isolated from each other
Resource scheduling	The cluster picks which servers have free GPU + NIC capacity
Failure healing	If one server dies mid-job, k8s reschedules its pods elsewhere
Declarative ops	One YAML describes "8 nodes, 64 GPUs, 8 NICs each, this image" — and you can re-create exactly that next month
Rolling upgrades	Update the driver / image / config one node at a time instead of taking the whole cluster down
The ecosystem	Volcano, Kueue, Ray, Kubeflow: schedulers and frameworks built for AI on k8s

For a small cluster (under 32 GPUs) or single-team setup, bare metal is fine. Above that, k8s starts paying off.

The core vocabulary, translated

K8s term	What it is	Closest network analog
Pod	One or more containers scheduled together as a unit	A "service instance" — the smallest deployable thing
Container	A process tree with its own filesystem and resources	A process, isolated by Linux namespaces and cgroups
Node	A physical or virtual server in the cluster	A switch or router in your fabric
Cluster	A set of nodes coordinated by k8s	A "fabric" of switches
Namespace	A logical grouping of resources (and a scope for access control and quotas)	Roughly a VRF, but for everything, not just routes
Deployment	A spec that says "run N pods of this kind, keep them healthy"	The equivalent of "I always want 4 BGP route reflectors running"
Service	A stable virtual IP that load-balances to a set of pods	An anycast VIP behind a load balancer
Ingress	Rules for routing external HTTP(S) traffic into the cluster	A north-south load balancer
DaemonSet	One copy of a pod runs on every node	Like a config that gets pushed to every switch
kubelet	The k8s agent running on every node	The agent that lets the control plane talk to the device
API server	The central control plane (etcd-backed)	The fabric controller / SDN controller
CNI	The plugin that wires pods into networking	Like the OVS / DPDK driver that wires VMs into the underlay
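To make the Service row concrete, here is a minimal sketch of how a label selector picks the backends behind one stable address. The pod names, IPs, and labels are invented, and the round-robin rotation is a simplification chosen for this example; real load-balancing behaviour depends on how the cluster implements Services.

```python
from itertools import cycle

# Hypothetical pods; only the ones labelled app=web should back the Service.
pods = [
    {"name": "web-a", "ip": "10.0.1.4", "labels": {"app": "web"}},
    {"name": "web-b", "ip": "10.0.2.7", "labels": {"app": "web"}},
    {"name": "db-a", "ip": "10.0.3.9", "labels": {"app": "db"}},
]


def endpoints(selector, pods):
    """Return IPs of pods whose labels contain every key/value in the selector."""
    return [
        p["ip"]
        for p in pods
        if all(p["labels"].get(k) == v for k, v in selector.items())
    ]


backends = endpoints({"app": "web"}, pods)
print(backends)  # ['10.0.1.4', '10.0.2.7']

# Clients keep using one stable address; the backend behind it rotates.
rr = cycle(backends)
print([next(rr) for _ in range(3)])  # ['10.0.1.4', '10.0.2.7', '10.0.1.4']
```

When a pod is replaced and comes back with a new IP, only the endpoint list changes. The address clients use stays the same.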

A worked walkthrough — what happens when you submit a job

You write a YAML like:

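apiVersion: batch/v1
kind: Job
metadata:
  name: train-llama-3
spec:
  parallelism: 32            # run 32 pods at once
  completions: 32            # the Job is done when all 32 succeed
  template:
    spec:
      restartPolicy: Never   # required for Jobs: Never or OnFailure
      containers:
      - name: trainer        # a name is required; "trainer" is a placeholder
        image: nvcr.io/nvidia/pytorch:24.10-py3
        command: ["torchrun", "..."]
        resources:
          limits:
            nvidia.com/gpu: 8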

And you apply it: kubectl apply -f job.yaml.
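It helps to know how much hardware a manifest like this asks for before submitting it. The sketch below mirrors the relevant fields in a plain Python dict (no YAML parser needed) and multiplies GPUs per pod by the pod count. The total_gpus helper is hypothetical, written only for this example.

```python
# The fields of the Job manifest that matter for sizing, as a plain dict.
job = {
    "spec": {
        "parallelism": 32,
        "template": {
            "spec": {
                "containers": [
                    {"resources": {"limits": {"nvidia.com/gpu": 8}}},
                ]
            }
        },
    },
}


def total_gpus(job: dict) -> int:
    """GPUs per pod (summed over its containers) times the number of parallel pods."""
    pod_spec = job["spec"]["template"]["spec"]
    per_pod = sum(
        int(c.get("resources", {}).get("limits", {}).get("nvidia.com/gpu", 0))
        for c in pod_spec["containers"]
    )
    return per_pod * job["spec"]["parallelism"]


print(total_gpus(job))  # 256
```

So this one short file claims 256 GPUs. If the cluster has fewer free, some pods will sit in Pending.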

What happens:

The YAML hits the API server, which validates it and stores it in etcd.
The Job controller creates 32 pods, each requesting 8 GPUs. The scheduler finds a node with 8 free GPUs for each pod and binds the pod to it (one pod per node on 8-GPU servers).
On each node, kubelet sees its assigned pod and asks the container runtime (containerd) to start the container.
The CNI plugin (or chain, via Multus) wires the pod into the network — eth0 from the default CNI, net1..net8 from the SR-IOV CNI for the RDMA rails.
The container starts. Inside, torchrun finds its peers via the bootstrap network, opens RDMA QPs, runs the training.
kubelet continually reports pod and node status back to the API server. If a pod dies, the Job controller creates a replacement and the scheduler picks a node for it.

The whole thing is declarative. You said "I want 32 pods of this kind"; k8s makes it happen and keeps it happening.
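The placement step in the walkthrough can be sketched as a filter over nodes by free GPU count. This is a deliberately naive first-fit model, not the real scheduler, which also scores candidates and weighs many other constraints. The node names and the schedule function are illustrative only.

```python
def schedule(pods, nodes, gpus_per_pod):
    """Assign each pod to the first node with enough free GPUs.

    Returns {pod: node}, with None for pods that fit nowhere (they stay pending).
    """
    free = dict(nodes)  # copy so the caller's view is untouched
    placement = {}
    for pod in pods:
        node = next((n for n, g in free.items() if g >= gpus_per_pod), None)
        placement[pod] = node
        if node is not None:
            free[node] -= gpus_per_pod  # those GPUs are now taken
    return placement


nodes = {"node-a": 8, "node-b": 8, "node-c": 4}  # free GPUs per node
placement = schedule(["pod-0", "pod-1", "pod-2"], nodes, gpus_per_pod=8)
print(placement)
# {'pod-0': 'node-a', 'pod-1': 'node-b', 'pod-2': None}
```

In this toy run pod-2 stays unplaced because node-c has only 4 free GPUs. In a real cluster that pod would show up as Pending until capacity frees up.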

The bits a network engineer notices

Three things stand out coming from networking:

1. Self-healing is the default

If a node dies, the pods it was running get recreated on other nodes, as long as a controller such as a Deployment or Job owns them. You don't ssh in and fix it; the cluster does. Coming from manually configured switches, this feels like magic the first few times. It's also why k8s gets used at scale: fewer human-induced outages.
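For a networking reader, node failure detection looks a lot like a hold timer. The sketch below flags nodes whose last heartbeat is older than a grace period and lists the pods that need replacing. The 40-second grace value, the timestamps, and the function names are assumptions made for this example, not settings read from any cluster.

```python
GRACE_SECONDS = 40  # hypothetical threshold for this sketch


def not_ready_nodes(last_heartbeat, now, grace=GRACE_SECONDS):
    """Nodes whose most recent heartbeat is older than the grace period."""
    return sorted(n for n, t in last_heartbeat.items() if now - t > grace)


def pods_to_replace(pod_to_node, dead_nodes):
    """Pods that were running on a node now considered dead."""
    return sorted(p for p, n in pod_to_node.items() if n in dead_nodes)


# Last heartbeat time per node, in seconds on some shared clock.
last_heartbeat = {"node-a": 100, "node-b": 55, "node-c": 98}
pod_to_node = {"pod-0": "node-a", "pod-1": "node-b", "pod-2": "node-b"}

dead = not_ready_nodes(last_heartbeat, now=105)
print(dead)                                # ['node-b']
print(pods_to_replace(pod_to_node, dead))  # ['pod-1', 'pod-2']
```

The replacement pods then go through ordinary scheduling, like any new pod.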

2. Configuration is data

There's no configure terminal. Everything is YAML. Everything goes through the API server. Everything's versioned (in git, ideally). When a Deployment change goes wrong, you run kubectl rollout undo, or re-apply the previous YAML from git. Compared to telnet'ing into switches and trying to remember which config you typed, it's a huge leap.
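Treating configuration as data means a rollback is just re-applying an earlier record. Here is a toy revision history to show the idea. RevisionHistory and the image tags are made up, and this is not how kubectl stores revisions internally.

```python
class RevisionHistory:
    """Keeps every applied config so the previous one can be restored."""

    def __init__(self):
        self._revisions = []

    def apply(self, config: dict) -> None:
        """Record a new revision; a copy is stored so later edits cannot alter history."""
        self._revisions.append(dict(config))

    @property
    def current(self) -> dict:
        return self._revisions[-1]

    def undo(self) -> dict:
        """Roll back by re-applying the previous revision as a new revision."""
        if len(self._revisions) < 2:
            raise ValueError("no earlier revision to roll back to")
        self.apply(self._revisions[-2])
        return self.current


history = RevisionHistory()
history.apply({"image": "trainer:1.0", "replicas": 4})
history.apply({"image": "trainer:1.1", "replicas": 4})  # the bad release
print(history.undo()["image"])  # trainer:1.0
```

The rollback is recorded as a new revision rather than deleting the bad one, so the history still shows what happened.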

3. The control plane is something you can lose

k8s has its own control plane (API server, etcd, scheduler, controller-manager). If that fails, no new pods can be scheduled and pods lost to a node failure are not rescheduled, but existing pods keep running. For AI training jobs that run for weeks, this is usually OK; the control plane only matters for changes. It's similar to a BGP control-plane failure with the FIB already programmed: the data plane keeps forwarding on its last known state until the control plane returns.
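The split between "changes need the control plane" and "running work does not" fits in a few lines. ToyControlPlane is a made-up class for illustration; it models only the behaviour described above.

```python
class ToyControlPlane:
    """Toy model: new work needs the control plane, running work does not."""

    def __init__(self):
        self.up = True
        self.running = ["pod-0", "pod-1"]  # pods already placed on nodes

    def submit(self, pod: str) -> bool:
        """Accept and place a new pod only while the control plane is up."""
        if not self.up:
            return False
        self.running.append(pod)
        return True


cp = ToyControlPlane()
cp.up = False               # simulate an API server / etcd outage
print(cp.submit("pod-2"))   # False
print(cp.running)           # ['pod-0', 'pod-1']

cp.up = True                # control plane recovers
print(cp.submit("pod-2"))   # True
print(cp.running)           # ['pod-0', 'pod-1', 'pod-2']
```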

Where AI clusters break the default

K8s was designed for stateless web workloads. AI training is different:

Pods need direct hardware access (RDMA NICs, GPUs): solved by the Device Plugin pattern (nvidia.com/gpu resource) and SR-IOV CNI for VFs.
Pods need multiple network interfaces: solved by Multus (by default a pod gets exactly one interface besides loopback).
Jobs are batch runs, not long-lived services: solved by Volcano, Kueue (k8s native), or just bare Job resources.
GPUs are expensive, so scheduling matters: solved by GPU-aware schedulers and topology hints.
NCCL bootstrap needs all pods up before any can start: solved by gang scheduling (Volcano).

The good news: every one of these has a solution. The bad news: each is a separate component you have to install and tune. That's why AI clusters look more complex than a stock k8s cluster.
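Gang scheduling is the least familiar item in that list, so here is a minimal all-or-nothing sketch: either every pod of the job gets a node, or none is started. It only illustrates the idea and is not Volcano's algorithm. The function and node names are hypothetical.

```python
def gang_schedule(pod_count, gpus_per_pod, free_gpus):
    """Place all pods or none.

    Returns {pod_index: node}, or None if the whole gang does not fit.
    """
    free = dict(free_gpus)  # work on a copy; commit nothing until all pods fit
    placement = {}
    for i in range(pod_count):
        node = next((n for n, g in free.items() if g >= gpus_per_pod), None)
        if node is None:
            return None  # one pod cannot fit, so nothing is started
        free[node] -= gpus_per_pod
        placement[i] = node
    return placement


free_gpus = {"node-a": 8, "node-b": 8, "node-c": 8}
print(gang_schedule(4, 8, free_gpus))  # None
print(gang_schedule(3, 8, free_gpus))  # {0: 'node-a', 1: 'node-b', 2: 'node-c'}
```

Without the all-or-nothing rule, the 4-pod job would start three pods that hold 24 GPUs while waiting for a fourth peer that has nowhere to run.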

What you should remember

K8s is an orchestrator — describe what you want, the cluster makes it happen.
Pod = smallest deployable unit: one or more containers sharing a network namespace, scheduled together.
The control plane is API server + etcd + scheduler + controller-manager. Lose them and nothing new gets scheduled, but existing pods keep running.
AI workloads need extras: Multus for multiple NICs, SR-IOV CNI for VF passthrough, GPU/Network Operators for the driver chain, gang scheduling for batch jobs.
Coming from networking: think of k8s as the fabric controller. Pods are the data plane. Nodes are the switches. CNI is the dataplane driver.

Next: K8s Networking 101 → — how pods get IPs, how services work, what CNI actually does, and where the multi-NIC gap (and Multus) comes from.
