
What Is AI Training?

You do not need to understand calculus. You do not need to read a research paper. You need to understand what AI training does to your network. Here it is.


A model is a giant lookup function

Think of a routing table. It takes an input (a destination IP) and produces an output (a next-hop). A neural network does the same thing — but instead of IP prefixes, the inputs are images, text, or audio, and instead of next-hops, the outputs are predictions.

The "routing table" of a neural network is its parameters (also called weights). They're just numbers. Billions of them.

Model                          Parameters   Size in memory
Small demo model (~the lab)    1.2 M        4.8 MB
ResNet-50 (image classifier)   25 M         100 MB
Llama 2 7B                     7 B          28 GB
GPT-3                          175 B        700 GB
Llama 3 405B                   405 B        1.6 TB
GPT-4 (estimated)              1.8 T        7.2 TB

These are not just file sizes — these are the weights in GPU memory, replicated across every GPU in the cluster.
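The sizes in that table follow from simple arithmetic: parameters × 4 bytes per fp32 weight. A quick sketch (production models often store weights in fp16/bf16, which halves these numbers):

```python
# Back-of-the-envelope check of the table above: parameters x 4 bytes
# (fp32) gives the weight size in memory.
BYTES_PER_PARAM = 4  # fp32; fp16/bf16 would halve this

models = {
    "ResNet-50": 25e6,
    "Llama 2 7B": 7e9,
    "GPT-3": 175e9,
}

for name, params in models.items():
    gb = params * BYTES_PER_PARAM / 1e9
    print(f"{name}: {gb:g} GB")
# GPT-3: 175e9 params x 4 B = 700 GB, matching the table
```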


Training = convergence

If a model is a routing table, then training is OSPF convergence.

You start with random routes (a garbage parameter set). You send traffic through the network (feed training data through the model). You measure how badly it routes (compute the loss — how wrong the model is). Then you adjust the routes (update the parameters).

Repeat millions of times. The routes get better. The loss goes down. The model converges.

The difference: OSPF converges in seconds. Training a frontier model converges in weeks, across thousands of GPUs, costing millions of dollars per run.


The training loop

Every training step is the same four phases:

  1. Forward pass — feed a batch of training data through the model. Get a prediction.
  2. Compute loss — how wrong was that prediction?
  3. Backward pass — compute the gradient for each parameter: which direction, and how far, should this parameter move to make the next loss smaller? Think of it as an OSPF cost update — every "router" (parameter) gets a delta.
  4. Update weights — apply the gradient. Nudge every parameter a tiny amount in the right direction.

One full loop = one step. A frontier model trains for hundreds of thousands of steps.
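The four phases can be sketched end to end on a one-parameter model (a toy linear model y = w·x with made-up data; the "correct" rule it should learn is y = 3x):

```python
# Minimal sketch of the four-phase training loop, pure Python.
# Toy data: the model should discover that y = 3 * x.
data = [(1.0, 3.0), (2.0, 6.0), (4.0, 12.0)]  # (input, label) pairs
w = 0.0     # start from a "garbage" parameter
lr = 0.02   # learning rate: how big each adjustment is

for step in range(200):
    grad = 0.0
    loss = 0.0
    for x, y in data:
        pred = w * x                  # 1. forward pass
        loss += (pred - y) ** 2       # 2. compute loss (squared error)
        grad += 2 * (pred - y) * x    # 3. backward pass (d loss / d w)
    w -= lr * grad                    # 4. update weights

print(round(w, 3))  # prints 3.0 -- the parameter has converged
```

Two hundred steps, and the "route" settles on the right value. A frontier model does the same thing with billions of parameters instead of one.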


Three flavors, same network pattern

You'll hear AI engineers talk about three styles of training:

  • Supervised — show the model labeled examples ("image → cat"). Most enterprise AI today.
  • Reinforcement learning (RL) — let the model act, give it a reward signal, repeat. This is how ChatGPT was fine-tuned (RLHF) and how AlphaGo learned to play Go.
  • Self-supervised / unsupervised — give the model raw data, no labels, let it learn structure. This is how LLMs are pre-trained.

The math, the data flow inside the GPU, the loss function — these differ. But from your fabric's perspective, they all look the same: each GPU computes a gradient, AllReduce shares it with every other GPU, the model updates, the next step starts. Same traffic shape. Same bandwidth. Same sync frequency. Same tail-latency problem.
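That shared sync step can be sketched with plain lists standing in for per-GPU gradient tensors. This is a toy model of AllReduce (averaging variant), not a real NCCL call, and the gradient values are made up:

```python
# Toy AllReduce: each "GPU" holds its own local gradient; after the
# operation, every GPU holds the element-wise average of all of them.
def allreduce_mean(grads_per_gpu):
    n_gpus = len(grads_per_gpu)
    length = len(grads_per_gpu[0])
    summed = [sum(g[i] for g in grads_per_gpu) for i in range(length)]
    # Every GPU receives an identical copy of the averaged gradient.
    return [[s / n_gpus for s in summed] for _ in range(n_gpus)]

local_grads = [
    [2.0, -4.0],   # GPU 0's gradient
    [4.0, -2.0],   # GPU 1's gradient
    [0.0, -6.0],   # GPU 2's gradient
]
synced = allreduce_mean(local_grads)
print(synced[0])  # prints [2.0, -4.0] -- same on every GPU
```

The training style changes what goes into `local_grads`; it never changes the fact that all of them must be exchanged before the next step.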

This curriculum doesn't distinguish between them. When we say "training" we mean any of the three.

Inference is a different story entirely — different traffic, different fabric design. We'll cover it in its own section once the training-fabric story is complete.


Key vocabulary (network engineer edition)

AI term             What it means                                    Networking analogy
Parameter / Weight  One number in the model                          One entry in a routing table
Gradient            How much to adjust each parameter this step      OSPF cost update
Loss                How wrong the model is (lower = better)          Convergence metric
Batch               Several training examples processed together     Bulk route computation
Epoch               One complete pass through the dataset            Full topology sweep
Step                One forward + backward + update cycle            One SPF calculation
Learning rate       How big each adjustment is                       OSPF cost multiplier
Optimizer           The algorithm that applies gradients to params   SPF, but for matrices

You'll see these words everywhere in the rest of the curriculum. They're the AI engineer's vocabulary. Now you read them as a network engineer.


What flows across your network

Here's the bridge from "training is math" to "training is traffic on your fabric."

In distributed training (the only kind that matters at scale), every GPU computes its own gradient. Then every GPU has to share its gradient with every other GPU so they all stay in sync. This operation is called AllReduce — and it's the AI equivalent of LSA flooding. Every router needs every other router's update before it can converge.

The gradient is the same size as the model. For GPT-3: 700 GB synchronized across every GPU, every training step, every 2–5 seconds, for weeks.
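Taken at face value, those figures imply a sustained rate you can compute in one line. Real jobs shard and overlap this traffic across many links, so read the result as an order-of-magnitude illustration, not a per-link requirement; the 3-second step time is an assumed mid-range of the 2–5 s above:

```python
# Naive bandwidth implied by the GPT-3 figures: 700 GB of gradient
# traffic per step, one step every ~3 s (assumed mid-range of 2-5 s).
grad_bytes = 700e9    # gradient size = model size (fp32)
step_seconds = 3.0    # assumed step time

gbps = grad_bytes * 8 / step_seconds / 1e9
print(f"~{gbps:.0f} Gbit/s sustained")  # ~1867 Gbit/s
```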

That's your network traffic. The next page is what happens when it goes wrong.


What you should remember

  • A model is a function with billions of numbers (parameters). Same size on every GPU.
  • Training adjusts those numbers across millions of steps until the model gives correct answers.
  • Gradients are your network traffic — same size as the model, sent every 2–5 seconds, sustained for weeks.
  • AllReduce = every GPU floods its gradient to every other GPU. The next step cannot start until that's done.

Next: Why the Network Matters → — the cost of waiting, the inversion of bandwidth vs sync time, parallelism strategies, and the network engineer's cheat sheet.