What Is AI Training?
You do not need to understand calculus. You do not need to read a research paper. You need to understand what AI training does to your network. Here it is.
A model is a giant lookup function
Think of a routing table. It takes an input (a destination IP) and produces an output (a next-hop). A neural network does the same thing — but instead of IP prefixes, the inputs are images, text, or audio, and instead of next-hops, the outputs are predictions.
The "routing table" of a neural network is its parameters (also called weights). They're just numbers. Billions of them.
| Model | Parameters | Size in memory (FP32) |
|---|---|---|
| Small demo model (about the scale of the lab model) | 1.2 M | 4.8 MB |
| ResNet-50 (image classifier) | 25 M | 100 MB |
| Llama 2 7B | 7 B | 28 GB |
| GPT-3 | 175 B | 700 GB |
| Llama 3 405B | 405 B | 1.6 TB |
| GPT-4 (estimated) | 1.8 T | 7.2 TB |
These are not just file sizes — these are the weights in GPU memory, replicated across every GPU in the cluster.
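Where do those numbers come from? Parameters × 4 bytes (32-bit floats) = bytes in memory. A minimal sketch in PyTorch (the layer sizes here are invented for illustration) does the same arithmetic on a toy model:

```python
import torch.nn as nn

# Toy "lookup function": a flattened 28x28 image in, 10 class scores out.
# The layer sizes are arbitrary; the point is the counting.
model = nn.Sequential(
    nn.Linear(784, 1024),
    nn.ReLU(),
    nn.Linear(1024, 10),
)

n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params:,} parameters")                             # ~0.8 M
print(f"{n_params * 4 / 1e6:.1f} MB at 4 bytes (FP32) each")  # ~3.3 MB in memory
```

Scale the same arithmetic to 175 billion parameters and you get GPT-3's 700 GB.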
Training = convergence
If a model is a routing table, then training is OSPF convergence.
You start with random routes (a garbage parameter set). You send traffic through the network (feed training data through the model). You measure how badly it routes (compute the loss — how wrong the model is). Then you adjust the routes (update the parameters).
Repeat millions of times. The routes get better. The loss goes down. The model converges.
The difference: OSPF converges in seconds. Training a frontier model converges in weeks, across thousands of GPUs, costing millions of dollars per run.
The training loop
Every training step is the same four phases:
- Forward pass — feed a batch of training data through the model. Get a prediction.
- Compute loss — how wrong was that prediction?
- Backward pass — compute the gradient for each parameter: how should this parameter change to make the next step's loss smaller? Think of it as an OSPF cost update: every "router" (parameter) gets a delta.
- Update weights — apply the gradient. Nudge every parameter a tiny amount in the right direction.
One full loop = one step. A frontier model trains for hundreds of thousands of steps.
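In code, the four phases are four lines. A minimal sketch, assuming PyTorch; the model, data, and hyperparameters below are placeholders, not any real workload:

```python
import torch
import torch.nn as nn

model = nn.Linear(100, 10)                               # stand-in model
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # lr = learning rate

inputs = torch.randn(32, 100)                 # one batch of 32 examples
labels = torch.randint(0, 10, (32,))

for step in range(1000):                      # frontier runs: hundreds of thousands of steps
    predictions = model(inputs)               # 1. forward pass
    loss = loss_fn(predictions, labels)       # 2. compute loss: how wrong?
    optimizer.zero_grad()
    loss.backward()                           # 3. backward pass: one gradient per parameter
    optimizer.step()                          # 4. update weights: nudge every parameter
```

On a single GPU this loop never touches the network. Distributed training adds one more phase to it, which is where your fabric comes in.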
Three flavors, same network pattern
You'll hear AI engineers talk about three styles of training:
- Supervised — show the model labeled examples ("image → cat"). Most enterprise AI today.
- Reinforcement learning (RL) — let the model act, give it a reward signal, repeat. This is how ChatGPT was fine-tuned (RLHF) and how AlphaGo learned to play Go.
- Self-supervised / unsupervised — give the model raw data, no labels, let it learn structure. This is how LLMs are pre-trained.
The math, the data flow inside the GPU, the loss function — these differ. But from your fabric's perspective, they all look the same: each GPU computes a gradient, AllReduce shares it with every other GPU, the model updates, the next step starts. Same traffic shape. Same bandwidth. Same sync frequency. Same tail-latency problem.
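A minimal sketch of that shared phase, assuming PyTorch's torch.distributed (the process-group setup and the surrounding training loop are omitted; frameworks like DDP do this for you automatically, it is spelled out here only to make the traffic visible):

```python
import torch.distributed as dist

def sync_gradients(model, world_size):
    """Run after loss.backward() on every GPU, every step."""
    for param in model.parameters():
        if param.grad is None:
            continue
        # Every GPU contributes its gradient; every GPU receives the sum.
        dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
        param.grad /= world_size   # average, so every replica updates identically
```

Supervised, RL, or self-supervised: the loop around this call changes, but this call, and the traffic it generates, does not.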
This curriculum doesn't distinguish between them. When we say "training" we mean any of the three.
Inference is a different story entirely — different traffic, different fabric design. We'll cover it in its own section once the training-fabric story is complete.
Key vocabulary (network engineer edition)
| AI term | What it means | Networking analogy |
|---|---|---|
| Parameter / Weight | One number in the model | One entry in a routing table |
| Gradient | How much to adjust each parameter this step | OSPF cost update |
| Loss | How wrong the model is (lower = better) | Convergence metric |
| Batch | Several training examples processed together | Bulk route computation |
| Epoch | One complete pass through the dataset | Full topology sweep |
| Step | One forward + backward + update cycle | One SPF calculation |
| Learning rate | How big each adjustment is | OSPF cost multiplier |
| Optimizer | The algorithm that applies gradients to params | SPF, but for matrices |
You'll see these words everywhere in the rest of the curriculum. They're the AI engineer's vocabulary. Now you read them as a network engineer.
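To see how several of these fit together, a back-of-envelope example (all numbers invented for illustration):

```python
dataset_size    = 1_000_000                     # training examples
batch_size      = 2_048                         # examples processed together per step
steps_per_epoch = dataset_size // batch_size    # 488 steps = one full pass = one epoch
epochs          = 3
total_steps     = steps_per_epoch * epochs      # 1,464 forward/backward/update cycles

learning_rate = 0.001
# The update the optimizer applies to every parameter, every step:
#   new_weight = old_weight - learning_rate * gradient
print(steps_per_epoch, "steps per epoch,", total_steps, "steps total")
```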
What flows across your network
Here's the bridge from "training is math" to "training is traffic on your fabric."
In distributed training (the only kind that matters at scale), every GPU computes its own gradient. Then every GPU has to share its gradient with every other GPU so they all stay in sync. This operation is called AllReduce — and it's the AI equivalent of LSA flooding. Every router needs every other router's update before it can converge.
The gradient is the same size as the model. For GPT-3: 700 GB synchronized across every GPU, every training step, every 2–5 seconds, for weeks.
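A back-of-envelope sketch of that traffic, assuming FP32 gradients and a ring AllReduce (real systems overlap this with compute and often use 16-bit gradients, so treat these as order-of-magnitude numbers; the cluster size is invented):

```python
params     = 175e9                      # GPT-3 scale
grad_bytes = params * 4                 # FP32 gradient: 700 GB per step
n_gpus     = 1024                       # hypothetical cluster size

# Ring AllReduce moves roughly 2 * (N-1)/N of the data through each GPU's links.
per_gpu_bytes = 2 * (n_gpus - 1) / n_gpus * grad_bytes

step_time_s = 3.0                       # middle of the 2-5 second range
print(f"{grad_bytes / 1e9:.0f} GB of gradients per step")
print(f"{per_gpu_bytes / 1e12:.2f} TB on the wire per GPU per step, "
      f"~{per_gpu_bytes / step_time_s / 1e9:.0f} GB/s sustained")
```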
That's your network traffic. The next page is what happens when it goes wrong.
What you should remember
- A model is a function with billions of numbers (parameters). Same size on every GPU.
- Training adjusts those numbers across millions of steps until the model gives correct answers.
- Gradients are your network traffic — same size as the model, sent every 2–5 seconds, sustained for weeks.
- AllReduce = every GPU floods its gradient to every other GPU. The next step cannot start until that's done.
Next: Why the Network Matters → — the cost of waiting, the inversion of bandwidth vs sync time, parallelism strategies, and the network engineer's cheat sheet.