What Is AI Training?

You do not need to understand calculus. You do not need to read a research paper. You need to understand what AI training does to your network. Here it is.

A model is a giant lookup function

Think of a routing table. It takes an input (a destination IP) and produces an output (a next-hop). A neural network does the same thing — but instead of IP prefixes, the inputs are images, text, or audio, and instead of next-hops, the outputs are predictions.

The "routing table" of a neural network is its parameters (also called weights). They're just numbers. Billions of them.

Model	Parameters	Size in memory (FP32, 4 bytes per parameter)
Small demo model (~the lab)	1.2 M	4.8 MB
ResNet-50 (image classifier)	25 M	100 MB
Llama 2 7B	7 B	28 GB
GPT-3	175 B	700 GB
Llama 3 405B	405 B	1.6 TB
GPT-4 (estimated)	1.8 T	7.2 TB

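These are not just file sizes. They are the weights as held in GPU memory, at 4 bytes per parameter (FP32). With plain data parallelism, a full copy sits on every GPU in the cluster; models too large for a single GPU have to be split across several, which the next page covers under parallelism strategies.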
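The sizes in the table follow from simple arithmetic. The sketch below reproduces a few rows, assuming 4 bytes per parameter (FP32) and decimal units; the function names are illustrative only, and the figure covers weights alone, not gradients or optimizer state.

```python
BYTES_PER_PARAM_FP32 = 4  # assumption: 32-bit floats, which matches the table


def weight_bytes(num_params: int, bytes_per_param: int = BYTES_PER_PARAM_FP32) -> int:
    """Memory needed to hold the weights alone."""
    return num_params * bytes_per_param


def human(num_bytes: float) -> str:
    """Format a byte count with decimal (1000-based) units."""
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if num_bytes < 1000 or unit == "TB":
            return f"{num_bytes:.1f} {unit}"
        num_bytes /= 1000


# Parameter counts taken from the table above.
models = {
    "Small demo model": 1_200_000,
    "ResNet-50": 25_000_000,
    "Llama 2 7B": 7_000_000_000,
    "GPT-3": 175_000_000_000,
}

for name, params in models.items():
    print(f"{name}: {human(weight_bytes(params))}")
```

Halving the bytes per parameter (for example, 16-bit weights) halves every number in that column, which is why the precision matters when you size a fabric.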

Training = convergence

If a model is a routing table, then training is OSPF convergence.

You start with random routes (a garbage parameter set). You send traffic through the network (feed training data through the model). You measure how badly it routes (compute the loss — how wrong the model is). Then you adjust the routes (update the parameters).

Repeat millions of times. The routes get better. The loss goes down. The model converges.

The difference: OSPF converges in seconds. Training a frontier model converges in weeks, across thousands of GPUs, costing millions of dollars per run.

The training loop

Every training step is the same four phases:

Forward pass: feed a batch of training data through the model and get a prediction.
Compute loss: measure how wrong that prediction was.
Backward pass: compute the gradient for each parameter, meaning the direction and amount it should move to make the loss smaller on the next step. Think of it as an OSPF cost update: every "router" (parameter) gets a delta.
Update weights: apply the gradient, nudging every parameter a tiny amount in the right direction.

One full loop = one step. A frontier model trains for hundreds of thousands of steps.
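The four phases above can be sketched on a toy model with a single parameter. Everything here is an illustrative assumption: the data, the "true" weight of 3.0, the learning rate, and the deterministic starting value of 0.0 (a real run would start from random values). It is a minimal sketch of gradient descent, not how any production framework implements it.

```python
# Toy model: prediction = w * x, with one parameter w.
# Assumption for illustration: the data was generated with w = 3.0.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]


def train(steps: int = 200, learning_rate: float = 0.01, w: float = 0.0):
    """Run the four-phase loop and return the final weight and loss history."""
    losses = []
    n = len(data)
    for _ in range(steps):
        # 1. Forward pass: run the batch through the model.
        preds = [w * x for x, _ in data]
        # 2. Compute loss: mean squared error between prediction and target.
        loss = sum((p - y) ** 2 for p, (_, y) in zip(preds, data)) / n
        # 3. Backward pass: d(loss)/dw = (2 / n) * sum((pred - y) * x).
        grad = 2 * sum((p - y) * x for p, (x, y) in zip(preds, data)) / n
        # 4. Update weights: step against the gradient, scaled by the learning rate.
        w -= learning_rate * grad
        losses.append(loss)
    return w, losses


final_w, history = train()
print(f"final w = {final_w:.6f}")
print(f"first loss = {history[0]:.4f}, last loss = {history[-1]:.3e}")
```

A real model repeats the same loop with billions of parameters instead of one, which is where the gradient size, and therefore the network traffic, comes from.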

Three flavors, same network pattern

You'll hear AI engineers talk about three styles of training:

Supervised: show the model labeled examples ("image → cat"). Most enterprise AI today is trained this way.
Reinforcement learning (RL): let the model act, give it a reward signal, repeat. This is how ChatGPT was fine-tuned (RLHF) and how AlphaGo learned to play Go.
Self-supervised / unsupervised: give the model raw data with no labels and let it learn the structure. This is how LLMs are pre-trained.

The math, the data flow inside the GPU, the loss function — these differ. But from your fabric's perspective, they all look the same: each GPU computes a gradient, AllReduce shares it with every other GPU, the model updates, the next step starts. Same traffic shape. Same bandwidth. Same sync frequency. Same tail-latency problem.

This curriculum doesn't distinguish between them. When we say "training" we mean any of the three.

Inference is a different story entirely — different traffic, different fabric design. We'll cover it in its own section once the training-fabric story is complete.

Key vocabulary (network engineer edition)

AI term	What it means	Networking analogy
Parameter / Weight	One number in the model	One entry in a routing table
Gradient	How much to adjust each parameter this step	OSPF cost update
Loss	How wrong the model is (lower = better)	Convergence metric
Batch	Several training examples processed together	Bulk route computation
Epoch	One complete pass through the dataset	Full topology sweep
Step	One forward + backward + update cycle	One SPF calculation
Learning rate	How big each adjustment is	OSPF cost multiplier
Optimizer	The algorithm that applies gradients to params	SPF, but for matrices

You'll see these words everywhere in the rest of the curriculum. They're the AI engineer's vocabulary. Now you read them as a network engineer.
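Batch, step, and epoch are tied together by simple arithmetic, which the sketch below makes explicit. The dataset size, batch size, and epoch count are hypothetical numbers chosen for illustration, and the rounding assumes the final partial batch is still processed.

```python
import math


def steps_per_epoch(dataset_size: int, batch_size: int) -> int:
    """Number of steps needed to see every example once."""
    return math.ceil(dataset_size / batch_size)


def total_steps(dataset_size: int, batch_size: int, epochs: int) -> int:
    """Number of forward + backward + update cycles in the whole run."""
    return steps_per_epoch(dataset_size, batch_size) * epochs


# Hypothetical job: 1,000,000 examples, 512 per batch, 3 epochs.
print(steps_per_epoch(1_000_000, 512))
print(total_steps(1_000_000, 512, 3))
```

Each of those steps ends with one gradient synchronization, so the total step count is also the number of times the fabric has to carry a full gradient exchange.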

What flows across your network

Here's the bridge from "training is math" to "training is traffic on your fabric."

In distributed training (the only kind that matters at scale), every GPU computes its own gradient. The GPUs then have to combine those gradients, typically by summing or averaging them, so that every GPU ends up with the same result and they all stay in sync. This operation is called AllReduce, and it is the AI equivalent of LSA flooding: every router needs every other router's update before it can converge.
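What AllReduce produces can be shown in a few lines. The sketch below models only the result (every GPU ends up holding the same averaged gradient), not the way a real collective library moves data over the wire; the function name and the tiny three-parameter gradients are illustrative assumptions.

```python
def all_reduce_mean(per_gpu_gradients):
    """Return what each GPU holds after an averaging AllReduce.

    per_gpu_gradients: one list of floats per GPU, all of the same length.
    """
    num_gpus = len(per_gpu_gradients)
    num_params = len(per_gpu_gradients[0])
    # Reduce: combine the matching entry from every GPU.
    averaged = [
        sum(grad[i] for grad in per_gpu_gradients) / num_gpus
        for i in range(num_params)
    ]
    # "All": every GPU receives its own copy of the same result.
    return [list(averaged) for _ in range(num_gpus)]


# Four hypothetical GPUs, each with a different local gradient.
local_gradients = [
    [1.0, 2.0, 3.0],
    [3.0, 2.0, 1.0],
    [2.0, 2.0, 2.0],
    [2.0, 2.0, 2.0],
]

for gpu_id, grad in enumerate(all_reduce_mean(local_gradients)):
    print(f"GPU {gpu_id}: {grad}")
```

No GPU can apply its update until it holds the combined result, so the slowest participant sets the pace of the whole step.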

The gradient is the same size as the model. For GPT-3: 700 GB synchronized across every GPU, every training step, every 2–5 seconds, for weeks.
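To turn those figures into a link rate, here is a rough back-of-the-envelope sketch. It assumes a ring AllReduce, in which each GPU sends about 2 × (N − 1) / N times the gradient size per step, takes the 700 GB and 2–5 second figures above at face value, and uses a hypothetical cluster of 1,024 GPUs. Treat the output as a naive estimate under those assumptions, not a measurement of any real job.

```python
def ring_allreduce_bytes_per_gpu(gradient_bytes: float, num_gpus: int) -> float:
    """Bytes each GPU sends in one ring AllReduce (assumed algorithm)."""
    return 2 * (num_gpus - 1) / num_gpus * gradient_bytes


def required_gbps(bytes_per_step: float, step_seconds: float) -> float:
    """Sustained rate in gigabits per second to move the bytes within one step."""
    return bytes_per_step * 8 / step_seconds / 1e9


GPT3_GRADIENT_BYTES = 700e9  # figure from the text above (FP32)
NUM_GPUS = 1024              # hypothetical cluster size

per_gpu = ring_allreduce_bytes_per_gpu(GPT3_GRADIENT_BYTES, NUM_GPUS)
for step_seconds in (2.0, 5.0):
    rate = required_gbps(per_gpu, step_seconds)
    print(f"step every {step_seconds:.0f} s -> {rate:,.0f} Gb/s per GPU")
```

The exact numbers depend heavily on the assumptions, but the shape does not: the load is large, periodic, and synchronized across every GPU at once.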

That's your network traffic. The next page is what happens when it goes wrong.

What you should remember

A model is a function with billions of numbers (parameters). Same size on every GPU.
Training adjusts those numbers over many steps (hundreds of thousands for a frontier model) until the loss stops improving and the model gives correct answers.
Gradients are your network traffic — same size as the model, sent every 2–5 seconds, sustained for weeks.
AllReduce = every GPU floods its gradient to every other GPU. The next step cannot start until that's done.

Next: Why the Network Matters → — the cost of waiting, the inversion of bandwidth vs sync time, parallelism strategies, and the network engineer's cheat sheet.
