How Inference Differs from Training
You've spent eight sections learning the training fabric: 8,000 GPUs running a synchronous gradient AllReduce every 2 seconds, 700 GB collective bursts, RoCE v2 with PFC and ECN tuned for elephant flows.
Inference is none of those things.
Inference is millions of small, asynchronous, latency-critical requests. The math is different, the traffic shape is different, the fabric design is different. This page is the bridge.
What an inference workload looks like
A user types a question into a chatbot. Three things happen inside the cluster:
- The frontend routes the request to a model replica.
- The model replica runs a forward pass over the input (no backward pass, no gradients) and produces output tokens one at a time.
- The response streams back to the user, token by token.
If the model is 70 B parameters, it fits on 2–4 GPUs (using tensor parallelism). At scale, you have hundreds or thousands of these replicas, each handling its share of incoming requests.
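A quick back-of-the-envelope check of that claim, assuming FP16 weights and 80 GB of HBM per GPU (as on an H100):

```python
# Rough sizing: does a 70B-parameter model fit on a handful of GPUs?
# Assumptions: FP16 weights, 80 GB HBM per GPU (H100), ~20% of HBM kept
# free for KV-cache, activations, and framework overhead.
import math

weight_bytes = 70e9 * 2                # FP16 -> ~140 GB of weights
usable_hbm_per_gpu = 80e9 * 0.8        # ~64 GB usable per GPU

min_gpus = math.ceil(weight_bytes / usable_hbm_per_gpu)
print(min_gpus)   # 3 -> in practice TP degrees are powers of two, so 4 GPUs
                  #      (or 2 GPUs with 8-bit weights)
```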
So an inference cluster is:
- Many small "model replicas" (2–8 GPUs each), each independent
- A frontend / router that load-balances requests across replicas
- A storage layer for KV-cache (the per-request attention state built up during decode)
- Optionally, retrieval (RAG) hits to a vector DB or document store
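Here is a minimal sketch of that request path in Python. The `Replica` class and its `prefill`/`decode_step` methods are dummy stand-ins for whatever your serving engine actually provides (vLLM, TGI, Triton, etc.); the point is the shape of the flow, not the implementation.

```python
# Minimal sketch of the request path above: route, prefill, then token-by-token
# decode. Replica is a dummy stand-in for a real serving engine; it only fakes
# token generation to show the shape of the flow.
import random

class Replica:
    def prefill(self, prompt):
        # One forward pass over the whole prompt: builds the KV-cache and
        # returns the first output token. TTFT is measured up to this point.
        kv_cache = {"prompt_tokens": prompt.split(), "generated": []}
        return kv_cache, "first"

    def decode_step(self, token, kv_cache):
        # One forward pass per new token, reusing and extending the KV-cache.
        kv_cache["generated"].append(token)
        return random.choice(["hello", "world", "<eos>"]), kv_cache

REPLICAS = [Replica() for _ in range(4)]          # hundreds or thousands in production

def handle_request(prompt, max_new_tokens=32):
    replica = random.choice(REPLICAS)             # frontend load-balances across replicas
    kv_cache, token = replica.prefill(prompt)     # prefill -> first token (TTFT)
    yield token
    for _ in range(max_new_tokens - 1):           # decode -> one token per step (TPOT)
        token, kv_cache = replica.decode_step(token, kv_cache)
        if token == "<eos>":
            break
        yield token                               # streams back to the user as it is produced

print(list(handle_request("why is the sky blue?")))
```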
The big differences
| Dimension | Training | Inference |
|---|---|---|
| Traffic shape | Synchronous collectives (AllReduce) | Asynchronous request/response |
| Flow size | GBs (gradient = model size) | KBs–MBs (request + KV-cache movement) |
| Concurrency | One job, all GPUs in lockstep | Thousands of concurrent requests |
| Latency budget | Per-step (~50 ms tolerable) | Per-request TTFT (hundreds of ms) and per-token TPOT (tens of ms) |
| Failure tolerance | One slow link stalls everyone | Slow replicas only affect their requests |
| GPU work per step | Forward + backward (backward ≈ half the step time) | Forward pass only |
| Network protocol | RDMA over RoCE v2 / IB | Mix: TCP for control, RDMA for KV-cache, gRPC for routing |
| Multi-tenancy | One tenant per job | Many tenants share replicas |
| Scaling pattern | Vertical (more GPUs per job) | Horizontal (more replicas) |
Why the fabric design differs
If you built an inference cluster the same way you built training, you'd over-engineer it.
Training needs:
- Lossless RDMA (PFC + ECN), 1:1 oversub, rail topology
- Every link mission-critical (one slow link stalls everything)
- 400/800 G end to end
Inference needs:
- Best-effort TCP often fine for request routing (gRPC, HTTP)
- Some RDMA for intra-replica or KV-cache transfer
- 4:1 or 2:1 oversub is OK — flows are independent
- Multi-zone fault tolerance — replicas can be lost without affecting other tenants
- 100/200 G usually sufficient (per-flow demand is low)
Many production inference clusters run on standard DC Ethernet with optional RDMA islands for replicas that need it (large-context models, multi-host models). They look closer to a web tier than a training fabric.
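As a sanity check on what those oversubscription ratios mean in port terms, here is a rough sketch; the port counts are hypothetical examples, not recommendations:

```python
# Leaf-switch oversubscription = downlink bandwidth / uplink bandwidth.
# Port counts below are hypothetical examples, not recommendations.
def oversub(down_ports, down_gbps, up_ports, up_gbps):
    return (down_ports * down_gbps) / (up_ports * up_gbps)

print(oversub(32, 400, 32, 400))   # 1.0 -> training-style leaf, 1:1 non-blocking
print(oversub(48, 100, 6, 400))    # 2.0 -> inference leaf at 2:1
print(oversub(48, 100, 3, 400))    # 4.0 -> 4:1, still fine for independent request flows
```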
When inference DOES need a training-style fabric
Three cases push inference back toward training-fabric design:
- Multi-host models. A model that doesn't fit on one host (e.g., 405B+ at FP16) needs tensor parallelism across hosts. Now you need RDMA between replica nodes — same constraints as training (one slow link stalls the model).
- Disaggregated prefill/decode. Some architectures split prefill and decode across different node pools, shipping KV-cache between them. Large KV-cache transfers (GBs) want RDMA.
- Very large speculative-decoding pipelines. Draft model + main model + multiple ranks. Looks more like training than like a web tier.
For most inference workloads (≤70 B models on one host), none of this applies. Inference is a Layer-7 problem, not a Layer-2 problem.
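To see why the first case forces cross-host parallelism, run the same sizing arithmetic as before, assuming FP16 weights and a host with 8× 80 GB GPUs:

```python
# Why a 405B FP16 model cannot stay on one 8-GPU host (80 GB HBM each).
weight_bytes = 405e9 * 2            # ~810 GB of weights alone
hbm_per_host = 8 * 80e9             # 640 GB per host, before KV-cache or activations

print(weight_bytes / 1e9)           # 810.0
print(weight_bytes > hbm_per_host)  # True: tensor parallelism must span hosts, and the
                                    # inter-host links inherit training-style constraints
```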
The latency that matters
In training, you optimize for steady-state throughput — bytes per second under sustained collectives.
In inference, you optimize for two latency metrics:
- TTFT — Time To First Token. The user types a question; how long until the first character of the response appears? This depends on prefill speed (running the model on the entire input prompt before generating the first output token).
- TPOT — Time Per Output Token. Once decoding starts, how fast can you produce each subsequent token? This is the steady-state generation throughput.
A 70 B model on 4 H100s can do:
- TTFT ≈ 200–500 ms for a 2K-token prompt
- TPOT ≈ 30–50 ms per token (20–30 tokens/sec)
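Those two numbers compose into the latency a user actually experiences. A quick calculation using the midpoints of the ranges above:

```python
# End-to-end response latency = TTFT + (output_tokens - 1) * TPOT.
# Using the midpoints of the ranges quoted above for a 70B model on 4 H100s.
ttft_ms = 350                       # prefill of a ~2K-token prompt
tpot_ms = 40                        # per output token during decode
output_tokens = 500

total_ms = ttft_ms + (output_tokens - 1) * tpot_ms
print(total_ms / 1000)              # ~20.3 s wall clock for a 500-token answer
print(1000 / tpot_ms)               # 25 tokens/sec streaming rate
```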
Network contribution to these:
- TTFT: dominated by GPU compute (prefill is FLOPs-heavy). Network = ~5–10 ms.
- TPOT: dominated by HBM memory bandwidth (decode is memory-bound). Network = ~1–2 ms unless there's KV-cache movement.
So inference network latency is important but not dominant for single-replica scenarios. It becomes dominant only when KV-cache moves across replicas (page 2 covers this).
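A rough roofline backs up the memory-bound claim above: every output token has to stream essentially all the weights out of HBM once, so per-token time is bounded below by weight bytes divided by aggregate HBM bandwidth. A sketch, assuming H100-class bandwidth:

```python
# Memory-bandwidth floor on TPOT: each decode step reads every weight from HBM
# at least once. Ignores KV-cache reads, kernel overhead, and tensor-parallel
# communication, which is why measured TPOT (30-50 ms) sits well above this.
weight_bytes = 70e9 * 2             # 70B params, FP16 -> ~140 GB
hbm_bw_per_gpu = 3.35e12            # ~3.35 TB/s per H100 SXM
gpus = 4

tpot_floor_ms = weight_bytes / (gpus * hbm_bw_per_gpu) * 1000
print(tpot_floor_ms)                # ~10.4 ms per token from HBM alone; the network's
                                    # 1-2 ms contribution is small next to that
```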
What you should remember
- Inference is not training. Different traffic, different latency budget, different fabric.
- Standard DC Ethernet handles most inference. RDMA islands exist for special cases (multi-host models, disaggregated serving).
- TTFT and TPOT are the latency metrics, not AllReduce time.
- GPU compute and HBM bandwidth usually dominate — network is a secondary contributor for typical single-host replicas.
- Multi-host models, disaggregated prefill/decode, and large speculative pipelines are the exceptions where inference looks more like training.
Next: Prefill, Decode, and KV-Cache → — what's actually flowing on the wire when an inference request runs.