
How Inference Differs from Training

You've spent eight sections learning the training fabric — 8,000 GPUs synchronously gradient-AllReducing every 2 seconds, 700 GB collective bursts, RoCE v2 with PFC and ECN tuned for elephant flows.

Inference is none of those things.

Inference is millions of small, asynchronous, latency-critical requests. The math is different, the traffic shape is different, the fabric design is different. This page is the bridge.


What an inference workload looks like

A user types a question into a chatbot. Three things happen inside the cluster:

  1. The frontend routes the request to a model replica.
  2. The replica runs the model forward only (no backward pass, no gradients), producing tokens one at a time.
  3. The response streams back to the user, token by token (sketched in code below).
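
A minimal sketch of that flow, with `pick_replica` and `generate_tokens` as hypothetical stand-ins for the router and the model runtime (not a real serving API):

```python
import asyncio

# Hypothetical sketch of the three steps above -- not a real serving API.
REPLICAS = ["replica-0:8000", "replica-1:8000", "replica-2:8000"]

def pick_replica(prompt: str) -> str:
    """Step 1: route the request (here, a trivial hash-based choice)."""
    return REPLICAS[hash(prompt) % len(REPLICAS)]

async def generate_tokens(replica: str, prompt: str):
    """Step 2: forward pass only, producing tokens one at a time (stubbed)."""
    for token in ["RoCE", " is", " RDMA", " over", " Ethernet", "."]:
        await asyncio.sleep(0.03)   # stands in for ~30 ms of decode compute
        yield token

async def handle_request(prompt: str) -> None:
    replica = pick_replica(prompt)
    async for token in generate_tokens(replica, prompt):
        print(token, end="", flush=True)   # Step 3: stream tokens back
    print()

asyncio.run(handle_request("What is RoCE?"))
```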

If the model is 70 B parameters, it fits on 2–4 GPUs (using tensor parallelism). At scale, you have hundreds or thousands of these replicas, each handling its share of incoming requests.
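
The 2–4 GPU figure is simple memory arithmetic. A sketch, assuming FP16 weights and 80 GB of HBM per GPU (H100-class):

```python
# Back-of-envelope replica sizing, assuming FP16 weights on 80 GB GPUs.
params = 70e9
weight_bytes = params * 2                   # FP16: 2 bytes/param -> 140 GB
hbm_per_gpu = 80e9                          # H100-class HBM capacity

min_gpus = -(-weight_bytes // hbm_per_gpu)  # ceiling division -> 2
print(f"weights: {weight_bytes/1e9:.0f} GB -> at least {int(min_gpus)} GPUs")
# Two GPUs leave only ~20 GB for KV-cache and activations; four give
# headroom for longer contexts and bigger batches -- hence "2-4 GPUs".
```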

So an inference cluster is:

  • Many small "model replicas" (2–8 GPUs each), each independent
  • A frontend / router that load-balances requests across replicas
  • A storage layer for KV-cache (per-request attention state held during decode)
  • Optionally, retrieval (RAG) hits to a vector DB or document store

The big differences

| Dimension | Training | Inference |
| --- | --- | --- |
| Traffic shape | Synchronous collectives (AllReduce) | Asynchronous request/response |
| Flow size | GBs (gradient = model size) | KBs–MBs (request + KV-cache movement) |
| Concurrency | One job, all GPUs in lockstep | Thousands of concurrent requests |
| Latency budget | Per-step (~50 ms tolerable) | Per-token (TPOT ~10–50 ms; TTFT critical) |
| Failure tolerance | One slow link stalls everyone | A slow replica affects only its own requests |
| GPU usage | Forward + backward (backward ≈ 50% of step time) | Forward pass only |
| Network protocol | RDMA over RoCE v2 / InfiniBand | Mix: TCP for control, RDMA for KV-cache, gRPC for routing |
| Multi-tenancy | One tenant per job | Many tenants share replicas |
| Scaling pattern | Vertical (more GPUs per job) | Horizontal (more replicas) |

Why the fabric design differs

If you built an inference cluster the same way you built the training cluster, you'd over-engineer it.

Training needs:

  • Lossless RDMA (PFC + ECN), 1:1 oversub, rail topology
  • Every link mission-critical (one slow link stalls everything)
  • 400/800 G end to end

Inference needs:

  • Best-effort TCP often fine for request routing (gRPC, HTTP)
  • Some RDMA for intra-replica or KV-cache transfer
  • 4:1 or 2:1 oversub is OK — flows are independent (worked through below)
  • Multi-zone fault tolerance — replicas can be lost without affecting other tenants
  • 100/200 G usually sufficient (per-flow demand is low)
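
A quick oversubscription sanity check makes the point. The traffic figures below are illustrative assumptions, not measurements from any particular cluster:

```python
# Why 4:1 oversubscription is fine for inference (illustrative numbers).
hosts_per_rack = 16
nic_gbps = 100                 # per-host NIC
uplink_gbps = 400              # rack uplink capacity, e.g. 4 x 100G

oversub = hosts_per_rack * nic_gbps / uplink_gbps         # 4:1
# A host serving requests/responses at ~10 MB/s offers ~0.08 Gbps.
per_host_demand_gbps = 10e6 * 8 / 1e9
rack_demand_gbps = hosts_per_rack * per_host_demand_gbps  # ~1.3 Gbps
print(f"{oversub:.0f}:1 oversub; demand {rack_demand_gbps:.1f} Gbps "
      f"vs {uplink_gbps} Gbps of uplink")
# Independent KB-MB flows almost never peak together, so the uplinks
# stay cold; a training AllReduce would saturate all of them at once.
```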

Many production inference clusters run on standard DC Ethernet with optional RDMA islands for replicas that need it (large-context models, multi-host models). They look closer to a web tier than a training fabric.


When inference DOES need a training-style fabric

Three cases push inference back toward training-fabric design:

  1. Multi-host models. A model that doesn't fit on one host (e.g., 405B+ at FP16) needs tensor parallelism across hosts. Now you need RDMA between replica nodes — same constraints as training (one slow link stalls the model).
  2. Disaggregated prefill/decode. Some architectures split prefill and decode across different node pools, shipping KV-cache between them. Large KV-cache transfers (GBs) want RDMA (sized in the sketch after this list).
  3. Very large speculative-decoding pipelines. Draft model + main model + multiple ranks. Looks more like training than like a web tier.
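
To see why those KV-cache transfers reach gigabytes, here's the standard sizing formula with Llama-2-70B-like shapes (80 layers, 8 KV heads under GQA, head dimension 128) as an assumed example:

```python
# KV-cache per request = 2 (K and V) x layers x kv_heads x head_dim
# x bytes/elem x tokens. Shapes are Llama-2-70B-like -- an assumption.
layers, kv_heads, head_dim = 80, 8, 128
bytes_per_elem = 2                                    # FP16

per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # ~320 KB
for context in (2_048, 32_768):
    print(f"{context:>6}-token context: {per_token * context / 1e9:.2f} GB")
# ~0.67 GB at 2K tokens, ~10.7 GB at 32K. Shipping that per request
# between prefill and decode pools is what pushes these designs to RDMA.
```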

For most inference workloads (≤70 B models on one host), none of this applies. Inference is a Layer-7 problem, not a Layer-2 problem.


The latency that matters

In training, you optimize for steady-state throughput — bytes per second under sustained collectives.

In inference, you optimize for two latency metrics:

  • TTFT — Time To First Token. The user types a question; how long until the first character of the response appears? This depends on prefill speed (running the model on the entire input prompt before generating the first output token).
  • TPOT — Time Per Output Token. Once decoding starts, how fast can you produce each subsequent token? This is the steady-state generation throughput.

A 70 B model on 4 H100s can do:

  • TTFT ≈ 200–500 ms for a 2K-token prompt
  • TPOT ≈ 30–50 ms per token (≈20–33 tokens/sec)
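
Both metrics are easy to measure from the client side of any token-streaming API. A sketch, where `stream_tokens` is a hypothetical stand-in for your serving client:

```python
import time

def measure_latency(stream_tokens, prompt: str):
    """Measure TTFT and mean TPOT from a client-side token stream."""
    start = time.monotonic()
    arrivals = []
    for _token in stream_tokens(prompt):
        arrivals.append(time.monotonic())
    ttft = arrivals[0] - start if arrivals else float("nan")
    # TPOT: mean gap between consecutive tokens after the first.
    gaps = [b - a for a, b in zip(arrivals, arrivals[1:])]
    tpot = sum(gaps) / len(gaps) if gaps else float("nan")
    return ttft, tpot

# Fake stream at ~30 ms/token, standing in for a real serving client.
def fake_stream(prompt):
    for t in ["one", "two", "three", "four"]:
        time.sleep(0.03)
        yield t

ttft, tpot = measure_latency(fake_stream, "hello")
print(f"TTFT {ttft*1e3:.0f} ms, TPOT {tpot*1e3:.0f} ms")
```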

Network contribution to these:

  • TTFT: dominated by GPU compute (prefill is FLOPs-heavy). Network = ~5–10 ms.
  • TPOT: dominated by HBM memory bandwidth (decode is memory-bound). Network = ~1–2 ms unless there's KV-cache movement.

So inference network latency is important but not dominant for single-replica scenarios. It becomes dominant only when KV-cache moves across replicas (page 2 covers this).
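
The memory-bound claim is easy to sanity-check: each decoded token streams essentially all of the weights out of HBM once. A back-of-envelope sketch, assuming FP16 weights and ~3.35 TB/s of HBM bandwidth per H100-class GPU:

```python
# Bandwidth floor on TPOT: every token reads ~all weights from HBM.
weight_bytes = 70e9 * 2                 # 70B params at FP16 -> 140 GB
hbm_bw_per_gpu = 3.35e12                # H100-class, bytes/s (assumption)
gpus = 4

tpot_floor = weight_bytes / (gpus * hbm_bw_per_gpu)   # ~10.4 ms
print(f"HBM bandwidth floor: {tpot_floor*1e3:.1f} ms/token")
# Observed 30-50 ms TPOT sits a few x above this floor (kernel launch
# overheads, KV-cache reads, imperfect overlap); the ~1-2 ms network
# contribution barely registers against either number.
```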


What you should remember

  • Inference is not training. Different traffic, different latency budget, different fabric.
  • Standard DC Ethernet handles most inference. RDMA islands exist for special cases (multi-host models, disaggregated serving).
  • TTFT and TPOT are the latency metrics, not AllReduce time.
  • GPU compute and HBM bandwidth usually dominate — network is a secondary contributor for typical single-host replicas.
  • Multi-host models, disaggregated prefill/decode, and large speculative pipelines are the exceptions where inference looks more like training.

Next: Prefill, Decode, and KV-Cache → what's actually flowing on the wire when an inference request runs.