Prefill, Decode, and KV-Cache
To know what traffic your inference fabric carries, you need to know what the GPUs are actually doing. Inference has two phases — prefill and decode — with completely different compute and network characteristics.
This page explains both, what's on the wire, and where KV-cache fits in.
Phase 1: Prefill
When a user sends a prompt — say, 2,000 tokens of context — the model processes all of those tokens at once in a single forward pass. For each token, each attention layer computes Key (K) and Value (V) vectors, plus attention scores against every token before it.
Compute pattern: highly parallel, FLOPs-bound. The GPU's tensor cores are saturated.
Output: the next token (the first output token) and a giant KV-cache — every K and V vector for every layer for every input token. For a 70 B model with 2K-token input, the KV-cache is roughly 2–4 GB.
The KV-cache is stored in GPU HBM. It's the model's "memory" of the conversation so far.
Network during prefill: almost zero. The whole prefill happens on the GPUs that own the model replica. The only network traffic is the input prompt (a few KB) arriving from the frontend.
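A back-of-envelope sketch of why prefill saturates the tensor cores, using the standard ~2 FLOPs per parameter per token estimate for a dense forward pass (the GPU peak figure here is an assumption):

```python
# Rough prefill cost for a 2,000-token prompt through a 70B dense model,
# using the classic ~2 FLOPs per parameter per token estimate.
params = 70e9
prompt_tokens = 2_000
flops = 2 * params * prompt_tokens        # ~2.8e14 FLOPs

h100_bf16 = 989e12                        # assumed: dense BF16 peak of one H100 SXM
print(f"~{flops / h100_bf16:.2f} s of pure matmul on one GPU")   # ~0.28 s
```

All of that arithmetic runs on data already resident in HBM, which is why the tensor cores, not the network, set the limit.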
Phase 2: Decode
Once prefill produces the first output token, the model generates subsequent tokens one at a time. For each new token (a toy version in code follows this list):
- Compute K, V for this one new token
- Append to KV-cache
- Compute attention against the entire KV-cache (O(N) work, where N is the number of K/V pairs cached so far)
- Produce the next token
- Repeat
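Here is a runnable toy version of that loop, stripped to one layer and one attention head with made-up weights, and with no MLP or sampling. It exists only to show the cache growing by one K/V row per token and the attention pass reading all of it:

```python
import numpy as np

d = 64                                   # head dimension (illustrative)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (0.1 * rng.standard_normal((d, d)) for _ in range(3))

K_cache = np.empty((0, d))               # a real system fills these during prefill
V_cache = np.empty((0, d))

x = rng.standard_normal(d)               # hidden state of the latest token
for step in range(5):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    K_cache = np.vstack([K_cache, k])    # append this token's K/V: cache grows one row
    V_cache = np.vstack([V_cache, v])
    scores = K_cache @ q / np.sqrt(d)    # attend over the ENTIRE cache: O(N) per step
    w = np.exp(scores - scores.max())
    w /= w.sum()
    x = w @ V_cache                      # attention output feeds the next step
    print(f"step {step}: {len(K_cache)} K/V pairs in cache")
```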
Compute pattern: memory-bound, not FLOPs-bound. Every step reads the model weights and the entire KV-cache from HBM. The GPU's HBM bandwidth (~3.35 TB/s on an H100 SXM) is the bottleneck.
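The arithmetic behind that claim, with assumed round numbers: every batch-1 decode step streams all the weights plus the whole cache through HBM once, and that traffic alone bounds tokens per second.

```python
# Memory floor for one batch-1 decode step (assumed round numbers; a 70B FP16
# model really spans 2+ GPUs, so treat this as a per-replica aggregate).
weight_bytes = 70e9 * 2                  # 70B params in FP16
kv_bytes = 3e9                           # ~3 GB KV-cache at 2K context (table below)
hbm_bw = 3.35e12                         # H100 SXM HBM3 bandwidth, bytes/s

step = (weight_bytes + kv_bytes) / hbm_bw
print(f"{step*1e3:.0f} ms/token floor -> at most {1/step:.0f} tokens/s")  # ~43 ms, ~23 tok/s
```

This is also why serving stacks batch many requests per step: the weights are read once per step no matter how many sequences share it.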
Output: one token at a time. Streamed back to the user.
Network during decode: still almost zero per token. Tokens are tiny (a few bytes each). The only ongoing traffic is the output token stream back to the frontend.
So for single-host, single-replica inference, the network is mostly idle. Most inference traffic is not the inference itself.
Where the network gets busy
Three patterns make inference network-heavy.
1. Multi-host model (tensor parallelism)
If the model doesn't fit on one host, you split it across multiple hosts. Tensor parallelism splits each layer's weights across GPUs. During every forward pass, the GPUs have to AllReduce intermediate activations across hosts.
This is exactly like training's AllReduce, except:
- Smaller payload (activations are MBs, not GBs of gradients)
- More frequent (Megatron-style TP issues two AllReduces per transformer layer per forward pass, so hundreds per token decoded)
- Latency-critical (each AllReduce blocks the next token)
For a multi-host 80-layer model serving at 30 tokens/sec, that's roughly 4,800 cross-host AllReduces per second, every one of them on the critical path. The fabric needs RoCE v2 or InfiniBand; TCP won't cut it.
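To put numbers on it, a sketch under assumed shapes (hidden size 8192, batch 1, FP16, two AllReduces per layer as in Megatron-style TP):

```python
# Per-second AllReduce load for cross-host tensor parallelism (assumed shapes).
hidden, layers, tok_per_s = 8192, 80, 30
payload = hidden * 2                         # one token's activations in FP16: 16 KB
ops = 2 * layers * tok_per_s                 # two AllReduces per layer per decoded token
print(f"{ops} AllReduces/s x {payload/1024:.0f} KB = "
      f"{ops * payload / 1e6:.0f} MB/s of payload")   # 4800/s, ~79 MB/s
```

The aggregate bandwidth is trivial; what matters is that each of those 4,800 operations blocks the next token, so per-operation latency is the design target.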
2. Disaggregated prefill / decode
A newer architecture pattern: run prefill on one pool of GPUs (high FLOPs density, good for compute-bound prefill), then ship the KV-cache over the network to a different pool of GPUs (high memory bandwidth, good for memory-bound decode).
After prefill on Node A finishes:
- Send the entire KV-cache (multiple GB) over the network to Node B
- Decode runs on Node B, freeing Node A for the next prefill request
Throughput per dollar improves significantly because each pool is specialized. But the network now has to ship GBs per request between prefill and decode pools. This only works with RDMA — TCP cannot move 4 GB in the time budget a single inference request gets.
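Line-rate math for that KV hop, assuming the ~3 GB cache from the table below (real TCP throughput would be lower still once congestion control and CPU copies are counted):

```python
# Wire time to ship a 3 GB KV-cache from the prefill pool to the decode pool.
kv_bytes = 3e9
for link, gbps in [("400G RDMA", 400), ("100G RDMA", 100), ("25G TCP", 25)]:
    ms = kv_bytes * 8 / (gbps * 1e9) * 1e3
    print(f"{link:>10}: {ms:6.0f} ms")       # 60 ms / 240 ms / 960 ms
```

Every one of those milliseconds lands directly on time-to-first-token, which is why this pattern is built on RDMA transfers between GPU memories rather than a socket copy path.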
3. RAG / vector retrieval
Some inference requests need to retrieve external documents before generating. The flow (a code sketch follows the list):
- User query arrives
- Frontend embeds the query (small model, fast)
- Frontend hits a vector database (e.g., Pinecone, Milvus, or Postgres with pgvector) to find top-K matching documents
- Retrieved documents (say, 4 documents × 2 KB each = 8 KB) are added to the prompt
- Prefill runs on the augmented prompt
- Decode produces the answer
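A minimal sketch of that flow, assuming Postgres with the pgvector extension, a `documents` table, and a hypothetical embed() helper standing in for the small embedding model:

```python
import psycopg2   # assumes a reachable Postgres with the pgvector extension

def embed(text: str) -> list[float]:
    """Hypothetical helper: return the query's embedding vector."""
    raise NotImplementedError

query = "how do I rotate an API key?"
vec = "[" + ",".join(map(str, embed(query))) + "]"   # pgvector's textual format

with psycopg2.connect("dbname=rag") as conn, conn.cursor() as cur:
    # pgvector's <-> operator is L2 nearest-neighbor; LIMIT gives top-K
    cur.execute(
        "SELECT body FROM documents ORDER BY embedding <-> %s::vector LIMIT 4",
        (vec,),
    )
    docs = [row[0] for row in cur.fetchall()]

prompt = "\n\n".join(docs) + "\n\nQuestion: " + query   # augmented prompt -> prefill
```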
The vector DB hit is small but adds latency to time-to-first-token (TTFT), typically 10–50 ms. This is where inference fabric design starts to matter: you want the vector DB and the GPU pool in the same datacenter, on a low-latency path.
KV-cache deep dive
KV-cache is the single biggest reason inference networks look like they do. It's worth understanding the sizes:
Ballpark figures for a 70B-class model in FP16 (exact sizes depend on layer count and attention variant; grouped-query attention, which Llama-3 70B actually uses, shrinks them several-fold):
| Context length | KV-cache size |
|---|---|
| 1K tokens | ~1.5 GB |
| 2K tokens | ~3 GB |
| 8K tokens | ~12 GB |
| 32K tokens | ~50 GB |
| 128K tokens | ~200 GB |
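These sizes come from a simple formula: per token, you store one K and one V vector per KV head per layer. A sketch with Llama-3 70B's published shape plugged in (80 layers, 8 KV heads via grouped-query attention, head dimension 128):

```python
# bytes per token = 2 (K and V) * layers * kv_heads * head_dim * bytes per value
def kv_cache_gb(tokens, layers=80, kv_heads=8, head_dim=128, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens / 1e9

print(kv_cache_gb(8_000))                  # ~2.6 GB with GQA (8 KV heads)
print(kv_cache_gb(8_000, kv_heads=64))     # ~21 GB with full multi-head attention
```

The table's figures sit between those two regimes; treat them as order-of-magnitude, and plug your model's actual layer and head counts into the formula.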
Long context makes the KV-cache rival, and eventually exceed, the model weights themselves: by the table above, a 70B model's ~140 GB of FP16 weights is overtaken somewhere around the 100K-token mark. Moving this around (disaggregated serving, replica failover, request batching across replicas) becomes a major fabric question.
KV-cache management is now one of the hottest research areas in inference infrastructure:
- Paged attention (vLLM) — store KV-cache in fixed-size pages, like virtual memory. Enables sharing across requests (sketched in code below).
- Prefix caching — if many requests share the same system prompt, share the KV-cache for the common prefix.
- KV-cache offload to host DRAM or NVMe — for very long contexts, evict old KV pages to slower memory.
- Compressed KV-cache — quantize K/V vectors to FP8 or even INT4.
Each of these affects what flows on the inference network. If you're designing an inference fabric, ask the model team which they use.
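A toy page table in the spirit of vLLM's paged attention, to make the first two bullets concrete. The class and its methods are illustrative, not vLLM's API:

```python
PAGE_SIZE = 16                                   # tokens per page (vLLM's default block size)

class PagedKV:
    """Toy page table: maps each request's KV positions to physical pages."""
    def __init__(self, num_pages: int):
        self.free = list(range(num_pages))       # physical page allocator
        self.tables: dict[str, list[int]] = {}   # request -> physical page ids
        self.length: dict[str, int] = {}         # request -> tokens cached

    def append_token(self, req: str) -> None:
        n = self.length.get(req, 0)
        if n % PAGE_SIZE == 0:                   # last page is full: allocate another
            self.tables.setdefault(req, []).append(self.free.pop())
        self.length[req] = n + 1

    def share_prefix(self, parent: str, child: str) -> None:
        # prefix caching: the child starts by pointing at the parent's pages
        # (real systems share only full pages and copy-on-write the last one)
        self.tables[child] = list(self.tables[parent])
        self.length[child] = self.length[parent]

kv = PagedKV(num_pages=1024)
for _ in range(40):                              # a 40-token system prompt -> 3 pages
    kv.append_token("req-A")
kv.share_prefix("req-A", "req-B")                # req-B reuses those pages for free
print(kv.tables)
```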
What you should remember
- Inference has two phases: prefill (FLOPs-bound) and decode (memory-bound). Neither needs the network on a single-host replica.
- KV-cache is the model's "memory" of the conversation — stored in GPU HBM, sized by context length.
- The network is mostly idle for single-host inference. Most traffic is the request and the token stream.
- Three patterns make inference network-heavy: multi-host models, disaggregated prefill/decode, and RAG retrieval.
- Long context = huge KV-cache — can be larger than the model itself at 32K+ tokens.
- KV-cache management (paged attention, prefix caching, offload) is the active research area shaping inference fabric design.
Next: RAG, MCP, and Inference Fabric Design → — retrieval patterns, the Model Context Protocol, and how to actually design an inference fabric.