RAG, MCP, and Inference Fabric Design
This page is the synthesis. RAG and MCP add traffic patterns that didn't exist a few years ago. The inference fabric you build today has to accommodate them, and you have to decide how much of it to share with the training fabric.
RAG — Retrieval-Augmented Generation
RAG is the dominant pattern for inference with external knowledge. Instead of asking a model "what's our company's vacation policy?" and hoping it memorized it, you:
- Embed the user query into a vector
- Search a vector database for the most similar documents
- Augment the prompt with those documents
- Generate the answer with the augmented context
The architecture:
```
User → Frontend
        │
        ├── Embed (small model, ~50ms)
        │
        ├── Vector DB query (Pinecone / Milvus / pgvector, ~20-50ms)
        │      ↓ top-K results
        │
        ├── Augment prompt (concatenate doc text + user query)
        │
        └── Inference (the big model, the slow part)
              │
              └── Response stream
```
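In code, the request path is short. A minimal sketch, with `Embedder`, `VectorDB`, and `LLM` as stand-ins for whatever clients you actually run (illustrative shapes, not a specific SDK):

```python
from typing import Protocol, Sequence

# Stand-ins for whatever embedding service, vector store, and model server
# you actually run; the interfaces are illustrative, not a specific SDK.
class Embedder(Protocol):
    def embed(self, text: str) -> list[float]: ...

class VectorDB(Protocol):
    def search(self, vector: list[float], top_k: int) -> Sequence[str]: ...

class LLM(Protocol):
    def generate(self, prompt: str) -> str: ...

def rag_answer(query: str, embedder: Embedder, db: VectorDB, llm: LLM,
               top_k: int = 4) -> str:
    query_vec = embedder.embed(query)                 # small model, ~50ms
    docs = db.search(vector=query_vec, top_k=top_k)   # vector DB, ~20-50ms, latency-sensitive
    context = "\n\n".join(docs)                       # a few KB of extra prefill context
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return llm.generate(prompt)                       # the big model, the slow part
```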
Network impact:
- Embedding model is usually small (BERT-class, 100M-1B params) — runs on a fraction of one GPU. Often co-located with the frontend.
- Vector DB queries are small but latency-sensitive: ~1-5 KB per query, and every millisecond of lookup latency adds directly to TTFT.
- The augmented prompt is larger (e.g., 4 docs × 2 KB = 8 KB extra context), making prefill take slightly longer.
Fabric implication: RAG wants the vector DB and the inference GPUs on a low-latency path — same datacenter, ideally same rack-level network. Cross-region RAG is painful.
MCP — Model Context Protocol
MCP (Model Context Protocol, Anthropic 2024) is a standard for connecting AI models to tools, data sources, and other systems. Think of it as "an HTTP for AI tool-use."
When a model calls a tool (database query, code execution, web search), the call goes through an MCP server. The model produces a tool-call request → MCP server executes → MCP server returns the result → model continues generating.
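Schematically, the serving-side loop looks like the sketch below. This is not the MCP SDK: `model_step` and `call_mcp` are stand-ins, and the message shapes are illustrative. The point is that every tool call is a blocking round trip.

```python
import json
from typing import Callable

# Schematic of the serving-side loop around MCP tool calls. model_step and
# call_mcp are stand-ins (one generation step of the model, one JSON-RPC
# round trip to an MCP server); the message shapes here are illustrative.
def run_with_tools(prompt: str,
                   model_step: Callable[[list[dict]], dict],
                   call_mcp: Callable[[dict], dict],
                   max_tool_calls: int = 20) -> str:
    messages = [{"role": "user", "content": prompt}]
    for call_id in range(max_tool_calls):
        step = model_step(messages)        # model emits either text or a tool-call request
        if step["type"] == "text":
            return step["content"]         # final answer, no more tools needed
        # Tool call: generation is blocked until the MCP server answers, so
        # every round trip to the server adds directly to end-to-end latency.
        request = {"jsonrpc": "2.0", "id": call_id, "method": "tools/call",
                   "params": {"name": step["tool"], "arguments": step["arguments"]}}
        result = call_mcp(request)
        messages.append({"role": "tool", "content": json.dumps(result)})
    return "tool-call budget exhausted"
```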
Traffic pattern:
- Request/response, much like HTTP
- Small payloads (JSON, a few KB)
- Multi-turn — one inference can trigger 5-20 tool calls
- Latency-sensitive — each tool call blocks the model
Fabric implication: MCP servers should be close to the inference cluster. Same datacenter, low-latency path. If MCP calls cross WAN, every tool call adds tens of ms.
Designing an inference fabric
You have a few architectural decisions to make.
Decision 1: Share fabric with training?
Option A: One big fabric, training and inference both on it.
Pros: simpler, fewer cables, infrastructure team manages one thing. Cons: training's elephant flows will starve inference's latency-critical small flows. PFC/ECN tuned for training is wrong for inference.
Option B: Separate fabrics.
Pros: each fabric tuned for its workload. Inference can run on lower-cost gear. Cons: more infrastructure, more teams.
Option C: Same physical fabric, different priority classes / VLANs.
Pros: shared hardware, isolated traffic. Best of both worlds in theory. Cons: requires careful QoS engineering; hard to get right at scale.
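To make option C concrete, the class map it implies might look like the sketch below. DSCP values and queue treatments are placeholders, not a vendor recipe; the point is that inference request traffic, storage, and training bulk flows land in different queues.

```python
# Illustrative traffic-class plan for a shared fabric (option C).
# DSCP values and queue treatments are placeholders, not a vendor recipe.
TRAFFIC_CLASSES = {
    "inference-requests": {"dscp": 46, "queue": "strict priority",  "note": "TTFT-critical small flows"},
    "storage":            {"dscp": 26, "queue": "guaranteed share", "note": "checkpoint / dataset reads"},
    "training-bulk":      {"dscp": 10, "queue": "remainder",        "note": "elephant flows, ECN-marked"},
    "management":         {"dscp": 0,  "queue": "best effort",      "note": "telemetry, control plane"},
}
```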
Most large operators choose B for the GPU fabrics and fall back to C-style sharing for management and storage traffic. Inference clusters run on standard DC Ethernet with optional RDMA islands.
Decision 2: Disaggregate prefill / decode?
This is the newer choice. If you do:
- Prefill pool — high FLOPs density (H100/B200), few cards needed per request, sustained throughput pattern
- Decode pool — high memory bandwidth (also H100/B200), more cards per request, latency-sensitive
- Inter-pool network — RDMA, big enough to move multi-GB KV-cache per request
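To see why that inter-pool link is sized in gigabytes, here is a rough KV-cache sizing for a hypothetical 70B-class model with grouped-query attention (the parameters are illustrative; plug in your own model's config):

```python
# Rough KV-cache size for one request (illustrative numbers for a
# hypothetical 70B-class model with grouped-query attention).
layers = 80          # transformer layers
kv_heads = 8         # KV heads (GQA)
head_dim = 128       # per-head dimension
bytes_per_elem = 2   # fp16 / bf16
seq_len = 8192       # prompt + generated tokens so far

per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem   # K and V
total = per_token * seq_len
print(f"{per_token / 1024:.0f} KiB per token, {total / 2**30:.1f} GiB per request")
# -> 320 KiB per token, ~2.5 GiB for an 8K-token request
```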
If you don't disaggregate, prefill and decode run co-located on the same GPUs. Simpler operationally, but you give up some of the utilization gains that disaggregation buys.
Decision 3: Where does the vector DB live?
For RAG, the vector DB sits on the hot path of every request. Options:
- Same cluster as inference — lowest latency, smaller scale.
- Adjacent cluster, same DC — common pattern for prod RAG at scale.
- Managed service (Pinecone, etc.) — easiest to operate, adds WAN latency.
For high-throughput RAG, you want the vector DB co-located: retrieval latency lands directly on TTFT, and users notice.
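A back-of-the-envelope TTFT budget makes the placement choice concrete. The numbers below are illustrative, reusing the rough latencies from the RAG diagram above plus an assumed prefill time:

```python
# Rough TTFT budget for a RAG request (illustrative numbers).
embed_ms = 50        # small embedding model
prefill_ms = 400     # assumed prefill of the augmented prompt on the big model

for placement, db_ms in {"co-located": 5, "same DC": 30, "managed, cross-WAN": 100}.items():
    ttft_ms = embed_ms + db_ms + prefill_ms
    print(f"{placement:>18}: vector DB {db_ms:3d} ms -> TTFT ~{ttft_ms} ms")
```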
Decision 4: Edge vs centralized inference?
For consumer-facing applications (chatbots, search), you might run inference replicas in multiple regions to reduce user-perceived latency. This adds:
- Model replication across regions (one-time cost per release)
- Routing decisions (which region serves which user)
- KV-cache locality (sticky sessions per user-region pair)
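The routing piece can be as simple as a stable hash of the session ID, so every follow-up turn lands on the replica that already holds that session's KV cache. A minimal sketch (a real router also handles replica health, draining, and rebalancing):

```python
import hashlib

# Sticky session -> replica routing for KV-cache locality (illustrative
# sketch; a real router also handles health checks and rebalancing).
REPLICAS = ["replica-0", "replica-1", "replica-2", "replica-3"]

def pick_replica(session_id: str) -> str:
    # Stable hash: every turn of a conversation maps to the same replica,
    # which is where its KV cache (and any prefix cache) already lives.
    digest = hashlib.sha256(session_id.encode()).digest()
    return REPLICAS[int.from_bytes(digest[:8], "big") % len(REPLICAS)]
```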
Most large inference workloads are still centralized — the math says it's cheaper to serve everyone from a few mega-clusters than to distribute. But this is changing as models get smaller and edge GPUs get more capable.
What this curriculum picks
For an inference fabric we'd recommend by default:
- Standard DC Ethernet for the inference pool (no RoCE v2 unless you have multi-host models or disaggregated serving)
- 100/200 G per server (plenty for token streaming and RAG)
- Vector DB co-located with inference GPUs, same datacenter
- L7 load balancer (Envoy, HAProxy) for request routing — not a hardware LB
- Sticky sessions at the routing layer for KV-cache locality
- Per-replica telemetry for TTFT, TPOT, and KV-cache hit rate — these are your golden signals
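On that last bullet, a minimal sketch of exporting the golden signals per replica, assuming a Python-side serving process and prometheus_client (metric names and buckets here are illustrative):

```python
from prometheus_client import Gauge, Histogram, start_http_server

# Golden signals for an inference replica (metric names and buckets are illustrative).
TTFT = Histogram("inference_ttft_seconds", "Time to first token",
                 labelnames=("replica",),
                 buckets=(0.05, 0.1, 0.25, 0.5, 1.0, 2.0, 5.0))
TPOT = Histogram("inference_tpot_seconds", "Time per output token",
                 labelnames=("replica",),
                 buckets=(0.005, 0.01, 0.02, 0.05, 0.1, 0.25))
KV_HIT_RATE = Gauge("inference_kv_cache_hit_ratio", "KV / prefix cache hit ratio",
                    labelnames=("replica",))

def record_request(replica: str, ttft_s: float, tpot_s: float) -> None:
    TTFT.labels(replica=replica).observe(ttft_s)
    TPOT.labels(replica=replica).observe(tpot_s)

if __name__ == "__main__":
    start_http_server(9400)   # scrape endpoint for the metrics above
```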
If you grow into multi-host models or disaggregated prefill/decode, that's when you add RDMA islands for the specific traffic that needs it.
What you should remember
- RAG adds vector DB traffic — small but latency-sensitive. Co-locate the DB with inference GPUs.
- MCP is the standard for model-tool calls. MCP servers should be local to inference.
- Inference fabric ≠ training fabric. Most inference is fine on standard Ethernet.
- Decisions: share fabric with training? disaggregate prefill/decode? where does the vector DB live?
- At scale, most operators run separate fabrics for training and inference.
- Watch TTFT, TPOT, and KV-cache hit rate — the golden signals for inference.
Next: Production Operations → — running this in production. RCA, telemetry, common failure modes, the playbooks that survive 3 AM.