RAG, MCP, and Inference Fabric Design
This page is the synthesis. RAG and MCP add traffic patterns that didn't exist a few years ago. The inference fabric you build today has to accommodate them, and you have to decide how much of it to share with the training fabric.
RAG — Retrieval-Augmented Generation
RAG is the dominant pattern for inference with external knowledge. Instead of asking a model "what's our company's vacation policy?" and hoping it memorized it, you:
- Embed the user query into a vector
- Search a vector database for the most similar documents
- Augment the prompt with those documents
- Generate the answer with the augmented context
The architecture:
```
User → Frontend
        │
        ├── Embed (small model, ~50ms)
        │
        ├── Vector DB query (Pinecone / Milvus / pgvector, ~20-50ms)
        │      ↓ top-K results
        │
        ├── Augment prompt (concatenate doc text + user query)
        │
        └── Inference (the big model, the slow part)
              │
              └── Response stream
```
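In code, the request path is short. A minimal sketch, with `Embedder`, `VectorDB`, and `LLM` as stand-ins for whatever clients you actually run (illustrative shapes, not a specific SDK):

```python
from typing import Protocol, Sequence

# Stand-ins for whatever embedding service, vector store, and model server
# you actually run; the interfaces are illustrative, not a specific SDK.
class Embedder(Protocol):
    def embed(self, text: str) -> list[float]: ...

class VectorDB(Protocol):
    def search(self, vector: list[float], top_k: int) -> Sequence[str]: ...

class LLM(Protocol):
    def generate(self, prompt: str) -> str: ...

def rag_answer(query: str, embedder: Embedder, db: VectorDB, llm: LLM,
               top_k: int = 4) -> str:
    query_vec = embedder.embed(query)                 # small model, ~50ms
    docs = db.search(vector=query_vec, top_k=top_k)   # vector DB, ~20-50ms, latency-sensitive
    context = "\n\n".join(docs)                       # a few KB of extra prefill context
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return llm.generate(prompt)                       # the big model, the slow part
```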
Network impact:
- Embedding model is usually small (BERT-class, 100M-1B params) — runs on a fraction of one GPU. Often co-located with the frontend.
- Vector DB queries are small but latency-sensitive: ~1-5 KB per query, and every millisecond of lookup latency adds directly to TTFT.
- The augmented prompt is larger (e.g., 4 docs × 2 KB = 8 KB extra context), making prefill take slightly longer.
Fabric implication: RAG wants the vector DB and the inference GPUs on a low-latency path — same datacenter, ideally same rack-level network. Cross-region RAG is painful.
MCP — Model Context Protocol
MCP (Model Context Protocol, Anthropic 2024) is a standard for connecting AI models to tools, data sources, and other systems. Think of it as "an HTTP for AI tool-use."
When a model calls a tool (database query, code execution, web search), the call goes through an MCP server. The model produces a tool-call request → MCP server executes → MCP server returns the result → model continues generating.
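Schematically, the serving-side loop looks like the sketch below. This is not the MCP SDK: `model_step` and `call_mcp` are stand-ins, and the message shapes are illustrative. The point is that every tool call is a blocking round trip.

```python
import json
from typing import Callable

# Schematic of the serving-side loop around MCP tool calls. model_step and
# call_mcp are stand-ins (one generation step of the model, one JSON-RPC
# round trip to an MCP server); the message shapes here are illustrative.
def run_with_tools(prompt: str,
                   model_step: Callable[[list[dict]], dict],
                   call_mcp: Callable[[dict], dict],
                   max_tool_calls: int = 20) -> str:
    messages = [{"role": "user", "content": prompt}]
    for call_id in range(max_tool_calls):
        step = model_step(messages)        # model emits either text or a tool-call request
        if step["type"] == "text":
            return step["content"]         # final answer, no more tools needed
        # Tool call: generation is blocked until the MCP server answers, so
        # every round trip to the server adds directly to end-to-end latency.
        request = {"jsonrpc": "2.0", "id": call_id, "method": "tools/call",
                   "params": {"name": step["tool"], "arguments": step["arguments"]}}
        result = call_mcp(request)
        messages.append({"role": "tool", "content": json.dumps(result)})
    return "tool-call budget exhausted"
```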
Traffic pattern:
- Request/response, much like HTTP
- Small payloads (JSON, a few KB)
- Multi-turn — one inference can trigger 5-20 tool calls
- Latency-sensitive — each tool call blocks the model
Fabric implication: MCP servers should be close to the inference cluster. Same datacenter, low-latency path. If MCP calls cross WAN, every tool call adds tens of ms.
Designing an inference fabric
You have a few architectural decisions to make.
Decision 1: Share fabric with training?
Option A: One big fabric, training and inference both on it.
Pros: simpler, fewer cables, infrastructure team manages one thing. Cons: training's elephant flows will starve inference's latency-critical small flows. PFC/ECN tuned for training is wrong for inference.
Option B: Separate fabrics.
Pros: each fabric tuned for its workload. Inference can run on lower-cost gear. Cons: more infrastructure, more teams.
Option C: Same physical fabric, different priority classes / VLANs.
Pros: shared hardware, isolated traffic. Best of both worlds in theory. Cons: requires careful QoS engineering; hard to get right at scale.
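To make option C concrete, the class map it implies might look like the sketch below. DSCP values and queue treatments are placeholders, not a vendor recipe; the point is that inference request traffic, storage, and training bulk flows land in different queues.

```python
# Illustrative traffic-class plan for a shared fabric (option C).
# DSCP values and queue treatments are placeholders, not a vendor recipe.
TRAFFIC_CLASSES = {
    "inference-requests": {"dscp": 46, "queue": "strict priority",  "note": "TTFT-critical small flows"},
    "storage":            {"dscp": 26, "queue": "guaranteed share", "note": "checkpoint / dataset reads"},
    "training-bulk":      {"dscp": 10, "queue": "remainder",        "note": "elephant flows, ECN-marked"},
    "management":         {"dscp": 0,  "queue": "best effort",      "note": "telemetry, control plane"},
}
```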
Most large operators choose B for the GPU fabrics and fall back to C-style sharing for management and storage traffic. Inference clusters run on standard DC Ethernet with optional RDMA islands.
Decision 2: Disaggregate prefill / decode?
This is the newer choice. If you do:
- Prefill pool — high FLOPs density (H100/B200), few cards needed per request, sustained throughput pattern
- Decode pool — high memory bandwidth (also H100/B200), more cards per request, latency-sensitive
- Inter-pool network — RDMA, big enough to move multi-GB KV-cache per request
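To see why that inter-pool link is sized in gigabytes, here is a rough KV-cache sizing for a hypothetical 70B-class model with grouped-query attention (the parameters are illustrative; plug in your own model's config):

```python
# Rough KV-cache size for one request (illustrative numbers for a
# hypothetical 70B-class model with grouped-query attention).
layers = 80          # transformer layers
kv_heads = 8         # KV heads (GQA)
head_dim = 128       # per-head dimension
bytes_per_elem = 2   # fp16 / bf16
seq_len = 8192       # prompt + generated tokens so far

per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem   # K and V
total = per_token * seq_len
print(f"{per_token / 1024:.0f} KiB per token, {total / 2**30:.1f} GiB per request")
# -> 320 KiB per token, ~2.5 GiB for an 8K-token request
```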
If you don't disaggregate, prefill and decode run co-located on the same GPUs. Simpler operationally, but you give up some of the utilization gains that disaggregation buys.
Decision 3: Where does the vector DB live?
For RAG, the vector DB sits on the hot path of every request. Options:
- Same cluster as inference — lowest latency, smaller scale.
- Adjacent cluster, same DC — common pattern for prod RAG at scale.
- Managed service (Pinecone, etc.) — easiest to operate, adds WAN latency.
For high-throughput RAG, you want the vector DB co-located: retrieval latency lands directly on TTFT, and users notice.
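A back-of-the-envelope TTFT budget makes the placement choice concrete. The numbers below are illustrative, reusing the rough latencies from the RAG diagram above plus an assumed prefill time:

```python
# Rough TTFT budget for a RAG request (illustrative numbers).
embed_ms = 50        # small embedding model
prefill_ms = 400     # assumed prefill of the augmented prompt on the big model

for placement, db_ms in {"co-located": 5, "same DC": 30, "managed, cross-WAN": 100}.items():
    ttft_ms = embed_ms + db_ms + prefill_ms
    print(f"{placement:>18}: vector DB {db_ms:3d} ms -> TTFT ~{ttft_ms} ms")
```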
Decision 4: Edge vs centralized inference?
For consumer-facing applications (chatbots, search), you might run inference replicas in multiple regions to reduce user-perceived latency. This adds:
- Model replication across regions (one-time cost per release)
- Routing decisions (which region serves which user)
- KV-cache locality (sticky sessions per user-region pair)
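The routing piece can be as simple as a stable hash of the session ID, so every follow-up turn lands on the replica that already holds that session's KV cache. A minimal sketch (a real router also handles replica health, draining, and rebalancing):

```python
import hashlib

# Sticky session -> replica routing for KV-cache locality (illustrative
# sketch; a real router also handles health checks and rebalancing).
REPLICAS = ["replica-0", "replica-1", "replica-2", "replica-3"]

def pick_replica(session_id: str) -> str:
    # Stable hash: every turn of a conversation maps to the same replica,
    # which is where its KV cache (and any prefix cache) already lives.
    digest = hashlib.sha256(session_id.encode()).digest()
    return REPLICAS[int.from_bytes(digest[:8], "big") % len(REPLICAS)]
```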
Most large inference workloads are still centralized — the math says it's cheaper to serve everyone from a few mega-clusters than to distribute. But this is changing as models get smaller and edge GPUs get more capable.
What this curriculum picks
For an inference fabric we'd recommend by default:
- Standard DC Ethernet for the inference pool (no RoCE v2 unless you have multi-host models or disaggregated serving)
- 100/200 G per server (plenty for token streaming and RAG)
- Vector DB co-located with inference GPUs, same datacenter
- L7 load balancer (Envoy, HAProxy) for request routing — not a hardware LB
- Sticky sessions at the routing layer for KV-cache locality
- Per-replica telemetry for TTFT, TPOT, and KV-cache hit rate — these are your golden signals
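On that last bullet, a minimal sketch of exporting the golden signals per replica, assuming a Python-side serving process and prometheus_client (metric names and buckets here are illustrative):

```python
from prometheus_client import Gauge, Histogram, start_http_server

# Golden signals for an inference replica (metric names and buckets are illustrative).
TTFT = Histogram("inference_ttft_seconds", "Time to first token",
                 labelnames=("replica",),
                 buckets=(0.05, 0.1, 0.25, 0.5, 1.0, 2.0, 5.0))
TPOT = Histogram("inference_tpot_seconds", "Time per output token",
                 labelnames=("replica",),
                 buckets=(0.005, 0.01, 0.02, 0.05, 0.1, 0.25))
KV_HIT_RATE = Gauge("inference_kv_cache_hit_ratio", "KV / prefix cache hit ratio",
                    labelnames=("replica",))

def record_request(replica: str, ttft_s: float, tpot_s: float) -> None:
    TTFT.labels(replica=replica).observe(ttft_s)
    TPOT.labels(replica=replica).observe(tpot_s)

if __name__ == "__main__":
    start_http_server(9400)   # scrape endpoint for the metrics above
```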
If you grow into multi-host models or disaggregated prefill/decode, that's when you add RDMA islands for the specific traffic that needs it.
What you should remember
- RAG adds vector DB traffic — small but latency-sensitive. Co-locate the DB with inference GPUs.
- MCP is the standard for model-tool calls. MCP servers should be local to inference.
- Inference fabric ≠ training fabric. Most inference is fine on standard Ethernet.
- Decisions: share fabric with training? disaggregate prefill/decode? where does the vector DB live?
- At scale, most operators run separate fabrics for training and inference.
- Watch TTFT, TPOT, and KV-cache hit rate — the golden signals for inference.
Next: Production Operations → — running this in production. RCA, telemetry, common failure modes, the playbooks that survive 3 AM.