
RAG, MCP, and Inference Fabric Design

This page is the synthesis. RAG and MCP add traffic patterns that didn't exist a few years ago. The inference fabric you build today has to accommodate them — and decide what to share with the training fabric.


RAG — Retrieval-Augmented Generation

RAG is the dominant pattern for inference with external knowledge. Instead of asking a model "what's our company's vacation policy?" and hoping it memorized it, you:

  1. Embed the user query into a vector
  2. Search a vector database for the most similar documents
  3. Augment the prompt with those documents
  4. Generate the answer with the augmented context
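The four steps above can be sketched as a minimal pipeline. This is an illustrative skeleton only: `embed`, `vector_search`, and `generate` are hypothetical stand-ins for your embedding model, vector DB client, and inference endpoint, implemented here as toys so the control flow is visible.

```python
# Toy stand-ins (assumptions for illustration, not real client APIs):
def embed(text):                       # step 1: embedding model (~50 ms in practice)
    return [float(len(text))]          # trivial 1-d "vector"

_DOCS = [{"text": "Vacation policy: 20 days per year."},
         {"text": "Carry-over: up to 5 unused days."}]

def vector_search(vec, top_k):         # step 2: vector DB query (~20-50 ms in practice)
    return _DOCS[:top_k]

def generate(prompt):                  # step 4: the big model (the slow part)
    return "[model answer grounded in the retrieved context]"

def answer_with_rag(query, top_k=4):
    query_vec = embed(query)                          # 1. embed the query
    docs = vector_search(query_vec, top_k=top_k)      # 2. retrieve top-K docs
    context = "\n\n".join(d["text"] for d in docs)    # 3. augment the prompt
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return generate(prompt)                           # 4. generate
```

Note that steps 1-3 are pure overhead from the model's point of view: they all happen before prefill can start, so their latency adds directly to TTFT.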

The architecture:

User → Frontend
         ├── Embed (small model, ~50 ms)
         ├── Vector DB query (Pinecone / Milvus / pgvector, ~20-50 ms)
         │       ↓ top-K results
         ├── Augment prompt (concatenate doc text + user query)
         └── Inference (the big model, the slow part)
                 └── Response stream

Network impact:

  • Embedding model is usually small (BERT-class, 100M-1B params) — runs on a fraction of one GPU. Often co-located with the frontend.
  • Vector DB hits are small but latency-sensitive: ~1-5 KB per query, but each ms adds to TTFT.
  • The augmented prompt is larger (e.g., 4 docs × 2 KB = 8 KB extra context), making prefill take slightly longer.

Fabric implication: RAG wants the vector DB and the inference GPUs on a low-latency path — same datacenter, ideally same rack-level network. Cross-region RAG is painful.
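A back-of-envelope TTFT budget makes the co-location argument concrete. The embedding and vector DB numbers come from the figures above; the prefill time and the cross-region WAN round trip are assumptions chosen for illustration.

```python
# Illustrative TTFT budget for one RAG request.
embed_ms   = 50                 # embedding model, from the numbers above
prefill_ms = 400                # ASSUMED prefill time for the augmented prompt
vector_db_ms = {
    "same-DC":      30,         # ~20-50 ms, from the numbers above
    "cross-region": 30 + 70,    # plus an ASSUMED ~70 ms WAN round trip
}

for placement, db_ms in vector_db_ms.items():
    ttft = embed_ms + db_ms + prefill_ms
    print(f"{placement:>12}: TTFT ≈ {ttft} ms")
```

Under these assumptions, moving the vector DB cross-region inflates TTFT from roughly 480 ms to 550 ms before the model has produced a single token — which is why cross-region RAG is painful.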


MCP — Model Context Protocol

MCP (Model Context Protocol, Anthropic 2024) is a standard for connecting AI models to tools, data sources, and other systems. Think of it as "an HTTP for AI tool-use."

When a model calls a tool (database query, code execution, web search), the call goes through an MCP server. The model produces a tool-call request → MCP server executes → MCP server returns the result → model continues generating.

Traffic pattern:

  • Request/response, much like HTTP
  • Small payloads (JSON, a few KB)
  • Multi-turn — one inference can trigger 5-20 tool calls
  • Latency-sensitive — each tool call blocks the model
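The multi-turn, blocking nature of this pattern can be sketched as a simple driver loop. Everything here is a hypothetical shape for illustration — `model_step` and `call_tool` stand in for one inference turn and one MCP server round trip, and the message format is invented, not the actual MCP wire protocol.

```python
def run_with_tools(prompt, model_step, call_tool, max_calls=20):
    """Drive a model that may emit tool calls until it returns a final answer."""
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_calls):
        reply = model_step(messages)            # one inference turn
        if reply.get("tool_call") is None:
            return reply["content"]             # final answer, no more tools
        # The model is blocked here: the MCP server's round-trip latency
        # adds directly to end-to-end response time, once per tool call.
        result = call_tool(reply["tool_call"])
        messages.append({"role": "tool", "content": result})
    raise RuntimeError("tool-call budget exceeded")

# Toy model: emits one tool call, then answers once it sees the result.
def toy_model(messages):
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_call": {"name": "web_search", "args": {"q": "MCP"}}}
    return {"tool_call": None, "content": "done"}

def toy_tool(call):
    return f"result of {call['name']}"
```

Because the loop is serial, a request that makes 10 tool calls over a path with 30 ms of extra round-trip latency pays 300 ms more than the same request against a local MCP server.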

Fabric implication: MCP servers should be close to the inference cluster. Same datacenter, low-latency path. If MCP calls cross WAN, every tool call adds tens of ms.


Designing an inference fabric

You have a few architectural decisions to make.

Decision 1: Share fabric with training?

Option A: One big fabric, training and inference both on it.

Pros: simpler, fewer cables, infrastructure team manages one thing. Cons: training's elephant flows will starve inference's latency-critical small flows. PFC/ECN tuned for training is wrong for inference.

Option B: Separate fabrics.

Pros: each fabric tuned for its workload. Inference can run on lower-cost gear. Cons: more infrastructure, more teams.

Option C: Same physical fabric, different priority classes / VLANs.

Pros: shared hardware, isolated traffic. Best of both worlds in theory. Cons: requires careful QoS engineering; hard to get right at scale.

Most large operators choose B for the GPU fabrics and fall back to C (shared fabric with priority classes) for management and storage traffic. Inference clusters run on standard DC Ethernet, with optional RDMA islands.

Decision 2: Disaggregate prefill / decode?

This is the newer choice. If you do:

  • Prefill pool — high FLOPs density (H100/B200), few cards needed per request, sustained throughput pattern
  • Decode pool — high memory bandwidth (also H100/B200), more cards per request, latency-sensitive
  • Inter-pool network — RDMA, big enough to move multi-GB KV-cache per request

If you don't disaggregate, prefill and decode run co-located on the same GPUs. That's operationally simpler, but you give up some utilization, because the two phases stress different resources.
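To size the inter-pool network, it helps to estimate the KV-cache a prefill→decode handoff must move. A minimal sketch, assuming a 70B-class model with 80 layers, a GQA-compressed KV width of 1024, FP16 storage, and a 400 Gb/s RDMA link — all assumed figures, not measurements:

```python
def kv_cache_bytes(tokens, layers, kv_dim, dtype_bytes=2):
    # 2x for the separate K and V tensors stored per layer.
    return 2 * layers * kv_dim * dtype_bytes * tokens

# ASSUMED 70B-class model: 80 layers, GQA KV width 1024, FP16 (2 bytes).
per_req = kv_cache_bytes(tokens=4096, layers=80, kv_dim=1024, dtype_bytes=2)
print(f"KV cache for a 4K-token request: {per_req / 1e9:.2f} GB")

# Transfer time over a 400 Gb/s RDMA link (~50 GB/s usable):
transfer_ms = per_req / 50e9 * 1e3
print(f"transfer time at 400 Gb/s: {transfer_ms:.0f} ms")
```

Under these assumptions, each 4K-token request hands off roughly 1.3 GB, taking on the order of 25-30 ms per request at line rate — which is why the inter-pool link needs RDMA-class bandwidth rather than a shared front-end network.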

Decision 3: Where does the vector DB live?

For RAG, the vector DB is hot. Options:

  • Same cluster as inference — lowest latency, smaller scale.
  • Adjacent cluster, same DC — common pattern for prod RAG at scale.
  • Managed service (Pinecone, etc.) — easiest to operate, adds WAN latency.

For high-throughput RAG, you want the vector DB co-located: every extra millisecond of retrieval latency lands directly in TTFT, where users notice it.

Decision 4: Edge vs centralized inference?

For consumer-facing applications (chatbots, search), you might run inference replicas in multiple regions to reduce user-perceived latency. This adds:

  • Model replication across regions (one-time cost per release)
  • Routing decisions (which region serves which user)
  • KV-cache locality (sticky sessions per user-region pair)

Most large inference workloads are still centralized — the math says it's cheaper to serve everyone from a few mega-clusters than to distribute. But this is changing as models get smaller and edge GPUs get more capable.


What this curriculum picks

For an inference fabric we'd recommend by default:

  • Standard DC Ethernet for the inference pool (no RoCE v2 unless you have multi-host models or disaggregated serving)
  • 100/200 G per server (plenty for token streaming and RAG)
  • Vector DB co-located with inference GPUs, same datacenter
  • L7 load balancer (Envoy, HAProxy) for request routing — not a hardware LB
  • Sticky sessions at the routing layer for KV-cache locality
  • Per-replica telemetry for TTFT, TPOT, and KV-cache hit rate — these are your golden signals
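The two latency signals in that last bullet fall out of per-token timestamps your serving layer already has. A minimal sketch (the function name and timestamp format are illustrative, not from any particular serving stack):

```python
def ttft_tpot(request_start, token_times):
    """Derive the golden latency signals from per-token arrival times (seconds)."""
    ttft = token_times[0] - request_start      # time to first token
    if len(token_times) > 1:
        # TPOT: mean gap between consecutive tokens after the first.
        tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    else:
        tpot = 0.0
    return ttft, tpot

# Example: request at t=0, first token at 0.42 s, then one token every 30 ms.
times = [0.42 + 0.03 * i for i in range(50)]
ttft, tpot = ttft_tpot(0.0, times)
```

TTFT captures everything before the stream starts (retrieval, queueing, prefill); TPOT captures steady-state decode speed. Tracking them separately per replica tells you which half of the pipeline is regressing.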

If you grow into multi-host models or disaggregated prefill/decode, that's when you add RDMA islands for the specific traffic that needs it.


What you should remember

  • RAG adds vector DB traffic — small but latency-sensitive. Co-locate the DB with inference GPUs.
  • MCP is the standard for model-tool calls. MCP servers should be local to inference.
  • Inference fabric ≠ training fabric. Most inference is fine on standard Ethernet.
  • Decisions: share fabric with training? disaggregate prefill/decode? where does the vector DB live?
  • At scale, most operators run separate fabrics for training and inference.
  • Watch TTFT, TPOT, and KV-cache hit rate — the golden signals for inference.

Next: Production Operations → — running this in production. RCA, telemetry, common failure modes, the playbooks that survive 3 AM.