
Design Options — the landscape

Now you understand why the fabric is different. Next question: what shape do you build?

The industry has converged on four patterns. Each one optimizes a different axis — cost, scale, predictability, or operational simplicity — and each one has a sweet spot in cluster size. This page is the map: what each pattern is, when each one wins, and how to pick. The deep dives sit in the next two pages.

One-page reference titled "AI Fabric Design Options: The Landscape" comparing the four fabric design patterns. Top section: "The Four Patterns at a Glance" table with rows for ROD (Rail-Optimized: graceful 7/8 degradation, default for serious training, 1K–10K GPUs), RUD (Rail-Unified: all NICs on one ToR, hash polarization risks, under 256 GPUs), Scheduled Fabric (microsecond-bounded latency, when predictability matters, 1K to 8K GPUs), and Multi-Planar (separate leaves/spines/super-spines per rail, hyperscale fabrics, above 16K GPUs). Middle section: four numbered cards illustrating each pattern with a small topology diagram — ROD with 2 GPUs and NICs on Leaf 0/Leaf 1 ("Same-index GPUs share a rail"), RUD with NICs collapsing to one ToR Switch, Scheduled with packets sliced into cells and reassembled ("End-to-end credit scheduling"), and Multi-Planar with N parallel planes ("N parallel fabrics end-to-end"). Bottom section: "Multi-Tenancy: When Tenants Share the Fabric" with three columns — Separate Physical Fabrics (works for hyperscale CapEx, breaks for small clusters), VRF-per-Tenant (works for trusted inference, breaks for training because VRF doesn't isolate PFC/buffers), VXLAN-EVPN Overlay (works for front-end storage/mgmt, breaks for back-end due to latency overhead). Footer rule: at hyperscale, dedicated back-end fabrics per tenant with shared front-end; collectives don't share well.

The whole landscape on one card. The four sections below walk each pattern in depth.


1. The four patterns at a glance

| Pattern | One-line description | When it wins | Typical scale |
| --- | --- | --- | --- |
| Rail-Optimized Design (ROD) | Each GPU NIC lands on its own dedicated rail leaf; same-index GPUs share a rail. | Default for serious training clusters. | 1K–10K GPUs |
| Rail-Unified Design (RUD) | All NICs on one ToR per host: classic leaf-spine, just non-blocking. | Cost-sensitive, small clusters, or inference. | < 256 GPUs |
| Scheduled Fabric (DDC / VOQ) | Cell-based forwarding with end-to-end credit scheduling: no hashing, no drops. | Predictability matters more than vendor freedom. | 1K–8K GPUs |
| Multi-Planar Scale-Out | N parallel fabrics, each owning one rail end-to-end. | Hyperscale where single-fabric ECMP polarization breaks down. | > 16K GPUs |

2. ROD — the default

Each GPU NIC lands on its own dedicated rail leaf. NIC-0 across every host shares Rail Leaf 0, NIC-7 shares Rail Leaf 7: eight rails, eight independent fabrics. Same-index collectives never compete with other GPUs on the host, ECMP polarization drops sharply, and a leaf failure degrades gracefully: each host loses one rail and keeps 7/8 of its bandwidth. This is the default you'll see at 1K–10K GPUs across NVIDIA reference designs, Meta, and most enterprise builds.
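
The rail mapping and the 7/8 figure can be sketched in a few lines. This is a minimal illustration, assuming a single pod with exactly one leaf per rail and eight NICs per host; the function names are hypothetical and not taken from any vendor tooling.

```python
from __future__ import annotations

NICS_PER_HOST = 8  # NIC-0 .. NIC-7, as in the text


def rail_leaf(host: int, nic: int) -> int:
    """Return the rail leaf a NIC attaches to.

    In a rail-optimized pod the leaf depends only on the NIC index,
    never on the host, so `host` is accepted but deliberately unused.
    """
    if not 0 <= nic < NICS_PER_HOST:
        raise ValueError(f"nic must be in 0..{NICS_PER_HOST - 1}")
    return nic


def surviving_bandwidth_fraction(failed_leaves: set[int]) -> float:
    """Fraction of a host's NIC bandwidth left after some rail leaves fail."""
    healthy = [n for n in range(NICS_PER_HOST) if rail_leaf(0, n) not in failed_leaves]
    return len(healthy) / NICS_PER_HOST


# NIC-3 on any two hosts shares Rail Leaf 3.
assert rail_leaf(0, 3) == rail_leaf(41, 3) == 3

# Losing one rail leaf leaves 7/8 of each host's bandwidth.
print(surviving_bandwidth_fraction({5}))  # 0.875
```

Larger builds repeat this per pod and add spines per rail, which this sketch does not model.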

Deep dive on the next page: 3.3 Rail-Optimized Design (ROD) →.


3. RUD — the budget alternative

All NICs on one ToR per host. It's the classic leaf-spine you already know, just non-blocking. The fabric BOM is 30–40% cheaper because you collapse 8 rail leaves into 1 ToR, so each host cables to one switch instead of eight. The cost: any collective that leaves the ToR crosses the spine, which means hash polarization and elephant-flow pain show up early. RUD wins below ~256 GPUs; it loses above ~512 when collective traffic starts saturating spine uplinks.
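
To see why a handful of elephant flows hurts on spine uplinks, here is a small sketch of static ECMP: each flow is pinned to one uplink by a hash of its 5-tuple. It illustrates uneven hashing of a few large flows, not multi-stage polarization as such. The addresses, source ports, and the use of CRC32 as the hash are assumptions for illustration; real switches use their own hash functions.

```python
import zlib


def ecmp_uplink(flow: tuple, n_uplinks: int) -> int:
    """Pick an uplink by hashing the flow's 5-tuple (static ECMP)."""
    key = "|".join(str(field) for field in flow).encode()
    return zlib.crc32(key) % n_uplinks


def uplink_load(flows: list, n_uplinks: int) -> list:
    """Count how many flows land on each uplink."""
    load = [0] * n_uplinks
    for flow in flows:
        load[ecmp_uplink(flow, n_uplinks)] += 1
    return load


# Eight long-lived flows: (src IP, dst IP, protocol, src port, dst port).
# Protocol 17 is UDP; 4791 is the RoCEv2 UDP destination port.
flows = [("10.0.0.1", f"10.0.1.{i}", 17, 49152 + i, 4791) for i in range(8)]

load = uplink_load(flows, n_uplinks=8)
print(load)  # flows per uplink

# Any uplink carrying 2+ elephant flows is a hotspot while another sits idle.
print("idle uplinks:", load.count(0), "busiest uplink carries:", max(load))
```

With only eight flows and eight uplinks, a hash rarely places exactly one flow per link, and because each flow is long-lived, the imbalance persists for the life of the job.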


4. Scheduled Fabric — replace hashing with credits

Instead of hashing flows across links and praying, the destination switch grants credits before traffic flows. Packets get sliced into fixed-size cells, sprayed across all paths, reassembled at the egress. Result: no hash polarization, zero drops by design, and microsecond-bounded tail latency. The cost is ~1–5 µs per hop of scheduling overhead and a vendor-specific silicon dependency (Broadcom Jericho3-AI, Cisco Silicon One G200). Wins when predictability matters more than vendor freedom.
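
The slice, spray, and reassemble steps, plus the credit gate, can be sketched as follows. This is a toy model under stated assumptions: the 256-byte cell size, the round-robin spray, and all names here are hypothetical, and real scheduled-fabric silicon implements this in hardware with its own cell format.

```python
from __future__ import annotations

from dataclasses import dataclass

CELL_BYTES = 256  # hypothetical cell size, chosen only for illustration


@dataclass(frozen=True)
class Cell:
    seq: int        # position of this cell within the packet
    payload: bytes


def slice_into_cells(packet: bytes, cell_bytes: int = CELL_BYTES) -> list[Cell]:
    """Cut a packet into fixed-size cells (the last one may be shorter)."""
    return [
        Cell(seq, packet[off:off + cell_bytes])
        for seq, off in enumerate(range(0, len(packet), cell_bytes))
    ]


def spray(cells: list[Cell], n_links: int) -> list[list[Cell]]:
    """Spread cells round-robin across every fabric link: no per-flow hashing."""
    links: list[list[Cell]] = [[] for _ in range(n_links)]
    for cell in cells:
        links[cell.seq % n_links].append(cell)
    return links


def reassemble(links: list[list[Cell]]) -> bytes:
    """Egress side: put cells back in sequence order and rebuild the packet."""
    cells = sorted((c for link in links for c in link), key=lambda c: c.seq)
    return b"".join(c.payload for c in cells)


def send_with_credits(packet: bytes, credits: int, n_links: int):
    """Send only as many cells as the egress has granted; hold the rest."""
    cells = slice_into_cells(packet)
    return spray(cells[:credits], n_links), cells[credits:]


packet = bytes(range(256)) * 4 + b"tail"   # 1028 bytes, so 5 cells
links = spray(slice_into_cells(packet), n_links=3)
assert reassemble(links) == packet

sent, held = send_with_credits(packet, credits=3, n_links=3)
print(len(held))  # cells waiting at the ingress for more credits
```

Because the ingress never sends more than the egress granted, nothing has to be dropped in the fabric; the held cells are where the scheduling overhead shows up.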


5. Multi-Planar — for hyperscale

N parallel fabrics, each owning one rail end-to-end — separate leaves, separate spines, separate super-spines. Used past ~16K GPUs when single-fabric ECMP polarization becomes unmanageable no matter how good your adaptive routing is. NVIDIA DGX SuperPOD (8 planes), Meta GenAI clusters, and Google TPU pods (via optical circuit switches) all use multi-planar variants. The trade-off is N× switches, N× cabling, N× management surface — only worth it when the alternative is throughput loss to polarization.


6. Multi-tenancy — when tenants share the fabric

This question comes up early and the honest answer is uncomfortable: the back-end RoCE fabric is almost never multi-tenant in practice. The latency budget kills it. When operators do share, three patterns show up:

| Pattern | Works when | Breaks when |
| --- | --- | --- |
| Separate physical fabrics per tenant | Hyperscale, big tenants, you can afford the CapEx. | Smaller clusters where the per-tenant fabric sits idle most of the day. |
| VRF-per-tenant on a shared fabric | Small clusters, trusted tenants, mostly inference. | Training collectives: VRF doesn't isolate PFC or shared buffer pressure across tenants. |
| VXLAN-EVPN overlay | Standard on front-end (storage, mgmt, ingress). | Almost never on the back-end: encap adds latency and most NICs don't offload it cleanly for RoCE. |

The pattern at hyperscale is dedicated back-end fabrics per tenant, shared front-end via VXLAN-EVPN. Don't try to be clever here — collectives don't share well.


7. Pick your design

The decision is mostly driven by cluster size, then workload mix (training vs inference), then vendor preference and budget. Here's the short version you can use right now:

  • < 256 GPUs? RUD. Cheap, simple, fast enough.
  • 256–1K GPUs? ROD. Worth the cabling complexity at this scale.
  • 1K–10K GPUs? ROD + adaptive routing (or packet spraying). ECMP alone won't cut it.
  • > 16K GPUs? Multi-planar. Single-fabric polarization stops being fixable.
  • Predictability-sensitive at 1K–4K AND you're OK with vendor-specific silicon (Broadcom or Cisco)? Scheduled fabric. Predictability for the price of vendor lock-in.
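
The list above can be written down as a small lookup. This is a sketch, not a sizing tool, and it makes several assumptions: "1K" is read as 1,000 GPUs, boundary values go to the larger tier, and the scheduled-fabric branch is keyed on a predictability requirement, following section 4. The function and parameter names are hypothetical.

```python
from typing import Optional


def pick_design(
    gpus: int,
    needs_predictability: bool = False,
    accepts_vendor_lock_in: bool = False,
) -> Optional[str]:
    """Map cluster size (and two workload flags) to a fabric pattern.

    Returns None where the list gives no rule (between 10K and 16K GPUs).
    """
    if gpus <= 0:
        raise ValueError("gpus must be positive")

    # Scheduled fabric is the one exception that overrides the size tiers.
    if 1_000 <= gpus <= 4_000 and needs_predictability and accepts_vendor_lock_in:
        return "Scheduled fabric"

    if gpus < 256:
        return "RUD"
    if gpus < 1_000:
        return "ROD"
    if gpus <= 10_000:
        return "ROD + adaptive routing (or packet spraying)"
    if gpus > 16_000:
        return "Multi-planar"
    return None  # 10K-16K: not covered by the list


print(pick_design(128))                # RUD
print(pick_design(2_000, True, True))  # Scheduled fabric
print(pick_design(12_000))             # None
```

The `None` result is deliberate: the list jumps from 10K to 16K GPUs, so a cluster in that range needs the deep-dive pages rather than a rule of thumb.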

Anti-pattern

Co-mingling the back-end (RoCE) and front-end (storage + management) on the same switches. It looks like you're saving ports and CapEx — until the first training job runs and storage traffic shares buffers with collective traffic, PFC pauses propagate across both planes, and a benign storage hiccup stalls a 1K-GPU job. Two separate physical fabrics, always.


💡 What you should remember

🗺️ Four patterns, one decision: ROD, RUD, Scheduled, Multi-Planar. The choice is driven by cluster size first, then workload and budget.
🎯 ROD is the default: At 1K–10K GPUs with serious training, rail-optimized wins almost every time. Everything else is a deviation with a reason.
💰 RUD is the budget escape hatch: 30–40% cheaper fabric BOM, works below ~256 GPUs, falls apart above ~512.
🎟️ Scheduled fabric trades flexibility for predictability: Credit-based, zero-drop, 1–5 µs scheduling overhead, locked to Broadcom/Cisco silicon.
🛤️ Multi-planar is for past 16K GPUs: When polarization stops being fixable, build N parallel fabrics. N× everything.
🚫 Back-end RoCE is not multi-tenant: Separate physical fabrics per tenant at hyperscale, VRF for small/trusted clusters, EVPN overlay only on the front-end.
Never collapse back-end and front-end: One fabric for collectives, one for everything else. Always.

Next: Rail-Optimized Design (ROD) → — the deep dive on the default pattern: what wires to what, why blast radius gets better and worse at the same time, and how to size a pod.