
Rail-Optimized Topology

In a traditional DC, each server has one or two NICs, each connected to the same ToR. The ToR aggregates the whole rack.

An AI training server has 8 NICs, and in a rail-optimized topology, each NIC goes to its own dedicated leaf switch. That's the structural twist that makes AI fabrics look weird at first glance. This page explains why.


What is a rail?

A rail is one independent fabric, end to end:

  • All the NIC-0s in the cluster connect to Rail Leaf 0.
  • All the NIC-1s connect to Rail Leaf 1.
  • ...all the NIC-7s connect to Rail Leaf 7.

So an 8-NIC-per-server cluster has 8 separate rail fabrics, each one operationally independent. They share the same spine layer (or have their own spine pairs in larger designs), but at the leaf they're isolated.

[Figure: rail-optimized topology. Two servers, each with 8 GPUs and 8 NICs; NIC 0 from every server connects to Rail Leaf 0, NIC 1 to Rail Leaf 1, and so on, with Rail 0 highlighted.]
Rail 0 connects every server's NIC 0 to Rail Leaf 0. GPU 0 on Server A → GPU 0 on Server B uses only Rail 0 — no contention with other GPUs in the same server.
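
The wiring rule is mechanical enough to state as code. A minimal sketch in Python (the names are illustrative assumptions, not from any real tool):

```python
# Minimal sketch of the rail wiring rule: NIC i on every server plugs
# into rail leaf i. Names here are illustrative assumptions.

NICS_PER_SERVER = 8  # the dominant 8-GPU / 8-NIC server pattern

def rail_leaf_for(nic_index: int) -> str:
    """NIC i lands on rail leaf i, regardless of which server it's in."""
    assert 0 <= nic_index < NICS_PER_SERVER
    return f"rail-leaf-{nic_index}"

# GPU 0 on Server A and GPU 0 on Server B share a leaf;
# GPU 0 and GPU 7 on the same server never do.
assert rail_leaf_for(0) == "rail-leaf-0"
assert rail_leaf_for(0) != rail_leaf_for(7)
```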

Why rails exist

The naive design — all 8 NICs to one ToR — has two problems:

  1. The ToR becomes a bottleneck. 8 × 400G = 3.2 Tbps of GPU egress per server into a single switch, and the ToR aggregates every server in the rack. It has to fan all of that back out as uplinks to the spine layer. That's an absurd radix.
  2. Cross-GPU traffic on the same server uses the ToR. GPU 0 on Server A talking to GPU 0 on Server B goes through the same ToR as GPU 0 on Server A talking to GPU 7 on Server A. Hot spot.

The rail-optimized design fixes both:

  1. Each rail leaf only sees 1/8 of any one server's traffic — smaller radix per switch.
  2. Cross-GPU traffic at the same index (GPU 0 ↔ GPU 0) flows on its own rail. No contention from other GPUs on the same server.

The key insight: in a distributed training job, GPU 0 talks to GPU 0 (peer-to-peer for that rank). GPU 1 talks to GPU 1. Same index = same rail = isolated path.
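
Back-of-envelope numbers make the contrast concrete. A sketch assuming a 16-server group of 8 × 400G NICs (the group size is an assumption; the per-server 3.2 Tbps matches the figure above):

```python
# Per-switch load under both designs. 16 servers and 400G NICs are
# illustrative assumptions.

SERVERS = 16
NICS_PER_SERVER = 8
NIC_GBPS = 400

# Naive design: all 8 NICs of every server into one ToR.
per_server_gbps = NICS_PER_SERVER * NIC_GBPS   # 3_200 Gbps = 3.2 Tbps
tor_gbps = SERVERS * per_server_gbps           # 51_200 Gbps into one switch

# Rail design: each rail leaf sees exactly one NIC per server.
rail_leaf_gbps = SERVERS * NIC_GBPS            # 6_400 Gbps

print(f"single ToR must fan out {tor_gbps} Gbps of uplinks non-blocking")
print(f"each rail leaf handles {rail_leaf_gbps} Gbps "
      f"({rail_leaf_gbps / tor_gbps:.1%} of the ToR load)")
```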


Blast radius

Rails change failure modes in two ways.

What gets better

  • A single rail-leaf failure loses 1/8 of each server's bandwidth. The job slows to 7/8 speed but keeps running. Graceful degradation.
  • No cross-rail blast radius. A storm on Rail 3 doesn't touch Rails 0, 1, 2, 4, 5, 6, 7.

What gets worse

  • A whole-server failure still loses all 8 GPUs at once (NICs aren't shared across servers).
  • Spine failure affects all rails simultaneously (if spines are shared across rails).

Operational implication

You can drain one rail at a time for maintenance, and the job survives. That changes how you do upgrades.
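
A hedged sketch of that rail-at-a-time pattern. The drain/upgrade/undrain hooks are hypothetical placeholders for whatever automation you actually run; the invariant is that at most one rail is ever out, so the job never drops below 7/8 of its fabric bandwidth:

```python
# Rail-at-a-time maintenance loop. drain_rail, do_upgrade, and
# undrain_rail are hypothetical hooks, not a real API; the point is
# the ordering, which keeps at least 7 of 8 rails live at all times.

RAILS = 8

def remaining_bandwidth(rails_down: int) -> float:
    """Fraction of per-server fabric bandwidth still available."""
    return (RAILS - rails_down) / RAILS

def rolling_upgrade(drain_rail, do_upgrade, undrain_rail):
    for rail in range(RAILS):
        drain_rail(rail)                       # job degrades to 7/8 speed
        assert remaining_bandwidth(1) == 7 / 8
        do_upgrade(rail)                       # firmware, cabling, etc.
        undrain_rail(rail)                     # full speed before the next rail
```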


Pod sizing

A pod is a self-contained training fabric — typically sized so the topology is non-blocking at the rated GPU count.

Common sizes:

Pod size         Typical structure
128 GPUs         16 servers × 8 GPUs · 8 rail leaves · single leaf tier (no spine)
256 GPUs         32 servers × 8 GPUs · 8 rail leaves (32 server ports each) · spine layer optional
512–2048 GPUs    Multi-pod with 3-tier (leaf + spine + super-spine)
10K+ GPUs        Multiple pods stitched across super-spines; sometimes with intermediate aggregation
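
These sizes fall out of switch radix arithmetic. A sketch assuming a 64-port rail leaf split evenly between server downlinks and spine uplinks (both assumptions; real radices vary):

```python
# Hedged radix arithmetic behind the 256-GPU row. The 64-port leaf is
# an illustrative assumption.

LEAF_PORTS = 64
NICS_PER_SERVER = 8

downlinks = LEAF_PORTS // 2       # half the ports face servers (non-blocking)
uplinks = LEAF_PORTS - downlinks  # the other half face the spine
servers_per_pod = downlinks       # each rail leaf takes one NIC per server
gpus_per_pod = servers_per_pod * NICS_PER_SERVER

print(f"{servers_per_pod} servers × {NICS_PER_SERVER} GPUs = {gpus_per_pod} GPUs/pod")
# -> 32 servers × 8 GPUs = 256 GPUs per pod
```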

The super-spine is the tier that connects pods. It's where traffic crosses between pods for training jobs too large to fit in one. Inter-pod traffic is slower (more hops), so the scheduler tries to keep jobs within a single pod.
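
What "more hops" means concretely, as a hedged sketch for the 3-tier layout (hop counts vary with the actual design):

```python
# Illustrative switch-hop counts behind "inter-pod traffic is slower".
# Assumes the 3-tier layout above; real designs vary.

HOPS = {
    "same rail leaf": 1,              # leaf
    "same pod, via spine": 3,         # leaf -> spine -> leaf
    "cross-pod, via super-spine": 5,  # leaf -> spine -> super-spine -> spine -> leaf
}

for path, hops in HOPS.items():
    print(f"{path}: {hops} switch hop(s)")
```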


Pod ↔ rail interaction

Two combinations you'll see:

  1. One pod per rail. Each rail gets its own pod: Pod 0 is all the Rail 0 leaves plus their spines. A job runs across "pods," but each pod is a single rail. This simplifies operations.
  2. Rails span multiple pods. Rail 0 is split into multiple sub-pods, connected through a super-spine. More common at very large scale.

Either way, the GPU 0 ↔ GPU 0 dedicated path property holds within a single rail's reach.


What you should remember

  • One rail per GPU index — NIC 0 to Rail Leaf 0, NIC 1 to Rail Leaf 1, etc.
  • 8-GPU server = 8 NICs = 8 rails (the dominant pattern).
  • Same-index traffic stays on the same rail — GPU 0 ↔ GPU 0 sees no contention from other GPUs on the same server.
  • Single rail failure = 1/8 bandwidth loss, job continues at 7/8 speed. Graceful.
  • Pod sizes come in powers of 2 (128, 256, 512, 1024, 2048), chosen so the topology stays non-blocking.

Next: Hash Polarization & Elephant Flows → why traditional ECMP breaks under synchronized collectives, and what hyperscalers do about it.