Rail-Optimized Topology

In a traditional data center, each server has one or two NICs, all connected to the same top-of-rack (ToR) switch. The ToR aggregates the whole rack.

An AI training server typically has 8 NICs (one per GPU), and in a rail-optimized topology each NIC connects to its own dedicated leaf switch. That structural twist is what makes AI fabrics look odd at first sight. This page explains why.

What is a rail?

A rail is one independent fabric, end to end:

All the NIC-0s in the cluster connect to Rail Leaf 0.
All the NIC-1s connect to Rail Leaf 1.
...all the NIC-7s connect to Rail Leaf 7.

So a cluster with 8 NICs per server has 8 separate rail fabrics, each operationally independent and isolated from the others at the leaf. Above the leaf, the rails either share one spine layer or, in larger designs, each get their own spine pairs.
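The wiring rule above can be sketched in a few lines. This is a minimal illustration, not a provisioning tool: the server names and the `rail-leaf-N` naming scheme are hypothetical, and the sketch assumes a single leaf switch per rail.

```python
from collections import defaultdict

NICS_PER_SERVER = 8  # from the text: 8 NICs per server


def build_rail_wiring(servers, nics_per_server=NICS_PER_SERVER):
    """Map each rail leaf to the (server, nic_index) ports that plug into it."""
    wiring = defaultdict(list)
    for server in servers:
        for nic in range(nics_per_server):
            # Rail rule: NIC k on every server connects to Rail Leaf k.
            wiring[f"rail-leaf-{nic}"].append((server, nic))
    return dict(wiring)


# Hypothetical two-server cluster.
wiring = build_rail_wiring(["server-a", "server-b"])
print(len(wiring))            # number of rails
print(wiring["rail-leaf-0"])  # every server's NIC 0
```

Each rail leaf ends up with exactly one port per server, which is why adding servers grows every rail evenly.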

Figure: Rail-optimized topology. Two servers, each with 8 GPUs and 8 NICs. NIC 0 on every server connects to Rail Leaf 0, NIC 1 to Rail Leaf 1, and so on. The orange-highlighted Rail 0 shows that GPU 0 on Server A reaches GPU 0 on Server B through Rail Leaf 0 only, with no contention from other GPUs in the same server.

Why rails exist

The naive design — all 8 NICs to one ToR — has two problems:

The ToR becomes a bottleneck. 8 × 400G = 3.2 Tbps of GPU egress per server lands on a single switch, and to stay non-blocking that switch needs matching uplink capacity to the spine layer for every server it hosts. That demands an impractically large radix.
Traffic from every GPU on a server shares the ToR. GPU 0 on Server A talking to GPU 0 on Server B crosses the same switch as the flows from GPUs 1 through 7 on Server A. That creates a hot spot.

The rail-optimized design fixes both:

Each rail leaf only sees 1/8 of any one server's traffic — smaller radix per switch.
Cross-GPU traffic at the same index (GPU 0 ↔ GPU 0) flows on its own rail. No contention from other GPUs on the same server.
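The bandwidth arithmetic behind this comparison is easy to check. The sketch below uses the 400G NIC speed from the text. The function names and the 16-server example are illustrative assumptions, and it only counts server-facing bandwidth, ignoring uplinks and oversubscription.

```python
NIC_GBPS = 400        # per-NIC speed used in the text
NICS_PER_SERVER = 8   # 8 NICs per server


def tor_load_gbps(servers):
    """Naive design: all 8 NICs of every server land on one ToR."""
    return servers * NICS_PER_SERVER * NIC_GBPS


def rail_leaf_load_gbps(servers):
    """Rail design: each rail leaf takes exactly one NIC per server."""
    return servers * NIC_GBPS


for servers in (1, 16):  # 16 servers is an illustrative size
    print(servers, tor_load_gbps(servers), rail_leaf_load_gbps(servers))
```

For any server count, a rail leaf carries one eighth of what a single ToR would have to carry.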

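The key insight: in a distributed training job, inter-server traffic is mostly between GPUs at the same index. GPU 0 on one server talks to GPU 0 on the other servers (its peers at the same local rank), and GPU 1 talks to GPU 1. Same index = same rail = isolated path.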
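A small sketch makes the "same index = same rail" rule concrete. It assumes global ranks are laid out contiguously per server (rank = server × 8 + GPU index), which is a common convention but not something this page specifies. The function names are hypothetical, and same-server pairs are outside what the sketch models.

```python
GPUS_PER_SERVER = 8


def locate(global_rank, gpus_per_server=GPUS_PER_SERVER):
    """Split a global rank into (server index, local GPU index)."""
    return divmod(global_rank, gpus_per_server)


def rail_for_flow(src_rank, dst_rank):
    """Return the rail an inter-server flow stays on, or None if it has none."""
    src_server, src_gpu = locate(src_rank)
    dst_server, dst_gpu = locate(dst_rank)
    if src_server == dst_server:
        return None  # same server: not an inter-server flow
    if src_gpu != dst_gpu:
        return None  # different index: the two NICs sit on different rail leaves
    return src_gpu   # same index: NIC k -> Rail Leaf k -> NIC k


print(rail_for_flow(0, 8))   # GPU 0 on server 0 -> GPU 0 on server 1
print(rail_for_flow(0, 15))  # GPU 0 on server 0 -> GPU 7 on server 1
```

The first call returns rail 0. The second returns `None` because the endpoints hang off different rail leaves, so that flow would need a path above the leaf tier.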

Blast radius

Rails change failure modes in two ways.

What gets better

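A single rail-leaf failure removes 1/8 of each server's network bandwidth. The job keeps running at roughly 7/8 speed in the best case, provided it can shift the affected GPU's traffic onto the remaining rails. Graceful degradation.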
No cross-rail blast radius. A storm on Rail 3 doesn't touch Rails 0, 1, 2, 4, 5, 6, 7.
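The bandwidth side of this can be written as a one-line fraction. The sketch below is only the arithmetic: it assumes 8 rails of 400G each, as elsewhere on this page, and it says nothing about how a real job reacts to the failure. The function name is hypothetical.

```python
from fractions import Fraction

TOTAL_RAILS = 8
NIC_GBPS = 400  # per-NIC speed used in the text


def remaining_fraction(failed_rails, total_rails=TOTAL_RAILS):
    """Fraction of a server's network bandwidth left after rail failures."""
    failed = set(failed_rails)
    if not failed <= set(range(total_rails)):
        raise ValueError("unknown rail index")
    return Fraction(total_rails - len(failed), total_rails)


frac = remaining_fraction({3})
print(frac)                                  # 7/8
print(float(frac * TOTAL_RAILS * NIC_GBPS))  # 2800.0 Gbps left per server
```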

What gets worse

A whole-server failure still loses all 8 GPUs at once (NICs aren't shared across servers).
Spine failure affects all rails simultaneously (if spines are shared across rails).

Operational implication

You can drain one rail at a time for maintenance, and the job survives. That changes how you do upgrades.

Pod sizing

A pod is a self-contained training fabric — typically sized so the topology is non-blocking at the rated GPU count.

Common sizes:

Pod size	Typical structure
128 GPUs	16 servers × 8 GPUs · 8 rail leaves · leaf-only (no spine tier)
256 GPUs	32 servers × 8 GPUs · 8 rail leaves × 32 server-facing ports each · spine layer optional
512–2048 GPUs	Multi-pod with 3-tier (leaf + spine + super-spine)
10K+ GPUs	Multiple pods stitched across super-spines; sometimes with intermediate aggregation
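The server counts in the first two rows follow directly from the rail rule. Here is a minimal sketch of that arithmetic, assuming 8 GPUs per server and one leaf switch per rail. The function name and dictionary keys are hypothetical, and uplink ports are not counted.

```python
def pod_shape(gpus, gpus_per_server=8):
    """Derive server and leaf-port counts for a single-leaf-per-rail pod."""
    if gpus % gpus_per_server:
        raise ValueError("GPU count must be a multiple of GPUs per server")
    servers = gpus // gpus_per_server
    return {
        "servers": servers,
        "rail_leaves": gpus_per_server,         # one rail per GPU index
        "server_ports_per_rail_leaf": servers,  # one NIC per server per rail
    }


print(pod_shape(128))
print(pod_shape(256))
```

Under these assumptions, every rail leaf needs as many server-facing ports as the pod has servers, which is what ties pod size to switch radix.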

The super-spine is the tier that connects pods. Training jobs larger than one pod must cross it. Inter-pod traffic takes more hops and is therefore slower, so the scheduler tries to keep each job within a single pod.

Pod ↔ rail interaction

Two combinations you'll see:

One pod per rail. Each rail is its own pod: Pod 0 is all the Rail 0 leaves and their spines. A job runs across several "pods", but each pod is a single rail. This simplifies operations.
Rails span multiple pods. Rail 0 is split into multiple sub-pods connected through a super-spine. This is more common at very large scale.

Either way, the GPU 0 ↔ GPU 0 dedicated path property holds within a single rail's reach.

What you should remember

One rail per GPU index — NIC 0 to Rail Leaf 0, NIC 1 to Rail Leaf 1, etc.
8-GPU server = 8 NICs = 8 rails (the dominant pattern).
Same-index traffic stays on the same rail — GPU 0 ↔ GPU 0 sees no contention from other GPUs on the same server.
Single rail failure = 1/8 bandwidth loss; the job continues at roughly 7/8 speed. Graceful.
Pod sizes are powers of 2 (128, 256, 512, 1024, 2048), chosen so the topology stays non-blocking.

Next: Hash Polarization & Elephant Flows (why traditional ECMP breaks under synchronized collectives, and what hyperscalers do about it).
