Rail-Optimized Topology
In a traditional DC, each server has one or two NICs, connected to the same ToR. The ToR aggregates the whole rack.
An AI training server has 8 NICs, and in a rail-optimized topology, each NIC goes to its own dedicated leaf switch. That's the structural twist that makes AI fabrics look weird at first sight. This page explains why.
What is a rail?
A rail is one independent fabric, end to end:
- All the NIC-0s in the cluster connect to Rail Leaf 0.
- All the NIC-1s connect to Rail Leaf 1.
- ...all the NIC-7s connect to Rail Leaf 7.
So an 8-NIC-per-server cluster has 8 separate rail fabrics, each one operationally independent. They share the same spine layer (or have their own spine pairs in larger designs), but at the leaf they're isolated.
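To pin down the mapping, here's a minimal sketch in Python. The function name `rail_leaf_for` and the constant are illustrative, not from any real tool:

```python
NUM_RAILS = 8  # one rail per NIC index on an 8-NIC server

def rail_leaf_for(server: int, nic: int) -> str:
    """The rail leaf depends only on the NIC index; the server is irrelevant."""
    assert 0 <= nic < NUM_RAILS
    return f"rail-leaf-{nic}"

# Same-index NICs on different servers share one rail (and one leaf):
assert rail_leaf_for(server=0, nic=3) == rail_leaf_for(server=41, nic=3)
# Different indices on the same server land on isolated rails:
assert rail_leaf_for(server=0, nic=0) != rail_leaf_for(server=0, nic=7)
```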
Why rails exist
The naive design — all 8 NICs to one ToR — has two problems:
- The ToR becomes a bottleneck. 8 × 400G = 3.2 Tbps of GPU egress per server into a single switch, and to stay non-blocking the ToR must match every terabit of downlink with uplink to the spine layer. That's an absurd radix.
- Flows from every GPU index share the ToR. GPU 0 on Server A talking to GPU 0 on Server B goes through the same switch as GPU 7 on Server A talking to GPU 7 on Server B. Hot spot.
The rail-optimized design fixes both:
- Each rail leaf only sees 1/8 of any one server's traffic — smaller radix per switch (the sketch below puts numbers on this).
- Cross-GPU traffic at the same index (GPU 0 ↔ GPU 0) flows on its own rail. No contention from other GPUs on the same server.
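A back-of-the-envelope comparison, assuming 400G NICs (as above), a 1:1 non-blocking uplink ratio, and an illustrative 16-server group:

```python
# Radix comparison for one group of servers: single ToR vs. rail leaves.
NIC_GBPS = 400
NICS_PER_SERVER = 8
SERVERS = 16  # illustrative group size

# Naive design: every NIC of every server lands on one ToR.
tor_down_gbps = SERVERS * NICS_PER_SERVER * NIC_GBPS  # 51,200 Gbps down
tor_up_gbps = tor_down_gbps                           # plus the same again in uplinks

# Rail-optimized: each of the 8 rail leaves sees one NIC per server.
rail_down_gbps = SERVERS * NIC_GBPS                   # 6,400 Gbps down
rail_up_gbps = rail_down_gbps

print(f"single ToR:    {tor_down_gbps} Gbps down + {tor_up_gbps} Gbps up")
print(f"per rail leaf: {rail_down_gbps} Gbps down + {rail_up_gbps} Gbps up")
# Each rail leaf handles exactly 1/8 of the naive ToR's load.
```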
The key insight: in a distributed training job, GPU 0 talks to GPU 0 (peer-to-peer for that rank). GPU 1 talks to GPU 1. Same index = same rail = isolated path.
Blast radius
Rails change failure modes in two ways.
What gets better
- A single rail-leaf failure loses 1/8 of each server's bandwidth. The job slows to 7/8 speed but keeps running. Graceful degradation (the sketch below makes the arithmetic explicit).
- No cross-rail blast radius. A storm on Rail 3 doesn't touch Rails 0, 1, 2, 4, 5, 6, 7.
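The degradation arithmetic, assuming the job is bandwidth-bound so speed tracks surviving rails:

```python
# Bandwidth fraction surviving k rail-leaf failures out of r rails.
def surviving_fraction(rails: int, failed: int) -> float:
    return (rails - failed) / rails

assert surviving_fraction(rails=8, failed=1) == 7 / 8  # one leaf down: 7/8 speed
print(surviving_fraction(rails=8, failed=2))           # 0.75 with two rails down
```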
What gets worse
- A whole-server failure still loses all 8 GPUs at once (NICs aren't shared across servers).
- Spine failure affects all rails simultaneously (if spines are shared across rails).
Operational implication
You can drain one rail at a time for maintenance, and the job survives. That changes how you do upgrades.
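A rolling upgrade then becomes a loop over rails instead of a cluster-wide outage. This is an illustrative sequence only; `drain`, `upgrade`, and `undrain` are placeholders for whatever your NOS or automation tooling actually provides:

```python
# Illustrative rolling-upgrade loop over rails (placeholder operations).
def drain(rail: int) -> None:
    print(f"draining rail {rail}: its traffic shifts onto the other 7 rails")

def upgrade(rail: int) -> None:
    print(f"upgrading rail-leaf-{rail} while the job runs at 7/8 bandwidth")

def undrain(rail: int) -> None:
    print(f"restoring rail {rail} to service")

for rail in range(8):  # strictly one rail at a time: never two rails down at once
    drain(rail)
    upgrade(rail)
    undrain(rail)
    # real automation would verify rail health here before touching the next one
```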
Pod sizing
A pod is a self-contained training fabric — typically sized so the topology is non-blocking at the rated GPU count.
Common sizes:
| Pod size | Typical structure |
|---|---|
| 128 GPUs | 16 servers × 8 GPUs · 8 rail leaves · single tier (no spine) |
| 256 GPUs | 32 servers × 8 GPUs · 8 rail leaves · 32 server-facing ports each · spine layer optional |
| 512–2048 GPUs | Multi-pod with 3-tier (leaf + spine + super-spine) |
| 10K+ GPUs | Multiple pods stitched across super-spines; sometimes with intermediate aggregation |
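The rows above fall out of simple arithmetic. A sizing sketch, assuming 8 NICs per server (as in the text) and hypothetical 64-port rail leaves:

```python
# Pod sizing arithmetic. The 64-port leaf is an assumption for illustration.
NICS_PER_SERVER = 8
LEAF_PORTS = 64  # assumed total ports per rail-leaf switch

def pod_plan(gpus: int) -> dict:
    servers = gpus // NICS_PER_SERVER
    # Each rail leaf takes one NIC per server. Non-blocking needs one uplink
    # per downlink, so one leaf per rail caps out at LEAF_PORTS // 2 servers.
    return {
        "gpus": gpus,
        "servers": servers,
        "rail_leaves": NICS_PER_SERVER,
        "downlinks_per_leaf": servers,
        "needs_spine_tier": servers > LEAF_PORTS // 2,
    }

for size in (128, 256, 512):
    print(pod_plan(size))
# 128 GPUs: 16 downlinks/leaf, no spine needed (matches the first row)
# 512 GPUs: 64 servers exceed one leaf per rail, so add a spine tier
```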
The super-spine is the tier that connects pods. It's where traffic crosses between pods for very large training jobs (bigger than one pod). Inter-pod traffic is slower (more hops), so the scheduler tries to keep jobs within a pod.
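That pod-affinity preference can be sketched as a placement rule. Everything here is hypothetical; real schedulers weigh far more than free capacity:

```python
# Hypothetical placement rule: prefer a single pod; only span pods (paying
# extra hops through the super-spine) when no single pod fits the job.
def place_job(gpus_needed: int, free_gpus_per_pod: list[int]) -> list[int]:
    fits = [i for i, free in enumerate(free_gpus_per_pod) if free >= gpus_needed]
    if fits:
        # Smallest pod that fits keeps traffic off the super-spine.
        return [min(fits, key=lambda i: free_gpus_per_pod[i])]
    pods, remaining = [], gpus_needed
    for i, free in sorted(enumerate(free_gpus_per_pod), key=lambda x: -x[1]):
        if remaining <= 0:
            break
        pods.append(i)
        remaining -= free
    return pods

print(place_job(256, [128, 512, 1024]))   # -> [1]: smallest single pod that fits
print(place_job(1536, [128, 512, 1024]))  # -> [2, 1]: must span the super-spine
```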
Pod ↔ rail interaction
Two combinations you'll see:
- One pod per rail. Pod 0 = all Rail 0 leaves and their spines. A job runs across "pods", but each pod is one rail. Simplifies operations.
- Rails span multiple pods. Rail 0 has multiple sub-pods, connected through a super-spine. More common at very large scale.
Either way, the GPU 0 ↔ GPU 0 dedicated path property holds within a single rail's reach.
What you should remember
- One rail per GPU index — NIC 0 to Rail Leaf 0, NIC 1 to Rail Leaf 1, etc.
- 8-GPU server = 8 NICs = 8 rails (the dominant pattern).
- Same-index traffic stays on the same rail — GPU 0 ↔ GPU 0 sees no contention from other GPUs on the same server.
- Single rail failure = 1/8 bandwidth loss; the job continues at 7/8 speed. Graceful.
- Pods are sized in powers of 2 — 128, 256, 512, 1024, 2048 — chosen so the topology stays non-blocking.
Next: Hash Polarization & Elephant Flows → why traditional ECMP breaks under synchronized collectives, and what hyperscalers do about it.