Cluster Sizing & Cabling
You've picked the design (3.2) and the deep-dive (3.3 — ROD). Now: how many leaves? How many spines? How much fiber? This page is the BOM math and the day-1 install reality. Vendor stacks — Spectrum-X, Tomahawk, Jericho3-AI, Silicon One, Arista — live on the blog.
1. Reference designs at each scale
The numbers below assume 400G per GPU NIC and 8 GPUs per server — the modern H100/H200/B200 baseline. Swap to 800G and the GPU counts double for the same switch radix.
256 GPUs — single pod, single leaf per rail
32 servers × 8 GPUs. Eight rail leaves at 32 × 400G each. NIC 0 from all 32 servers lands on Rail Leaf 0; NIC 1 on Rail Leaf 1; and so on. Each rail terminates on one leaf, so AllReduce traffic for any GPU rank never crosses a switch boundary — no spine is required for the dominant traffic pattern. Caveat: if your collectives do significant cross-rail traffic (AllToAll, MoE all-to-all), bolt on a small spine pair so the inter-rail hop has a real path. Pure AllReduce / AllGather workloads do not need it at this scale.
Aggregate host-side bandwidth: 256 × 400G = 102 Tbps (256 GPUs × 1 NIC). One bad leaf knocks out 32 GPUs (one per server) — degraded but the job can keep training on the remaining 7 rails.
1024 GPUs — single pod, 2-tier
128 servers × 8 GPUs. Eight rails of 4 leaves each = 32 leaves total. 8 spines. Each leaf: 32 × 400G down (servers) + 16 × 400G up (spine) = 1:2 oversub at the leaf. Some shops push for 1:1 here by going to R = 64 leaves — your call, your budget.
Spine fanout: each spine takes 4 uplinks from every leaf in its rail (4 leaves × 8 rails = 32 ports). One spine = 32 × 400G. Sized to a Tomahawk 4 (25.6 Tbps) or half a Tomahawk 5.
4096 GPUs — multi-pod, 3-tier
512 servers. 4 pods × 1024 GPUs. 128 leaves, 32 spines, 16 super-spines. A pod-local job (and most jobs are scheduled this way) stays 2 hops: leaf → spine → leaf. Cross-pod adds the super-spine hop — 4 hops worst case.
This is where rail-affinity scheduling starts paying for itself. If the scheduler keeps a 1K job inside one pod, you never touch the super-spine layer and you save the cross-pod tail latency.
16384 GPUs — multi-pod, near the limit
2048 servers. 16 pods × 1024 GPUs. ~512 leaves, 128 spines, 64 super-spines. You are approaching the single-fabric ECMP limits — hash polarization across 128-way fanout gets ugly fast, and link-failure blast radius starts to matter.
Start evaluating multi-planar here. A single 16K fabric still works, but 32K with the same shape will not.
100K+ GPUs — multi-planar territory
Single CLOS no longer works. 8 parallel planes, each one a full ~16K-GPU fabric. Each NIC index goes to its own plane — NIC 0 to Plane 0, NIC 1 to Plane 1, all the way to NIC 7 to Plane 7. Planes never talk to each other; the collective library knows which NIC is on which plane and steers traffic accordingly. (Cross-reference: see the Multi-Planar pattern in 3.2 Design Options.)
2. Switch radix math
The whole BOM falls out of one formula. A leaf has R ports. If R = 64 and each port is 400G:
- 32 ports down (servers) + 32 ports up (spine) = 1:1 oversub at the leaf.
- Pod size for one rail at 1:1 = R/2 servers. For
R = 64: 32 servers per rail × 8 rails = 256 servers / 2048 GPUs per pod. - Spine ports needed per rail = (leaves per rail) × (uplink ports per leaf). Spine radix sets the cap on leaves per rail.
Chip cheat-sheet for the current generation:
| Silicon | Capacity | Port options |
|---|---|---|
| Broadcom Tomahawk 4 | 25.6 Tbps | 64 × 400G or 32 × 800G |
| Broadcom Tomahawk 5 | 51.2 Tbps | 128 × 400G or 64 × 800G |
| NVIDIA Spectrum-4 | 51.2 Tbps | 64 × 800G (OSFP) |
Doubling the silicon doubles the radix, which roughly quadruples the GPUs you can host in a 2-tier pod before you're forced into a third tier.
3. Transceivers — OSFP, AOC, DAC
You will buy more optics than switches. Pick by reach, not by price alone — a $40 DAC that doesn't reach the spine row is a $40 paperweight.
| Type | Reach | Cost | When |
|---|---|---|---|
| DAC (Direct Attach Copper) | ≤ 3 m | Cheapest | Intra-rack: leaf → server in same rack |
| AOC (Active Optical Cable) | ≤ 30 m | Middle | Rack-to-adjacent-rack: leaf → server, or leaf → spine in same row |
| OSFP MMF (Multi-Mode Fiber) | ≤ 100 m | Higher | Across the row: spine → super-spine |
| OSFP SMF (Single-Mode Fiber) | km-class | Highest | Cross-DC, cross-building |
A note on form factor: OSFP and QSFP-DD are both 8-lane 400G/800G modules, mechanically incompatible. NVIDIA Spectrum-4 is OSFP. Many Broadcom Tomahawk designs ship as QSFP-DD. Mixing vendors means an adapter cage or a parallel optic SKU — plan it in the BOM.
4. Cabling for rail-optimized — the labelling scheme
Each server has 8 cables going to 8 different leaves. That is the structural fact you keep in your head. Mislabeling is the day-1 install bug — every cluster build hits it, the only question is how fast you find it.
Standard scheme: <server-id>:<nic-index> → <rail-leaf-id>:<port>. Example: srv-04:nic3 → rail-leaf-3:port-12. Both ends of the cable get the same label, printed twice, heat-shrunk.
Layer a colour code on top of that — Rail 0 = yellow boot, Rail 1 = blue, Rail 2 = green, and so on through 8 distinct colours. The colour catches the gross mis-rack ("why is there a yellow boot on Rail Leaf 3?") before you ever boot a host.
Run a cabling checker script post-install: each host pings its expected rail leaf on every NIC and refuses to boot the workload if NIC 3 lands on Rail 5 instead of Rail 3. Cheap to write, saves a week of NCCL debugging.
Spare 10–20% transceivers from day 1. At 16K GPUs you're racking ~32,000 optics — even at a 0.1% annual failure rate that's 32 modules a year, and they always fail on a Friday.
Skipping the cable-labelling scheme to save time on day 1. You'll spend the same time × 10 finding which NIC went to the wrong rail when the first AllReduce hangs and a thousand GPUs sit idle while you trace fiber under a raised floor. Label before you rack.
💡 What you should remember
| # | Concept | Why it matters | |
|---|---|---|---|
| 1 | 📐 | R/2 servers per rail at 1:1 | The whole BOM rolls up from leaf radix. R=64 → 32 servers per rail × 8 rails = 256 servers / 2048 GPUs per pod at 1:1. |
| 2 | 🏗️ | 3 tiers buy you ~4K–16K GPUs | Past 16K, single-fabric ECMP cracks. Move to multi-planar. |
| 3 | 🔌 | DAC inside the rack, AOC across the row, MMF across the hall | Reach picks the optic, not price. |
| 4 | 🏷️ | Label both ends of every cable | 8 cables per server × 2048 servers = 16K chances to get it wrong. |
| 5 | 🎨 | Colour-code by rail | The first-pass visual catch before any host boots. |
| 6 | 📦 | Spare 10–20% optics | Optical failure is continuous, not bursty. Budget for it. |
Next: Master Reference — Interactive Walk-Through → — the whole chapter in one scroll, animated. A different angle on everything you just learned.