Skip to main content

Cluster Sizing & Cabling

You've picked the design (3.2) and the deep-dive (3.3 — ROD). Now: how many leaves? How many spines? How much fiber? This page is the BOM math and the day-1 install reality. Vendor stacks — Spectrum-X, Tomahawk, Jericho3-AI, Silicon One, Arista — live on the blog.

1. Reference designs at each scale

The numbers below assume 400G per GPU NIC and 8 GPUs per server — the modern H100/H200/B200 baseline. Swap to 800G and the GPU counts double for the same switch radix.

256 GPUs — single pod, single leaf per rail

32 servers × 8 GPUs. Eight rail leaves at 32 × 400G each. NIC 0 from all 32 servers lands on Rail Leaf 0; NIC 1 on Rail Leaf 1; and so on. Each rail terminates on one leaf, so AllReduce traffic for any GPU rank never crosses a switch boundary — no spine is required for the dominant traffic pattern. Caveat: if your collectives do significant cross-rail traffic (AllToAll, MoE all-to-all), bolt on a small spine pair so the inter-rail hop has a real path. Pure AllReduce / AllGather workloads do not need it at this scale.

Aggregate host-side bandwidth: 256 × 400G = 102 Tbps (256 GPUs × 1 NIC). One bad leaf knocks out 32 GPUs (one per server) — degraded but the job can keep training on the remaining 7 rails.

1024 GPUs — single pod, 2-tier

128 servers × 8 GPUs. Eight rails of 4 leaves each = 32 leaves total. 8 spines. Each leaf: 32 × 400G down (servers) + 16 × 400G up (spine) = 1:2 oversub at the leaf. Some shops push for 1:1 here by going to R = 64 leaves — your call, your budget.

Spine fanout: each spine takes 4 uplinks from every leaf in its rail (4 leaves × 8 rails = 32 ports). One spine = 32 × 400G. Sized to a Tomahawk 4 (25.6 Tbps) or half a Tomahawk 5.

4096 GPUs — multi-pod, 3-tier

512 servers. 4 pods × 1024 GPUs. 128 leaves, 32 spines, 16 super-spines. A pod-local job (and most jobs are scheduled this way) stays 2 hops: leaf → spine → leaf. Cross-pod adds the super-spine hop — 4 hops worst case.

This is where rail-affinity scheduling starts paying for itself. If the scheduler keeps a 1K job inside one pod, you never touch the super-spine layer and you save the cross-pod tail latency.

16384 GPUs — multi-pod, near the limit

2048 servers. 16 pods × 1024 GPUs. ~512 leaves, 128 spines, 64 super-spines. You are approaching the single-fabric ECMP limits — hash polarization across 128-way fanout gets ugly fast, and link-failure blast radius starts to matter.

Start evaluating multi-planar here. A single 16K fabric still works, but 32K with the same shape will not.

100K+ GPUs — multi-planar territory

Single CLOS no longer works. 8 parallel planes, each one a full ~16K-GPU fabric. Each NIC index goes to its own plane — NIC 0 to Plane 0, NIC 1 to Plane 1, all the way to NIC 7 to Plane 7. Planes never talk to each other; the collective library knows which NIC is on which plane and steers traffic accordingly. (Cross-reference: see the Multi-Planar pattern in 3.2 Design Options.)

2. Switch radix math

The whole BOM falls out of one formula. A leaf has R ports. If R = 64 and each port is 400G:

  • 32 ports down (servers) + 32 ports up (spine) = 1:1 oversub at the leaf.
  • Pod size for one rail at 1:1 = R/2 servers. For R = 64: 32 servers per rail × 8 rails = 256 servers / 2048 GPUs per pod.
  • Spine ports needed per rail = (leaves per rail) × (uplink ports per leaf). Spine radix sets the cap on leaves per rail.

Chip cheat-sheet for the current generation:

SiliconCapacityPort options
Broadcom Tomahawk 425.6 Tbps64 × 400G or 32 × 800G
Broadcom Tomahawk 551.2 Tbps128 × 400G or 64 × 800G
NVIDIA Spectrum-451.2 Tbps64 × 800G (OSFP)

Doubling the silicon doubles the radix, which roughly quadruples the GPUs you can host in a 2-tier pod before you're forced into a third tier.

3. Transceivers — OSFP, AOC, DAC

You will buy more optics than switches. Pick by reach, not by price alone — a $40 DAC that doesn't reach the spine row is a $40 paperweight.

TypeReachCostWhen
DAC (Direct Attach Copper)≤ 3 mCheapestIntra-rack: leaf → server in same rack
AOC (Active Optical Cable)≤ 30 mMiddleRack-to-adjacent-rack: leaf → server, or leaf → spine in same row
OSFP MMF (Multi-Mode Fiber)≤ 100 mHigherAcross the row: spine → super-spine
OSFP SMF (Single-Mode Fiber)km-classHighestCross-DC, cross-building

A note on form factor: OSFP and QSFP-DD are both 8-lane 400G/800G modules, mechanically incompatible. NVIDIA Spectrum-4 is OSFP. Many Broadcom Tomahawk designs ship as QSFP-DD. Mixing vendors means an adapter cage or a parallel optic SKU — plan it in the BOM.

4. Cabling for rail-optimized — the labelling scheme

Each server has 8 cables going to 8 different leaves. That is the structural fact you keep in your head. Mislabeling is the day-1 install bug — every cluster build hits it, the only question is how fast you find it.

Standard scheme: <server-id>:<nic-index> → <rail-leaf-id>:<port>. Example: srv-04:nic3 → rail-leaf-3:port-12. Both ends of the cable get the same label, printed twice, heat-shrunk.

Layer a colour code on top of that — Rail 0 = yellow boot, Rail 1 = blue, Rail 2 = green, and so on through 8 distinct colours. The colour catches the gross mis-rack ("why is there a yellow boot on Rail Leaf 3?") before you ever boot a host.

Run a cabling checker script post-install: each host pings its expected rail leaf on every NIC and refuses to boot the workload if NIC 3 lands on Rail 5 instead of Rail 3. Cheap to write, saves a week of NCCL debugging.

Spare 10–20% transceivers from day 1. At 16K GPUs you're racking ~32,000 optics — even at a 0.1% annual failure rate that's 32 modules a year, and they always fail on a Friday.

Anti-pattern

Skipping the cable-labelling scheme to save time on day 1. You'll spend the same time × 10 finding which NIC went to the wrong rail when the first AllReduce hangs and a thousand GPUs sit idle while you trace fiber under a raised floor. Label before you rack.

💡 What you should remember

#ConceptWhy it matters
1📐R/2 servers per rail at 1:1The whole BOM rolls up from leaf radix. R=64 → 32 servers per rail × 8 rails = 256 servers / 2048 GPUs per pod at 1:1.
2🏗️3 tiers buy you ~4K–16K GPUsPast 16K, single-fabric ECMP cracks. Move to multi-planar.
3🔌DAC inside the rack, AOC across the row, MMF across the hallReach picks the optic, not price.
4🏷️Label both ends of every cable8 cables per server × 2048 servers = 16K chances to get it wrong.
5🎨Colour-code by railThe first-pass visual catch before any host boots.
6📦Spare 10–20% opticsOptical failure is continuous, not bursty. Budget for it.

Next: Master Reference — Interactive Walk-Through → — the whole chapter in one scroll, animated. A different angle on everything you just learned.