Sizing & Bill of Materials
You've understood every layer. You now have to convince finance and procurement to actually buy this thing. This page is the shopping list — pod sizes, server choices, switch choices, cables, and the supporting networks you'll forget about until they bite you.
- Size a pod: pick from the four tiers (16 / 64 / 256 / 1024 GPUs), map it to leaf/spine counts, and justify why 256 GPUs is the enterprise sweet spot.
- Spec the BoM: choose a GPU server (HGX H100 from a Tier-1 OEM), pick interchangeable leaf/spine switches (Spectrum-4, Arista 7060X, Tomahawk white-box), and pick cable types (DAC up to 3 m, AOC up to 30 m, fiber for 100 m+) by distance and PFC headroom.
- Plan the four networks: training, plus the three everyone forgets (storage at 100–400 GB/s, management on eth0 at 25–100 G, and OOB via BMC/iDRAC at 1 G).
- Defend the budget: show the network is only ~5–10% of a ~$8M–$11M 256-GPU pod, so skimping there at the expense of GPU efficiency is a bad trade.
Pod sizing — the four tiers
A pod is one self-contained training fabric, non-blocking at the rated GPU count. Most operators settle on one of these four sizes:
| Pod size | Servers | Rails / leaves | Spines | Use case |
|---|---|---|---|---|
| 16 GPUs | 2 × 8-GPU | 0 (direct server-to-server DAC, one per rail) | 0 | Dev / single-job experiments |
| 64 GPUs | 8 × 8-GPU | 8 (1 per rail, 8 ports each) | 0 (2-tier) | Small team training |
| 256 GPUs | 32 × 8-GPU | 8 (1 per rail, 32 ports each) | 4 | Production training, single pod |
| 1024 GPUs | 128 × 8-GPU | 16 (2 per rail with super-spine) | 8 | Frontier training, multi-pod |
Sizing rule of thumb: target the smallest pod that fits your largest expected job. Going bigger costs more in switches; going smaller forces multi-pod jobs across slower super-spine paths.
Pods scale in powers of 2. The 256-GPU pod is the sweet spot for most enterprise AI work — non-blocking, 3-tier topology fits cleanly, well-understood ops.
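The rule of thumb above can be sketched as a small lookup. This is a minimal illustration, not a planning tool: the tier rows are copied from the table, while `PodTier` and `smallest_pod_for` are hypothetical names introduced here.

```python
from dataclasses import dataclass

GPUS_PER_SERVER = 8  # every tier in the table uses 8-GPU servers


@dataclass(frozen=True)
class PodTier:
    gpus: int
    leaves: int  # one leaf per rail (two per rail at 1024 GPUs)
    spines: int

    @property
    def servers(self) -> int:
        return self.gpus // GPUS_PER_SERVER

    @property
    def nic_ports_per_leaf(self) -> int:
        # Server-facing ports each leaf must provide; 0 when there are no leaves.
        return self.gpus // self.leaves if self.leaves else 0


# Rows copied from the pod-sizing table above, smallest first.
POD_TIERS = [
    PodTier(gpus=16, leaves=0, spines=0),
    PodTier(gpus=64, leaves=8, spines=0),
    PodTier(gpus=256, leaves=8, spines=4),
    PodTier(gpus=1024, leaves=16, spines=8),
]


def smallest_pod_for(largest_job_gpus: int) -> PodTier:
    """Return the smallest tier whose GPU count covers the largest expected job."""
    for tier in POD_TIERS:
        if tier.gpus >= largest_job_gpus:
            return tier
    raise ValueError("job exceeds the largest single-pod tier; plan multiple pods")


tier = smallest_pod_for(200)
print(tier.gpus, tier.servers, tier.nic_ports_per_leaf)  # 256 32 32
```

A 200-GPU job lands in the 256-GPU tier, and the per-leaf port count it reports matches the "32 ports each" in the table.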
Which pod size should you pick?
Map your situation to a tier before you spec a single switch. Your job is to pick the smallest pod that fits your largest job, then adjust for your primary constraint — cost, isolation, scale, or upgrade tolerance.
| I want… | Pick | Why |
|---|---|---|
| Lowest-cost dev / a few-GPU experiments | 16 GPUs | Zero leaf, zero spine — direct server-to-server DAC per rail, so you pay for no switches at all. |
| To train for one small team on a budget | 64 GPUs | 8 leaves, no spines: a flat 2-tier fabric that's non-blocking, with the fewest switches of any tier above 16 GPUs. |
| Production training in a single tenant | 256 GPUs | The enterprise sweet spot — non-blocking 3-tier (8 leaves, 4 spines) that fits cleanly with no super-spine to operate. |
| Smallest blast radius / rolling upgrades | 256 GPUs | Keeping each pod no larger than your largest job means a leaf or spine drain takes down one pod's worth of GPUs, not the whole fleet. |
| To host many teams without cross-tenant noise | Multiple 256-GPU pods | Per-pod maintenance and isolation — each team lives in its own non-blocking pod instead of contending on shared spines. |
| Aggressive frontier-scale training | 1024 GPUs | 16 leaves + 8 spines + super-spine is the only tier that goes multi-pod — accept the added super-spine complexity to clear a single job past 256 GPUs. |
GPU server — the reference options
Pick one. Don't mix in the same pod (NCCL hates heterogeneity).
| Server | Vendor | GPUs | NIC config | NVLink | Notes |
|---|---|---|---|---|---|
| DGX H100 | NVIDIA | 8 × H100 SXM5 (80 GB) | 8 × ConnectX-7 (400 G) | NVSwitch, 900 GB/s | The reference. Hardest to source, easiest to support. |
| HGX H100 / B200 | OEMs (Dell, HPE, Supermicro) | 8 × H100 or B200 | 8 × CX-7 or CX-8 (800 G) | NVSwitch | Same baseboard, broader vendor choice. |
| MGX (modular) | NVIDIA + OEMs | 4 / 8 GPUs, configurable | Customer choice | NVSwitch | Newer; pick your CPU vendor + NIC vendor. |
| OCP Grand Teton (MI300X) | OCP / OEMs (AMD) | 8 × MI300X (192 GB) | 8 × Broadcom Thor or CX-7 (400 G) | Infinity Fabric (xGMI), ~896 GB/s–1.2 TB/s | Open Compute design; 192 GB HBM per GPU vs 80 GB. The AMD second source. |
| HPE Cray XD | HPE (AMD) | 8 × MI300X (192 GB) | 8 × Broadcom Thor or CX-7 (400 G) | Infinity Fabric (xGMI) | Tier-1 OEM path for Instinct; same RoCE v2 back-end as NVIDIA. |
| Microsoft Maia | Microsoft (custom) | 8 × Maia 100 | Built-in 400 G | Custom | Azure-only. Not for sale. |
| Meta Grand Teton | Meta (custom) | 8 × H100 | 8 × CX-7 | NVSwitch | Open Compute design; available through OCP. |
For most operators: HGX H100 from a Tier-1 OEM is the practical choice. DGX has the cleanest software story but is supply-constrained; HGX gives you the same baseboard with broader vendor support.
If memory capacity per GPU is the constraint — large models, KV-cache-heavy serving — an MI300X platform (OCP Grand Teton or HPE Cray XD) is the credible AMD alternative: 192 GB HBM3 per GPU vs 80 GB, with the same RoCE v2 back-end fabric (Broadcom Thor or ConnectX-7 NICs). The software story is ROCm/RCCL instead of CUDA/NCCL — RCCL is a drop-in for NCCL and reuses the same NCCL_* env-var names. Don't mix vendors in one pod either way.
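Because RCCL reads the same `NCCL_*` variable names, one launch-time configuration can cover either stack. The sketch below is illustrative only: `NCCL_SOCKET_IFNAME`, `NCCL_IB_HCA`, and `NCCL_DEBUG` are real NCCL variables, but `collective_env` is a hypothetical helper and the values (`eth0`, `mlx5`) are assumptions. Substitute the interface and RDMA device names your servers actually expose.

```python
import os


def collective_env(mgmt_iface: str = "eth0", rdma_devices: str = "mlx5") -> dict:
    """Build NCCL_* settings; per the text above, RCCL reads the same names."""
    return {
        # Bootstrap/control traffic uses the in-band management interface.
        "NCCL_SOCKET_IFNAME": mgmt_iface,
        # RDMA devices on the training fabric (assumed device-name prefix).
        "NCCL_IB_HCA": rdma_devices,
        # Log transport selection at startup so you can confirm RDMA is in use.
        "NCCL_DEBUG": "INFO",
    }


env = collective_env()
# Set these before the training framework initializes NCCL or RCCL.
os.environ.update(env)
```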
Switches — match the pod size
| Pod | Leaf switches | Spine switches |
|---|---|---|
| 16 GPUs | None (direct server-to-server DAC, one per rail) | None |
| 64 GPUs | 8 × low-radix (16 × 400G each) | None — leaves connect directly |
| 256 GPUs | 8 × mid-radix (48 × 400G each) | 4 × high-radix (64 × 400G each) |
| 1024+ GPUs | 16+ × high-radix | 8+ × high-radix + super-spines |
Switch vendor choices (any of these work — pick based on existing relationships):
| Vendor | Leaf option | Spine option |
|---|---|---|
| NVIDIA Spectrum-X | Spectrum-4 SN5600 (64×800G, or 128×400G with breakout) | Spectrum-4 SN5600 |
| Arista | 7060X6 (32×400G) | 7800R3 (128×400G modular) |
| Cisco | Nexus 9332D-H2R (32×400G) | Nexus 9408 (128×400G) |
| Broadcom Tomahawk-based | Edgecore AS9716 / Celestica DS5000 | Same chassis, more ports |
| Juniper | QFX5240 (64×800G, or 128×400G with breakout) | QFX5240 |
These are a trade-off, not a ranking — all are lossless-capable, so pick on operational fit:
- NVIDIA Spectrum-X — adaptive routing in silicon plus NIC + switch co-tuning. Tightest integration if you're already all-NVIDIA end to end.
- Arista R3 — open telemetry and deep buffers, strong observability story, multi-vendor NICs.
- Broadcom Tomahawk / SONiC — the most open path: white-box hardware, your choice of NOS, no single-vendor lock-in.
Each runs the same standards on the wire (BGP, RoCE v2, PFC, ECN), so the decision is about ops fit and existing relationships, not raw capability.
Cabling — the part that bites you
Cable length and type drive both your PFC headroom and your cabling cost.
| Type | Distance | Approx. cost per link | Use case |
|---|---|---|---|
| DAC (Direct Attach Copper) | up to 3 m | ~$50 | Within rack (server to ToR) |
| AOC (Active Optical Cable) | up to 30 m | ~$300 | End of row, rack-to-rack |
| Fiber + transceiver pair | 100+ m | ~$600 (incl. 2 × 400G optics) | Spine-to-leaf across the DC |
Sizing math for a 256-GPU pod:
- 32 servers × 8 NICs × DAC to leaf = 256 DAC cables in-rack
- 8 leaves × 4 spines × 1 cable each = 32 fiber + optics between leaf and spine
Total cable cost for a 256-GPU pod: roughly $30K in cabling (a rounding error vs the GPUs but worth tracking).
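The sizing math above is easy to reproduce. This sketch uses only the unit prices from the cable table and the counts from the two bullets; the variable names are illustrative.

```python
# Unit prices from the cable table above (USD, order-of-magnitude only).
DAC_COST = 50
FIBER_LINK_COST = 600  # includes two 400G optics

servers, nics_per_server = 32, 8
leaves, spines, links_per_leaf_spine_pair = 8, 4, 1

dac_cables = servers * nics_per_server
fiber_links = leaves * spines * links_per_leaf_spine_pair
total_cost = dac_cables * DAC_COST + fiber_links * FIBER_LINK_COST

print(dac_cables, fiber_links, total_cost)  # 256 32 32000
```

That comes to $32,000, which is where the "roughly $30K" figure comes from. Changing `links_per_leaf_spine_pair` shows how quickly the optics line grows if you add leaf-to-spine links.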
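Cable length affects PFC headroom: the longer the cable, the more data is still in flight after a switch sends a PAUSE frame, and the switch has to reserve buffer to absorb it. Most modern switches auto-detect cable length. Older switches require manual configuration, and getting it wrong causes silent drops. Buy cables in the lengths you actually need; mixed-length runs make debugging harder.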
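To get a feel for why length matters, here is a rough estimate of the cable-delay part of PFC headroom at 400 G. It is a sketch under stated assumptions, not a vendor formula: it assumes about 5 ns of propagation delay per metre and counts only the bytes that arrive during one cable round trip. Real headroom also covers frames already being transmitted and the peer's pause response time, which vary by platform. `cable_headroom_bytes` is a hypothetical helper.

```python
PROPAGATION_NS_PER_M = 5.0  # assumption: roughly 5 ns per metre in copper or fiber
LINK_GBPS = 400             # port speed used throughout this page


def cable_headroom_bytes(length_m: float, link_gbps: float = LINK_GBPS) -> float:
    """Bytes that can still arrive after a PAUSE is sent, from cable delay alone."""
    round_trip_ns = 2 * length_m * PROPAGATION_NS_PER_M
    bytes_per_ns = link_gbps / 8  # 1 Gb/s is 1 bit per ns, so divide by 8 for bytes
    return round_trip_ns * bytes_per_ns


for label, metres in [("DAC", 3), ("AOC", 30), ("fiber", 100)]:
    print(f"{label:5s} {metres:4d} m  ~{cable_headroom_bytes(metres) / 1000:.1f} KB")
```

Under these assumptions a 3 m DAC contributes about 1.5 KB per lossless priority, a 30 m AOC about 15 KB, and a 100 m fiber run about 50 KB, so a headroom value sized for short cables can be far too small for a long run.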
The supporting networks you'll forget about
It's not just the training fabric. A real cluster has three more networks running in parallel:
Storage network
Training data lives somewhere — NVMe-over-Fabrics, Lustre, GPFS, S3-compatible object store. For a 256-GPU pod:
- Throughput needed: depends on training step time. Often 100-400 GB/s aggregate read.
- Topology: usually a separate spine-leaf, or shared with training fabric but on a different priority class.
- NIC: sometimes a separate storage NIC per server (often slower — 100-200 G), or shared on the training NIC with QoS isolation.
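A quick conversion shows how the aggregate throughput range maps to a per-server storage NIC. This sketch assumes reads are spread evenly across the 32 servers of a 256-GPU pod, which real workloads only approximate; `per_server_storage_gbps` is a hypothetical helper.

```python
def per_server_storage_gbps(aggregate_gbytes_per_s: float, servers: int = 32) -> float:
    """Convert aggregate read throughput (GB/s) into per-server line rate (Gb/s)."""
    return aggregate_gbytes_per_s / servers * 8


low = per_server_storage_gbps(100)   # 25.0 Gb/s per server
high = per_server_storage_gbps(400)  # 100.0 Gb/s per server
print(low, high)
```

At the top of the range each server needs about 100 Gb/s of sustained reads, so a 100 G storage NIC would have no margin and 200 G leaves room for bursts.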
Management network
How you SSH to servers, fetch logs, push configs.
- Speed: 25-100 G is fine. Not performance-critical.
- Connectivity: every server's eth0 (the in-band management interface).
- Don't put RDMA traffic on this; it gets congested by everything else.
Out-of-band (OOB) network
The "console" network — BMC / iDRAC / iLO. How you reboot a server when it stops responding to in-band.
- Speed: 1 G is plenty.
- Critical for ops: without it, you walk to the datacenter.
Budget all three when you plan the pod. Each needs at least two dedicated switches.
A worked example — what a 256-GPU pod actually costs
Order-of-magnitude only. Real prices depend on volume discounts, relationships, and current GPU supply.
| Line item | Approximate cost |
|---|---|
| 32 × HGX H100 servers (8 × H100, 8 × CX-7, dual Xeon, 2 TB RAM) | $7.5M–$10M |
| 8 × Spectrum-4 leaf switches (or Arista 7060X) | $400K |
| 4 × Spectrum-4 spine switches | $300K |
| Cables (DAC + AOC + fiber + optics) | $30K |
| Storage + management + OOB networks | $200K |
| Power + cooling + rack + cabling labor | $200K |
| Total infrastructure | ~$8M–$11M |
Most of the cost is the GPUs themselves (~$30K each × 256 ≈ $7.7M). The network is roughly 5–10% of the total. That's why "skimping on the network" makes no sense: trimming a line item that is under a tenth of the budget, at the risk of losing 20% of GPU efficiency, is a terrible trade.
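The table's totals and the network share can be checked with a few lines. The figures below are copied from the table (in USD millions); the grouping into "training fabric" versus "all networks" is an assumption made here for illustration.

```python
# (low, high) estimates in USD millions, copied from the table above.
LINE_ITEMS = {
    "servers": (7.5, 10.0),
    "leaf_switches": (0.4, 0.4),
    "spine_switches": (0.3, 0.3),
    "cables": (0.03, 0.03),
    "supporting_networks": (0.2, 0.2),
    "facility": (0.2, 0.2),
}
TRAINING_FABRIC = ("leaf_switches", "spine_switches", "cables")
ALL_NETWORKS = TRAINING_FABRIC + ("supporting_networks",)


def total(bound: int) -> float:
    """Sum the low (0) or high (1) estimate across every line item."""
    return sum(item[bound] for item in LINE_ITEMS.values())


def share(keys, bound: int) -> float:
    """Fraction of the total spent on the given line items."""
    return sum(LINE_ITEMS[k][bound] for k in keys) / total(bound)


for name, bound in (("low", 0), ("high", 1)):
    print(f"{name}: total ${total(bound):.2f}M, "
          f"training fabric {share(TRAINING_FABRIC, bound):.1%}, "
          f"all networks {share(ALL_NETWORKS, bound):.1%}")
```

With these inputs the total runs from about $8.6M to $11.1M. The training fabric is roughly 6.6–8.5% of that, and including the supporting networks brings it to roughly 8.4–10.8%.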
💡 What you should remember
| # | | Concept | Why it matters |
|---|---|---|---|
| 1 | 🔢 | Pods scale in powers of 2. | 64 / 256 / 1024 are the sweet spots. |
| 2 | 🖥️ | Pick one server design and don't mix. | NCCL doesn't love heterogeneous clusters. |
| 3 | 🔁 | Switch vendors are interchangeable at the protocol level. | BGP, RoCE v2, PFC, and ECN are all standards. Pick on operational fit, not features. |
| 4 | 🔌 | Cable length affects PFC headroom. | Modern switches auto-detect; older ones need manual config. |
| 5 | 🌐 | You're really buying four networks. | Training, storage, management, OOB. Plan for all of them. |
| 6 | 💰 | The network is ~5–10% of pod cost. | Don't optimize it down at the expense of GPU efficiency. |
Next: Configure the Fabric → switch-by-switch through the BGP underlay, QoS, PFC, ECN, and buffer config that makes the fabric lossless.