Sizing & Bill of Materials
You've understood every layer. You now have to convince finance and procurement to actually buy this thing. This page is the shopping list — pod sizes, server choices, switch choices, cables, and the supporting networks you'll forget about until they bite you.
Pod sizing — the four tiers
A pod is one self-contained training fabric, non-blocking at the rated GPU count. Most operators settle on one of these four sizes:
| Pod size | Servers | Rails / leaves | Spines | Use case |
|---|---|---|---|---|
| 16 GPUs | 2 × 8-GPU | 0 (NICs cabled back-to-back) | 0 | Dev / single-job experiments |
| 64 GPUs | 8 × 8-GPU | 8 (1 per rail, 8 ports used each) | 0 (leaf only) | Small team training |
| 256 GPUs | 32 × 8-GPU | 8 (1 per rail, 32 down + 32 up each) | 4 | Production training, single pod |
| 1024 GPUs | 128 × 8-GPU | 16 (2 per rail with super-spine) | 8 | Frontier training, multi-pod |
Sizing rule of thumb: target the smallest pod that fits your largest expected job. Going bigger costs more in switches; going smaller forces multi-pod jobs across slower super-spine paths.
Pods scale in powers of 2. The 256-GPU pod is the sweet spot for most enterprise AI work — non-blocking, 3-tier topology fits cleanly, well-understood ops.
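If you want the sizing rule as code, here's a minimal sketch — the tier numbers are copied straight from the table above, nothing is derived:

```python
# Hypothetical sizing helper -- the "smallest pod that fits your
# largest job" rule. Tier figures come from the pod table above.
PODS = [
    # (gpus, servers, leaves, spines)
    (16,   2,   0,  0),   # NICs cabled back-to-back, no switches
    (64,   8,   8,  0),   # one leaf per rail, leaf-only
    (256,  32,  8,  4),   # rail-optimized leaf-spine
    (1024, 128, 16, 8),   # two leaves per rail plus super-spine
]

def pick_pod(largest_job_gpus: int) -> tuple[int, int, int, int]:
    """Return the smallest pod tier that fits the largest expected job."""
    for tier in PODS:
        if tier[0] >= largest_job_gpus:
            return tier
    raise ValueError("job exceeds the largest single-pod tier; plan multi-pod")

gpus, servers, leaves, spines = pick_pod(200)
print(f"{gpus}-GPU pod: {servers} servers, {leaves} leaves, {spines} spines")
# -> 256-GPU pod: 32 servers, 8 leaves, 4 spines
```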
GPU server — the reference options
Pick one. Don't mix in the same pod (NCCL hates heterogeneity).
| Server | Vendor | GPUs | NIC config | NVLink | Notes |
|---|---|---|---|---|---|
| DGX H100 | NVIDIA | 8 × H100 SXM5 (80 GB) | 8 × ConnectX-7 (400 G) | NVSwitch, 900 GB/s | The reference. Hardest to source, easiest to support. |
| HGX H100 / B200 | OEMs (Dell, HPE, Supermicro) | 8 × H100 or B200 | 8 × CX-7 or CX-8 (800 G) | NVSwitch | Same baseboard, broader vendor choice. |
| MGX (modular) | NVIDIA + OEMs | 4 / 8 GPUs, configurable | Customer choice | NVSwitch | Newer; pick your CPU vendor + NIC vendor. |
| Microsoft Maia | Microsoft (custom) | 8 × Maia 100 | Built-in 400 G | Custom | Azure-only. Not for sale. |
| Meta Grand Teton | Meta (custom) | 8 × H100 | 8 × CX-7 | NVSwitch | Open Compute design; available through OCP. |
For most operators: HGX H100 from a Tier-1 OEM is the practical choice. DGX has the cleanest software story but is supply-constrained; HGX gives you the same baseboard with broader vendor support.
Switches — match the pod size
| Pod | Leaf switches | Spine switches |
|---|---|---|
| 16 GPUs | None (NICs cabled back-to-back per rail; NVLink covers intra-server) | None |
| 64 GPUs | 8 × low-radix (16 × 400G each) | None — single tier, leaves only |
| 256 GPUs | 8 × high-radix (64 × 400G each: 32 down, 32 up) | 4 × high-radix (64 × 400G each) |
| 1024+ GPUs | 16+ × high-radix | 8+ × high-radix + super-spines |
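The switch counts fall out of simple non-blocking arithmetic. A sketch, assuming a rail-optimized design with 8 rails per server and uplinks equal to downlinks on every leaf (it doesn't model the super-spine tier a 1024-GPU pod adds; the radix defaults are assumptions):

```python
import math

RAILS = 8  # one NIC (rail) per GPU in an 8-GPU server

def switch_counts(servers: int, leaf_ports: int = 64, spine_ports: int = 64):
    """Leaf/spine counts for a non-blocking, rail-optimized two-tier fabric."""
    # Each rail leaf takes one downlink per server and reserves an equal
    # number of uplinks, so one leaf serves at most leaf_ports // 2 servers.
    leaves_per_rail = math.ceil(servers / (leaf_ports // 2))
    leaves = leaves_per_rail * RAILS
    uplinks = servers * RAILS            # total uplinks == total downlinks
    spines = math.ceil(uplinks / spine_ports)
    return leaves, spines

print(switch_counts(32))   # 256-GPU pod -> (8, 4)
```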
Switch vendor choices (any of these work — pick based on existing relationships):
| Vendor | Leaf option | Spine option |
|---|---|---|
| NVIDIA Spectrum-X | Spectrum-4 SN5600 (64×800G, breaks out to 128×400G) | Spectrum-4 SN5600 |
| Arista | 7060X6 (32×400G) | 7800R3 (128×400G modular) |
| Cisco | Nexus 9332D-H2R (32×400G) | Nexus 9408 (128×400G) |
| Broadcom Tomahawk-based | Edgecore AS9716 / Celestica DS5000 | Same chassis, more ports |
| Juniper | QFX5240 (64×400G) | QFX5240 |
The Spectrum-X stack is most aggressive on AI-specific features (adaptive routing in silicon, NIC + switch co-tuning). Arista and Tomahawk-based are the open / multi-vendor choices.
Cabling — the part that bites you
Cable length and type determine PFC headroom and what your cost looks like.
| Type | Distance | Cost per pair | Use case |
|---|---|---|---|
| DAC (Direct Attach Copper) | up to 3 m | ~$50 | Within rack (server to ToR) |
| AOC (Active Optical Cable) | up to 30 m | ~$300 | End of row, rack-to-rack |
| Fiber + transceiver pair | 100+ m | ~$600 (incl. 2 × 400G optics) | Spine-to-leaf across the DC |
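The table collapses to a tiny chooser — a toy sketch using the same rough per-pair prices, which will of course vary by vendor and volume:

```python
def pick_media(length_m: float) -> tuple[str, int]:
    """Return (cable type, approx USD per pair) for a 400G link."""
    if length_m <= 3:
        return "DAC", 50                      # in-rack, server to leaf
    if length_m <= 30:
        return "AOC", 300                     # end of row, rack to rack
    return "fiber + 2x400G optics", 600       # cross-DC spine runs

print(pick_media(2.0))   # ('DAC', 50)
print(pick_media(45.0))  # ('fiber + 2x400G optics', 600)
```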
Sizing math for a 256-GPU pod (non-blocking, so uplinks must equal downlinks):
- 32 servers × 8 NICs × 1 DAC to leaf = 256 DAC cables in-rack
- 8 leaves × 32 uplinks (spread as 64 per spine across 4 spines) = 256 fiber + optics pairs between leaf and spine
Total cable cost for a 256-GPU pod: roughly $170K — 256 × $50 in DAC plus 256 × $600 in fiber and optics. Small next to the GPUs, but worth tracking.
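The same arithmetic in a few lines, re-deriving the counts above (quantities and unit prices are the table's rough figures, not quotes):

```python
# Cable BOM for the 256-GPU pod above.
SERVERS, NICS_PER_SERVER, LEAVES, UPLINKS_PER_LEAF = 32, 8, 8, 32

dac = SERVERS * NICS_PER_SERVER     # 256 in-rack server-to-leaf DACs
fiber = LEAVES * UPLINKS_PER_LEAF   # 256 leaf-to-spine fiber pairs
cost = dac * 50 + fiber * 600
print(dac, fiber, f"${cost:,}")     # 256 256 $166,400
```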
Cable length affects PFC headroom: after a switch sends a PAUSE frame, it must still absorb everything in flight on the wire, and a longer cable means more bytes in flight. Most modern switches auto-detect cable length and size headroom accordingly. Older switches require manual configuration — and getting it wrong causes silent drops. Buy cables in the lengths you actually need; mixed-length runs make debugging harder.
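To see the scale of the effect, here's a back-of-envelope headroom estimate — a sketch, not any vendor's sizing formula; the jumbo MTU and the ~5 ns/m propagation figure are assumptions:

```python
# Rough per-queue PFC headroom: one round trip of in-flight data at
# line rate, plus a couple of maximum-size frames mid-serialization.
LINE_RATE_BPS = 400e9
PROP_DELAY_S_PER_M = 5e-9    # ~5 ns/m in copper or fiber
MTU_BYTES = 9216             # assumed jumbo frames, common on RoCE fabrics

def pfc_headroom_bytes(cable_m: float) -> int:
    rtt = 2 * cable_m * PROP_DELAY_S_PER_M
    in_flight = LINE_RATE_BPS / 8 * rtt
    return int(in_flight + 2 * MTU_BYTES)

print(pfc_headroom_bytes(3))     # in-rack DAC: ~20 KB
print(pfc_headroom_bytes(100))   # cross-DC fiber: ~68 KB
```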
The supporting networks you'll forget about
It's not just the training fabric. A real cluster has three more networks running in parallel:
Storage network
Training data lives somewhere — NVMe-over-Fabrics, Lustre, GPFS, S3-compatible object store. For a 256-GPU pod:
- Throughput needed: depends on training step time. Often 100–400 GB/s aggregate read (see the sketch after this list).
- Topology: usually a separate spine-leaf, or shared with training fabric but on a different priority class.
- NIC: sometimes a separate storage NIC per server (often slower — 100-200 G), or shared on the training NIC with QoS isolation.
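The throughput estimate itself is one line of arithmetic. A sketch with made-up workload numbers — sample size, micro-batch, and step time are all assumptions, so plug in your own (this also ignores caching, prefetch, and checkpoint writes):

```python
# Aggregate read rate ~= bytes consumed per step / step time.
GPUS = 256
SAMPLE_BYTES = 4 * 1024**2    # assumed 4 MB per training sample
SAMPLES_PER_GPU_STEP = 32     # assumed per-GPU micro-batch
STEP_TIME_S = 0.25            # assumed training step time

read_bps = GPUS * SAMPLES_PER_GPU_STEP * SAMPLE_BYTES / STEP_TIME_S
print(f"{read_bps / 1e9:.0f} GB/s aggregate read")   # -> 137 GB/s
```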
Management network
How you SSH to servers, fetch logs, push configs.
- Speed: 25-100 G is fine. Not performance-critical.
- Connectivity: every server's `eth0` (the in-band management interface).
- Don't put RDMA traffic on this — it gets congested by everything else.
Out-of-band (OOB) network
The "console" network — BMC / iDRAC / iLO. How you reboot a server when it stops responding to in-band.
- Speed: 1 G is plenty.
- Critical for ops: without it, you walk to the datacenter.
Budget all three when you plan the pod. Each needs at least two dedicated switches of its own.
A worked example — what a 256-GPU pod actually costs
Order-of-magnitude only. Real prices depend on volume discounts, relationships, and current GPU supply.
| Line item | Approximate cost |
|---|---|
| 32 × HGX H100 servers (8 × H100, 8 × CX-7, dual Xeon, 2 TB RAM) | $7.5M–$10M |
| 8 × Spectrum-4 leaf switches (or Arista 7060X) | $400K |
| 4 × Spectrum-4 spine switches | $300K |
| Cables (DAC + AOC + fiber + optics) | ~$170K |
| Storage + management + OOB networks | $200K |
| Power + cooling + rack + cabling labor | $200K |
| Total infrastructure | ~$8M–$11M |
Most of the cost is the GPUs themselves (~$30K each × 256 ≈ $7.7M). The network is roughly 5–10% of the total. That's why "skimping on the network" makes no sense — saving 10% on the network to lose 20% of GPU efficiency is a terrible trade.
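The worked example as a toy cost model, handy for playing with volume pricing — the ranges are the table's own order-of-magnitude figures, and single-value lines are treated as flat:

```python
# (low, high) USD per line item, from the table above.
bom = {
    "servers (32x HGX H100)": (7_500_000, 10_000_000),
    "leaf switches (8x)":     (400_000, 400_000),
    "spine switches (4x)":    (300_000, 300_000),
    "cables + optics":        (170_000, 170_000),
    "storage/mgmt/OOB nets":  (200_000, 200_000),
    "power/cooling/racks":    (200_000, 200_000),
}

lo = sum(v[0] for v in bom.values())
hi = sum(v[1] for v in bom.values())
fabric = sum(bom[k][0] for k in
             ("leaf switches (8x)", "spine switches (4x)", "cables + optics"))
print(f"total ${lo/1e6:.1f}M-${hi/1e6:.1f}M, fabric ~{fabric/lo:.0%} of low end")
# -> total $8.8M-$11.3M, fabric ~10% of low end
```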
What you should remember
- Pods scale in powers of 2. 64 / 256 / 1024 are the sweet spots.
- Pick one server design and don't mix. NCCL doesn't love heterogeneous clusters.
- Switch vendors are interchangeable at the protocol level (BGP, RoCE v2, PFC, ECN are all standards). Pick on operational fit, not features.
- Cable length affects PFC headroom. Modern switches auto-detect; older ones need manual config.
- You're really buying four networks: training, storage, management, OOB. Plan for all of them.
- The network is ~5–10% of pod cost. Don't optimize it down at the expense of GPU efficiency.
Next: Configure the Fabric → — switch-by-switch, the BGP underlay, QoS, PFC, ECN, and buffer config that makes the fabric lossless.