
Sizing & Bill of Materials

You've understood every layer. You now have to convince finance and procurement to actually buy this thing. This page is the shopping list — pod sizes, server choices, switch choices, cables, and the supporting networks you'll forget about until they bite you.


Pod sizing — the four tiers

A pod is one self-contained training fabric, non-blocking at the rated GPU count. Most operators settle on one of these four sizes:

| Pod size | Servers | Rails / leaves | Spines | Use case |
|---|---|---|---|---|
| 16 GPUs | 2 × 8-GPU | 2 (no leaf — direct) | 0 | Dev / single-job experiments |
| 64 GPUs | 8 × 8-GPU | 8 (1 per rail, 8 ports each) | 0 (2-tier) | Small team training |
| 256 GPUs | 32 × 8-GPU | 8 (1 per rail, 32 ports each) | 4 | Production training, single pod |
| 1024 GPUs | 128 × 8-GPU | 16 (2 per rail with super-spine) | 8 | Frontier training, multi-pod |

Sizing rule of thumb: target the smallest pod that fits your largest expected job. Going bigger costs more in switches; going smaller forces multi-pod jobs across slower super-spine paths.

Pods scale in powers of 2. The 256-GPU pod is the sweet spot for most enterprise AI work — non-blocking, 3-tier topology fits cleanly, well-understood ops.
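
A quick way to sanity-check these tiers is to compute them. The sketch below mirrors the table's assumptions (8-GPU servers, one 400G NIC per GPU, one leaf per rail, non-blocking leaves, 64-port switches); the function and constant names are illustrative, not from any real tool.

```python
import math

# Illustrative constants mirroring the tiers above; not from a real tool.
GPUS_PER_SERVER = 8
RAILS = 8            # one 400G NIC per GPU, one rail per NIC
SWITCH_RADIX = 64    # 400G ports per switch, leaf and spine alike

def size_pod(gpus: int) -> dict:
    """Server/leaf/spine counts for one non-blocking pod (up to 512 GPUs)."""
    servers = gpus // GPUS_PER_SERVER
    if servers <= 2:
        # 16-GPU tier: cable each rail's two NICs back-to-back, no switches.
        return {"servers": servers, "leaves": 0, "spines": 0}
    if servers > SWITCH_RADIX // 2:
        raise ValueError("needs 2 leaves per rail plus a super-spine (3-tier)")
    leaves = RAILS  # one leaf per rail; each leaf takes one NIC per server
    if servers <= 8:
        spines = 0  # 64-GPU tier: leaf-only, cross-rail traffic rides NVLink
    else:
        # Non-blocking: uplinks per leaf == downlinks per leaf == servers.
        spines = math.ceil(leaves * servers / SWITCH_RADIX)
    return {"servers": servers, "leaves": leaves, "spines": spines}

print(size_pod(256))  # {'servers': 32, 'leaves': 8, 'spines': 4}
```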


GPU server — the reference options

Pick one. Don't mix in the same pod (NCCL hates heterogeneity).

| Server | Vendor | GPUs | NIC config | NVLink | Notes |
|---|---|---|---|---|---|
| DGX H100 | NVIDIA | 8 × H100 SXM5 (80 GB) | 8 × ConnectX-7 (400 G) | NVSwitch, 900 GB/s | The reference. Hardest to source, easiest to support. |
| HGX H100 / B200 | OEMs (Dell, HPE, Supermicro) | 8 × H100 or B200 | 8 × CX-7 or CX-8 (800 G) | NVSwitch | Same baseboard, broader vendor choice. |
| MGX (modular) | NVIDIA + OEMs | 4 / 8 GPUs, configurable | Customer choice | NVSwitch | Newer; pick your CPU vendor + NIC vendor. |
| Microsoft Maia | Microsoft (custom) | 8 × Maia 100 | Built-in 400 G | Custom | Azure-only. Not for sale. |
| Meta Grand Teton | Meta (custom) | 8 × H100 | 8 × CX-7 | NVSwitch | Open Compute design; available through OCP. |

For most operators: HGX H100 from a Tier-1 OEM is the practical choice. DGX has the cleanest software story but is supply-constrained; HGX gives you the same baseboard with broader vendor support.


Switches — match the pod size

| Pod | Leaf switches | Spine switches |
|---|---|---|
| 16 GPUs | None (rail NICs cabled back-to-back between the two servers) | None |
| 64 GPUs | 8 × low-radix (16 × 400G each) | None (single leaf tier; cross-rail traffic rides NVLink inside each server) |
| 256 GPUs | 8 × mid-radix (64 × 400G each: 32 down, 32 up for non-blocking) | 4 × high-radix (64 × 400G each) |
| 1024+ GPUs | 16+ × high-radix | 8+ × high-radix + super-spines |

Switch vendor choices (any of these work — pick based on existing relationships):

| Vendor | Leaf option | Spine option |
|---|---|---|
| NVIDIA Spectrum-X | Spectrum-4 SN5600 (64 × 800G, splittable to 128 × 400G) | Spectrum-4 SN5600 |
| Arista | 7060X6 (32 × 400G) | 7800R3 (128 × 400G modular) |
| Cisco | Nexus 9332D-H2R (32 × 400G) | Nexus 9408 (128 × 400G) |
| Broadcom Tomahawk-based | Edgecore AS9716 / Celestica DS5000 | Same chassis, more ports |
| Juniper | QFX5240 (64 × 400G) | QFX5240 |

The Spectrum-X stack is most aggressive on AI-specific features (adaptive routing in silicon, NIC + switch co-tuning). Arista and Tomahawk-based are the open / multi-vendor choices.


Cabling — the part that bites you

Cable length and type determine your PFC headroom requirements and a visible slice of the budget.

| Type | Distance | Approx. cost per link | Use case |
|---|---|---|---|
| DAC (Direct Attach Copper) | up to 3 m | ~$50 | Within rack (server to ToR) |
| AOC (Active Optical Cable) | up to 30 m | ~$300 | End of row, rack-to-rack |
| Fiber + transceiver pair | 100+ m | ~$600 (incl. 2 × 400G optics) | Spine-to-leaf across the DC |

Sizing math for a 256-GPU pod:

  • 32 servers × 8 NICs = 256 DAC cables in-rack (server to leaf)
  • 8 leaves × 32 uplinks each (8 to each of the 4 spines, keeping the fabric non-blocking) = 256 fiber + optics links between leaf and spine

Total cable cost for a 256-GPU pod: roughly $170K (256 × $50 DAC + 256 × $600 fiber and optics). Small next to the GPUs, but worth its own line item.
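
The same arithmetic as a short sketch, using the per-link prices from the table above (assumed round numbers, not quotes):

```python
# Cable bill for a non-blocking 256-GPU pod, using the prices above.
servers, rails, leaves, spines = 32, 8, 8, 4
uplinks_per_leaf = servers           # non-blocking: match the 32 downlinks

dac = servers * rails                # 256 in-rack DAC runs, server to leaf
fiber = leaves * uplinks_per_leaf    # 256 leaf-to-spine links, 8 per leaf-spine pair

cost = dac * 50 + fiber * 600        # ~$50 per DAC, ~$600 per fiber link incl. optics
print(f"{dac} DAC + {fiber} fiber links = ~${cost / 1e3:.0f}K")
# 256 DAC + 256 fiber links = ~$166K
```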

Cable length affects PFC headroom. Most modern switches auto-detect. Older switches require manual configuration — and getting it wrong causes silent drops. Buy cables in the lengths you actually need; mixed-length runs make debugging harder.
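
Why does length matter for PFC? The switch must reserve enough buffer (headroom) to absorb everything still in flight after it sends a PAUSE. A deliberately simplified model of that reservation, not any vendor's formula:

```python
# Simplified PFC headroom estimate: round-trip propagation at line rate,
# plus roughly one max-size frame serializing at each end. Real switches
# add response-time and encoding terms on top of this.

def pfc_headroom_bytes(cable_m: float, line_rate_gbps: float = 400,
                       mtu: int = 9216) -> float:
    c = 2e8                  # signal speed in fiber/copper, roughly 2/3 c, in m/s
    rtt_s = 2 * cable_m / c  # round-trip propagation delay
    bytes_per_s = line_rate_gbps * 1e9 / 8
    return rtt_s * bytes_per_s + 2 * mtu

print(pfc_headroom_bytes(3))    # ~20 KB for an in-rack DAC
print(pfc_headroom_bytes(100))  # ~68 KB for a long fiber run
```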


The supporting networks you'll forget about

It's not just the training fabric. A real cluster has three more networks running in parallel:

Storage network

Training data lives somewhere — NVMe-over-Fabrics, Lustre, GPFS, S3-compatible object store. For a 256-GPU pod:

  • Throughput needed: depends on training step time. Often 100-400 GB/s aggregate read (a quick sizing sketch follows this list).
  • Topology: usually a separate spine-leaf, or shared with training fabric but on a different priority class.
  • NIC: sometimes a separate storage NIC per server (often slower — 100-200 G), or shared on the training NIC with QoS isolation.
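
A rough way to turn step time into a read-bandwidth target. All inputs below are illustrative assumptions; substitute your own batch and sample sizes:

```python
def storage_read_gb_per_s(global_batch: int, bytes_per_sample: float,
                          step_time_s: float) -> float:
    """Aggregate read throughput (GB/s) needed to keep every step fed."""
    return global_batch * bytes_per_sample / step_time_s / 1e9

# 256 GPUs x 32 samples each per step, ~1 MB samples, 0.5 s steps:
print(storage_read_gb_per_s(256 * 32, 1e6, 0.5))  # ~16.4 GB/s sustained
# Video or other large samples push this toward the 100-400 GB/s range above.
```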

Management network

How you SSH to servers, fetch logs, push configs.

  • Speed: 25-100 G is fine. Not performance-critical.
  • Connectivity: every server's eth0 (the in-band management interface).
  • Don't put RDMA traffic on this — it gets congested by everything else.

Out-of-band (OOB) network

The "console" network — BMC / iDRAC / iLO. How you reboot a server when it stops responding to in-band.

  • Speed: 1 G is plenty.
  • Critical for ops: without it, you walk to the datacenter.

Budget all three when you plan the pod. Each needs at least a pair of dedicated switches.


A worked example — what a 256-GPU pod actually costs

Order-of-magnitude only. Real prices depend on volume discounts, relationships, and current GPU supply.

| Line item | Approximate cost |
|---|---|
| 32 × HGX H100 servers (8 × H100, 8 × CX-7, dual Xeon, 2 TB RAM) | $7.5M–$10M |
| 8 × Spectrum-4 leaf switches (or Arista 7060X) | $400K |
| 4 × Spectrum-4 spine switches | $300K |
| Cables (DAC + AOC + fiber + optics) | ~$170K |
| Storage + management + OOB networks | $200K |
| Power + cooling + rack + cabling labor | $200K |
| Total infrastructure | ~$8.5M–$11M |

Most of the cost is the GPUs themselves (~$30K each × 256 = $7.5M). The network is roughly 5–10% of the total. That's why "skimping on the network" makes no sense — a 10% network savings to lose 20% of GPU efficiency is a terrible trade.
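
The table as arithmetic, taking the midpoint where it gives a range (the figures are the table's; the variable names are ours):

```python
# Order-of-magnitude BOM for a 256-GPU pod, midpoint of the server range.
bom = {
    "servers":        8_750_000,   # midpoint of $7.5M-$10M
    "leaf_switches":    400_000,
    "spine_switches":   300_000,
    "cabling":          170_000,
    "aux_networks":     200_000,   # storage + management + OOB
    "facilities":       200_000,   # power, cooling, racks, labor
}
total = sum(bom.values())
network = bom["leaf_switches"] + bom["spine_switches"] + bom["cabling"]
print(f"total ~${total / 1e6:.1f}M, network share = {network / total:.0%}")
# total ~$10.0M, network share = 9%
```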


What you should remember

  • Pods scale in powers of 2. 64 / 256 / 1024 are the sweet spots.
  • Pick one server design and don't mix. NCCL doesn't love heterogeneous clusters.
  • Switch vendors are interchangeable at the protocol level (BGP, RoCE v2, PFC, ECN are all standards). Pick on operational fit, not features.
  • Cable length affects PFC headroom. Modern switches auto-detect; older ones need manual config.
  • You're really buying four networks: training, storage, management, OOB. Plan for all of them.
  • The network is ~5–10% of pod cost. Don't optimize it down at the expense of GPU efficiency.

Next: Configure the Fabric → — switch-by-switch, the BGP underlay, QoS, PFC, ECN, and buffer config that makes the fabric lossless.