Sizing & Bill of Materials

You've understood every layer. You now have to convince finance and procurement to actually buy this thing. This page is the shopping list — pod sizes, server choices, switch choices, cables, and the supporting networks you'll forget about until they bite you.

Pod sizing — the four tiers

A pod is one self-contained training fabric, non-blocking at the rated GPU count. Most operators settle on one of these four sizes:

Pod size	Servers	Rails / leaves	Spines	Use case
16 GPUs	2 × 8-GPU	2 (no leaf — direct)	0	Dev / single-job experiments
64 GPUs	8 × 8-GPU	8 (1 per rail, 8 ports each)	0 (2-tier)	Small team training
256 GPUs	32 × 8-GPU	8 (1 per rail, 32 ports each)	4	Production training, single pod
1024 GPUs	128 × 8-GPU	16 (2 per rail with super-spine)	8	Frontier training, multi-pod

Sizing rule of thumb: target the smallest pod that fits your largest expected job. Going bigger costs more in switches; going smaller forces multi-pod jobs across slower super-spine paths.

Pods scale in powers of 2. The 256-GPU pod is the sweet spot for most enterprise AI work: it is non-blocking, it fits cleanly in a single leaf-spine fabric, and its operations are well understood.
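The sizing rule of thumb can be written as a small lookup. This is a minimal sketch, assuming the four tiers from the table and 8-GPU servers; `smallest_pod_for` and `servers_needed` are hypothetical helper names, not part of any vendor tool.

```python
# Pod tiers (GPUs per pod) taken from the sizing table.
POD_SIZES = (16, 64, 256, 1024)
GPUS_PER_SERVER = 8  # every reference server in this guide carries 8 GPUs


def smallest_pod_for(job_gpus: int) -> int:
    """Return the smallest pod tier that fits a job of `job_gpus` GPUs."""
    if job_gpus < 1:
        raise ValueError("job_gpus must be positive")
    for size in POD_SIZES:
        if job_gpus <= size:
            return size
    raise ValueError(
        f"{job_gpus} GPUs exceeds the largest single pod; plan a multi-pod job"
    )


def servers_needed(pod_gpus: int) -> int:
    """Number of 8-GPU servers in a pod of the given size."""
    return pod_gpus // GPUS_PER_SERVER


if __name__ == "__main__":
    for job in (8, 48, 200, 512):
        pod = smallest_pod_for(job)
        print(f"{job}-GPU job -> {pod}-GPU pod ({servers_needed(pod)} servers)")
```

Feed it your largest expected job, not your average one; the rule targets the job that would otherwise spill across pods.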

GPU server — the reference options

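Pick one. Don't mix server types in the same pod: NCCL collectives run at the pace of the slowest rank, so one slower GPU or NIC generation drags down every job that touches it.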

Server	Vendor	GPUs	NIC config	NVLink	Notes
DGX H100	NVIDIA	8 × H100 SXM5 (80 GB)	8 × ConnectX-7 (400 G)	NVSwitch, 900 GB/s	The reference. Hardest to source, easiest to support.
HGX H100 / B200	OEMs (Dell, HPE, Supermicro)	8 × H100 or B200	8 × CX-7 or CX-8 (800 G)	NVSwitch	Same baseboard, broader vendor choice.
MGX (modular)	NVIDIA + OEMs	4 / 8 GPUs, configurable	Customer choice	NVSwitch	Newer; pick your CPU vendor + NIC vendor.
Microsoft Maia	Microsoft (custom)	8 × Maia 100	Built-in 400 G	Custom	Azure-only. Not for sale.
Meta Grand Teton	Meta (custom)	8 × H100	8 × CX-7	NVSwitch	Open Compute design; available through OCP.

For most operators: HGX H100 from a Tier-1 OEM is the practical choice. DGX has the cleanest software story but is supply-constrained; HGX gives you the same baseboard with broader vendor support.

Switches — match the pod size

Pod	Leaf switches	Spine switches
16 GPUs	None (the two servers' NICs are cabled back-to-back; NVLink stays inside each server)	None
64 GPUs	8 × low-radix (16 × 400G each)	None — leaves connect directly
256 GPUs	8 × high-radix (64 × 400G each: 32 down, 32 up)	4 × high-radix (64 × 400G each)
1024+ GPUs	16+ × high-radix	8+ × high-radix + super-spines
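A leaf choice can be sanity-checked against the non-blocking goal by counting ports. This is a minimal sketch, assuming one downlink per server on each rail leaf and uplinks running at the same speed as downlinks; both function names are hypothetical helpers.

```python
def leaf_port_budget(servers: int, uplinks_per_leaf: int) -> dict:
    """Port budget for one rail leaf (one downlink per server on that rail)."""
    if servers < 1 or uplinks_per_leaf < 1:
        raise ValueError("servers and uplinks_per_leaf must be positive")
    downlinks = servers
    return {
        "downlinks": downlinks,
        "uplinks": uplinks_per_leaf,
        "ports_needed": downlinks + uplinks_per_leaf,
        # 1.0 means non-blocking; above 1.0 the leaf is oversubscribed.
        "oversubscription": downlinks / uplinks_per_leaf,
    }


def spine_ports_each(leaves: int, uplinks_per_leaf: int, spines: int) -> int:
    """Ports consumed on each spine when uplinks are spread evenly."""
    total_uplinks = leaves * uplinks_per_leaf
    ports, remainder = divmod(total_uplinks, spines)
    if remainder:
        raise ValueError("uplinks do not divide evenly across spines")
    return ports


if __name__ == "__main__":
    print(leaf_port_budget(servers=32, uplinks_per_leaf=32))
    print(spine_ports_each(leaves=8, uplinks_per_leaf=32, spines=4))
```

For the 256-GPU pod, 32 downlinks matched by 32 uplinks come to 64 ports per leaf, and 8 leaves × 32 uplinks spread over 4 spines fill 64 ports on each spine.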

Switch vendor choices (any of these work — pick based on existing relationships):

Vendor	Leaf option	Spine option
NVIDIA Spectrum-X	Spectrum-4 SN5600 (64×800G, 400G via breakout)	Spectrum-4 SN5600
Arista	7060X6 (64×800G, 400G via breakout)	7800R3 (128×400G modular)
Cisco	Nexus 9332D-H2R (32×400G)	Nexus 9408 (128×400G)
Broadcom Tomahawk-based	Edgecore AS9716 / Celestica DS5000	Same chassis, more ports
Juniper	QFX5240 (64×800G, 400G via breakout)	QFX5240

The Spectrum-X stack is most aggressive on AI-specific features (adaptive routing in silicon, NIC + switch co-tuning). Arista and Tomahawk-based are the open / multi-vendor choices.

Cabling — the part that bites you

Cable type and length set both your cost and how much PFC headroom each port needs.

Type	Distance	Cost per link	Use case
DAC (Direct Attach Copper)	up to 3 m	~$50	Within rack (server to ToR)
AOC (Active Optical Cable)	up to 30 m	~$300	End of row, rack-to-rack
Fiber + transceiver pair	100+ m	~$600 (incl. 2 × 400G optics)	Spine-to-leaf across the DC

Sizing math for a 256-GPU pod:

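32 servers × 8 NICs × DAC to leaf = 256 DAC cables in-rack (256 × $50 ≈ $13K)
8 leaves × 4 spines × 1 cable each = 32 fiber + optics links between leaf and spine (32 × $600 ≈ $19K)

Total cable cost for a 256-GPU pod: roughly $30K in cabling (a rounding error vs the GPUs but worth tracking). Note that one cable per leaf-spine pair is the minimum count and gives each leaf only 4 uplinks against 32 downlinks. A truly non-blocking pod needs 32 uplinks per leaf (8 per spine), or 256 fiber links, which brings cabling to roughly $165K.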
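The cable arithmetic is easy to script so it can be rerun for other pod sizes. This is a minimal sketch using the per-link prices from the cable table; `links_per_leaf_spine_pair` is a hypothetical parameter added here to show how the total moves with uplink count.

```python
DAC_COST_USD = 50     # per in-rack copper link, from the cable table
FIBER_COST_USD = 600  # per fiber link, including two 400G optics


def cable_bom(
    servers: int = 32,
    nics_per_server: int = 8,
    leaves: int = 8,
    spines: int = 4,
    links_per_leaf_spine_pair: int = 1,
) -> dict:
    """Count cables and estimate cost for one pod."""
    dac_links = servers * nics_per_server  # one DAC per NIC
    fiber_links = leaves * spines * links_per_leaf_spine_pair
    cost = dac_links * DAC_COST_USD + fiber_links * FIBER_COST_USD
    return {"dac_links": dac_links, "fiber_links": fiber_links, "cost_usd": cost}


if __name__ == "__main__":
    print(cable_bom())                             # one link per leaf-spine pair
    print(cable_bom(links_per_leaf_spine_pair=8))  # 32 uplinks per leaf
```

With the defaults this gives 256 DAC links, 32 fiber links, and $32,000. Setting `links_per_leaf_spine_pair=8` gives each leaf 32 uplinks to match its 32 downlinks and raises the total to $166,400.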

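Cable length affects PFC headroom: the longer the cable, the more data is still in flight after a pause frame is sent, and the more buffer the receiving port must reserve to absorb it. Most modern switches auto-detect the cable and set headroom accordingly. Older switches require manual configuration, and getting it wrong causes silent drops. Buy cables in the lengths you actually need; mixed-length runs make debugging harder.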
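To get a feel for how headroom grows with cable length, here is a simplified estimate. It is a sketch under stated assumptions, not a vendor formula: it assumes about 5 ns of propagation delay per metre, counts one round trip of in-flight data plus one maximum-size frame in each direction, and ignores PFC response time and switch pipeline delay, which real sizing guides add on top.

```python
import math

PROPAGATION_NS_PER_M = 5.0  # assumption: roughly 5 ns per metre of cable


def pfc_headroom_bytes(cable_m: float, rate_gbps: float, mtu_bytes: int = 9216) -> int:
    """Rough lower bound on per-port, per-priority PFC headroom in bytes."""
    if cable_m < 0 or rate_gbps <= 0:
        raise ValueError("cable_m must be >= 0 and rate_gbps must be > 0")
    round_trip_ns = 2 * cable_m * PROPAGATION_NS_PER_M
    # nanoseconds x gigabits/second = bits on the wire during the round trip
    in_flight_bits = round_trip_ns * rate_gbps
    # Add one max-size frame per direction that may already be mid-transmission.
    return math.ceil(in_flight_bits / 8) + 2 * mtu_bytes


if __name__ == "__main__":
    for length in (3, 30, 100):
        print(f"{length:>3} m at 400G: {pfc_headroom_bytes(length, 400):,} bytes")
```

Under these assumptions a 3 m DAC needs about 20 KB and a 100 m fiber run about 68 KB per lossless priority, which is why a headroom value tuned for in-rack copper can drop frames on a long spine link.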

The supporting networks you'll forget about

It's not just the training fabric. A real cluster has three more networks running in parallel:

Storage network

Training data lives somewhere — NVMe-over-Fabrics, Lustre, GPFS, S3-compatible object store. For a 256-GPU pod:

Throughput needed: depends on training step time. Often 100-400 GB/s aggregate read.
Topology: usually a separate spine-leaf, or shared with training fabric but on a different priority class.
NIC: sometimes a separate storage NIC per server (often slower — 100-200 G), or shared on the training NIC with QoS isolation.

Management network

How you SSH to servers, fetch logs, push configs.

Speed: 25-100 G is fine. Not performance-critical.
Connectivity: every server's eth0 (the in-band management interface).
Don't put RDMA traffic on this — it gets congested by everything else.

Out-of-band (OOB) network

The "console" network — BMC / iDRAC / iLO. How you reboot a server when it stops responding to in-band.

Speed: 1 G is plenty.
Critical for ops: without it, you walk to the datacenter.

Budget all three when you plan the pod. They each need ~2 dedicated switches at a minimum.

A worked example — what a 256-GPU pod actually costs

Order-of-magnitude only. Real prices depend on volume discounts, relationships, and current GPU supply.

Line item	Approximate cost
32 × HGX H100 servers (8 × H100, 8 × CX-7, dual Xeon, 2 TB RAM)	$7.5M–$10M
8 × Spectrum-4 leaf switches (or Arista 7060X)	$400K
4 × Spectrum-4 spine switches	$300K
Cables (DAC + AOC + fiber + optics)	$30K at one link per leaf-spine pair (~$165K with full non-blocking uplinks)
Storage + management + OOB networks	$200K
Power + cooling + rack + cabling labor	$200K
Total infrastructure	~$8M–$11M

Most of the cost is the GPUs themselves (~$30K each × 256 ≈ $7.7M). The network is roughly 5–10% of the total. That's why "skimping on the network" makes no sense: cutting the network budget by 10% saves about 1% of the pod, while losing 20% of GPU efficiency throws away roughly $1.5M of GPU spend.
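The roll-up can be checked in a few lines. This is a minimal sketch using the low and high estimates from the line items above, with cables at $30K; treating switches, cables, and the supporting networks as "the network" is an assumption made here, and the dictionary keys are illustrative names.

```python
# (low, high) estimates in USD, copied from the cost table.
LINE_ITEMS = {
    "servers": (7_500_000, 10_000_000),
    "leaf_switches": (400_000, 400_000),
    "spine_switches": (300_000, 300_000),
    "cables": (30_000, 30_000),
    "support_networks": (200_000, 200_000),  # storage + management + OOB
    "facility": (200_000, 200_000),          # power, cooling, racks, labor
}
# Assumption: these items together count as "the network".
NETWORK_ITEMS = ("leaf_switches", "spine_switches", "cables", "support_networks")

LOW, HIGH = 0, 1


def total(bound: int) -> int:
    """Sum every line item at the low (0) or high (1) estimate."""
    return sum(cost[bound] for cost in LINE_ITEMS.values())


def network_share(bound: int) -> float:
    """Fraction of the pod total spent on network items."""
    network = sum(LINE_ITEMS[name][bound] for name in NETWORK_ITEMS)
    return network / total(bound)


if __name__ == "__main__":
    for label, bound in (("low", LOW), ("high", HIGH)):
        print(f"{label}: total ${total(bound):,}, network {network_share(bound):.1%}")
```

With these inputs the totals come to $8.63M and $11.13M, and the network share lands between roughly 8% and 11%, at the upper end of the range quoted above.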

What you should remember

Pods scale in powers of 2. 64 / 256 / 1024 are the sweet spots.
Pick one server design and don't mix. NCCL doesn't love heterogeneous clusters.
Switch vendors are interchangeable at the protocol level (BGP, RoCE v2, PFC, ECN are all standards). Pick on operational fit, not features.
Cable length affects PFC headroom. Modern switches auto-detect; older ones need manual config.
You're really buying four networks: training, storage, management, OOB. Plan for all of them.
The network is ~5–10% of pod cost. Don't optimize it down at the expense of GPU efficiency.

Next: Configure the Fabric → — switch-by-switch, the BGP underlay, QoS, PFC, ECN, and buffer config that makes the fabric lossless.
