
Sizing & Bill of Materials

You've understood every layer. You now have to convince finance and procurement to actually buy this thing. This page is the shopping list — pod sizes, server choices, switch choices, cables, and the supporting networks you'll forget about until they bite you.

After this page, you'll be able to
  1. Size a pod — pick from the four tiers (16 / 64 / 256 / 1024 GPUs), map it to leaf/spine counts, and justify why 256 GPUs is the enterprise sweet spot.
  2. Spec the BoM — choose a GPU server (HGX H100 from a Tier-1 OEM), pick interchangeable leaf/spine switches (Spectrum-4, Arista 7060X, Tomahawk white-box), and pick cable types (DAC <3 m, AOC <30 m, fiber 100 m+) by distance and PFC headroom.
  3. Plan the four networks — training plus the three everyone forgets: storage (100–400 GB/s), management (eth0, 25–100 G), and OOB (BMC/iDRAC, 1 G).
  4. Defend the budget: show the network is only ~5–10% of a ~$8M–$11M 256-GPU pod, so skimping there at the expense of GPU efficiency is a bad trade.

Pod sizing — the four tiers

A pod is one self-contained training fabric, non-blocking at the rated GPU count. Most operators settle on one of these four sizes:

| Pod size | Servers | Rails / leaves | Spines | Use case |
|---|---|---|---|---|
| 16 GPUs | 2 × 8-GPU | 0 (direct server-to-server DAC, one per rail) | 0 | Dev / single-job experiments |
| 64 GPUs | 8 × 8-GPU | 8 (1 per rail, 8 ports each) | 0 (2-tier) | Small team training |
| 256 GPUs | 32 × 8-GPU | 8 (1 per rail, 32 ports each) | 4 | Production training, single pod |
| 1024 GPUs | 128 × 8-GPU | 16 (2 per rail with super-spine) | 8 | Frontier training, multi-pod |

Sizing rule of thumb: target the smallest pod that fits your largest expected job. Going bigger costs more in switches; going smaller forces multi-pod jobs across slower super-spine paths.

Pods scale in powers of 2. The 256-GPU pod is the sweet spot for most enterprise AI work: it is non-blocking, it fits cleanly into a 3-tier topology (servers, leaves, spines), and its operations are well understood.
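
The rule of thumb above (smallest pod that fits your largest job) is easy to turn into a lookup. This is a minimal sketch: the tier numbers are copied from the table above, while `POD_TIERS` and `pick_pod` are names made up for illustration.

```python
# Pod tiers copied from the sizing table: GPUs -> (servers, leaves, spines).
POD_TIERS = {
    16: (2, 0, 0),
    64: (8, 8, 0),
    256: (32, 8, 4),
    1024: (128, 16, 8),
}


def pick_pod(largest_job_gpus: int) -> int:
    """Return the smallest tier whose GPU count covers the largest expected job."""
    if largest_job_gpus < 1:
        raise ValueError("a job must use at least one GPU")
    for size in sorted(POD_TIERS):
        if largest_job_gpus <= size:
            return size
    raise ValueError("job exceeds the largest tier; plan for multiple pods")


if __name__ == "__main__":
    for job in (8, 48, 200, 512):
        size = pick_pod(job)
        servers, leaves, spines = POD_TIERS[size]
        print(f"{job}-GPU job -> {size}-GPU pod: "
              f"{servers} servers, {leaves} leaves, {spines} spines")
```

A 200-GPU job lands in the 256-GPU tier, which is the case most enterprise teams end up in.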


Which pod size should you pick?

Map your situation to a tier before you spec a single switch. Your job is to pick the smallest pod that fits your largest job, then adjust for your primary constraint — cost, isolation, scale, or upgrade tolerance.

| I want… | Pick | Why |
|---|---|---|
| Lowest-cost dev / a few-GPU experiments | 16 GPUs | Zero leaves, zero spines: direct server-to-server DAC per rail, so you pay for no switches at all. |
| To train for one small team on a budget | 64 GPUs | 8 leaves, no spines: a flat 2-tier fabric that's non-blocking with the fewest switches of any tier above 16 GPUs. |
| Production training in a single tenant | 256 GPUs | The enterprise sweet spot: non-blocking 3-tier (8 leaves, 4 spines) that fits cleanly with no super-spine to operate. |
| Smallest blast radius / rolling upgrades | 256 GPUs | A pod sized to your largest job bounds the blast radius: a leaf or spine drain takes down one pod's worth of GPUs, not the whole fleet. |
| To host many teams without cross-tenant noise | Multiple 256-GPU pods | Per-pod maintenance and isolation: each team lives in its own non-blocking pod instead of contending on shared spines. |
| Aggressive frontier-scale training | 1024 GPUs | 16 leaves + 8 spines + super-spine. This is the only tier that goes multi-pod; accept the added super-spine complexity to run a single job past 256 GPUs. |

GPU server — the reference options

Pick one, and don't mix server types in the same pod. NCCL collectives run at the pace of the slowest rank, so heterogeneous hardware drags the whole job down.

| Server | Vendor | GPUs | NIC config | GPU interconnect | Notes |
|---|---|---|---|---|---|
| DGX H100 | NVIDIA | 8 × H100 SXM5 (80 GB) | 8 × ConnectX-7 (400 G) | NVSwitch, 900 GB/s | The reference. Hardest to source, easiest to support. |
| HGX H100 / B200 | OEMs (Dell, HPE, Supermicro) | 8 × H100 or B200 | 8 × CX-7 or CX-8 (800 G) | NVSwitch | Same baseboard, broader vendor choice. |
| MGX (modular) | NVIDIA + OEMs | 4 / 8 GPUs, configurable | Customer choice | NVSwitch | Newer; pick your CPU vendor + NIC vendor. |
| OCP Grand Teton (MI300X) | OCP / OEMs (AMD) | 8 × MI300X (192 GB) | 8 × Broadcom Thor or CX-7 (400 G) | Infinity Fabric (xGMI), ~896 GB/s–1.2 TB/s | Open Compute design; 192 GB HBM per GPU vs 80 GB. The AMD second source. |
| HPE Cray XD | HPE (AMD) | 8 × MI300X (192 GB) | 8 × Broadcom Thor or CX-7 (400 G) | Infinity Fabric (xGMI) | Tier-1 OEM path for Instinct; same RoCE v2 back-end as NVIDIA. |
| Microsoft Maia | Microsoft (custom) | 8 × Maia 100 | Built-in 400 G | Custom | Azure-only. Not for sale. |
| Meta Grand Teton | Meta (custom) | 8 × H100 | 8 × CX-7 | NVSwitch | Open Compute design; available through OCP. |

For most operators: HGX H100 from a Tier-1 OEM is the practical choice. DGX has the cleanest software story but is supply-constrained; HGX gives you the same baseboard with broader vendor support.

If memory capacity per GPU is the constraint — large models, KV-cache-heavy serving — an MI300X platform (OCP Grand Teton or HPE Cray XD) is the credible AMD alternative: 192 GB HBM3 per GPU vs 80 GB, with the same RoCE v2 back-end fabric (Broadcom Thor or ConnectX-7 NICs). The software story is ROCm/RCCL instead of CUDA/NCCL — RCCL is a drop-in for NCCL and reuses the same NCCL_* env-var names. Don't mix vendors in one pod either way.


Switches — match the pod size

| Pod | Leaf switches | Spine switches |
|---|---|---|
| 16 GPUs | None (direct server-to-server DAC, one per rail) | None |
| 64 GPUs | 8 × low-radix (16 × 400G each) | None (servers attach directly to the rail leaves) |
| 256 GPUs | 8 × high-radix (64 × 400G each: 32 down, 32 up for non-blocking) | 4 × high-radix (64 × 400G each) |
| 1024+ GPUs | 16+ × high-radix | 8+ × high-radix + super-spines |
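
Whether a leaf is non-blocking comes down to one ratio: downlink bandwidth over uplink bandwidth. Here is a minimal sketch of that check. The function names are hypothetical, and it assumes one 400G downlink per server on each rail leaf.

```python
def oversubscription(down_ports: int, up_ports: int,
                     down_gbps: int = 400, up_gbps: int = 400) -> float:
    """Downlink-to-uplink bandwidth ratio for one leaf; 1.0 means non-blocking."""
    if up_ports <= 0:
        raise ValueError("a leaf with no uplinks has no ratio (single-tier fabric)")
    return (down_ports * down_gbps) / (up_ports * up_gbps)


def leaf_ports_for_non_blocking(servers: int) -> int:
    """Ports one rail leaf needs: one downlink per server plus an equal number of uplinks."""
    return 2 * servers


if __name__ == "__main__":
    # 256-GPU pod: each rail leaf serves 32 servers.
    print(leaf_ports_for_non_blocking(32))   # ports needed on each leaf
    print(oversubscription(32, 32))          # equal uplinks: non-blocking
    print(oversubscription(32, 16))          # half the uplinks: 2:1 oversubscribed
```

Run it against whatever port split you are quoted before you sign; anything above 1.0 means the fabric is no longer non-blocking at the rated GPU count.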

Switch vendor choices (any of these work — pick based on existing relationships):

| Vendor | Leaf option | Spine option |
|---|---|---|
| NVIDIA Spectrum-X | Spectrum-4 SN5600 (64×800G, or 128×400G via breakout) | Spectrum-4 SN5600 |
| Arista | 7060X6 (up to 64×800G) | 7800R3 (128×400G modular) |
| Cisco | Nexus 9332D-H2R (32×400G) | Nexus 9408 (128×400G) |
| Broadcom Tomahawk-based | Edgecore AS9716 / Celestica DS5000 | Same chassis, more ports |
| Juniper | QFX5240 (64×800G) | QFX5240 |

These are a trade-off, not a ranking — all are lossless-capable, so pick on operational fit:

  • NVIDIA Spectrum-X — adaptive routing in silicon plus NIC + switch co-tuning. Tightest integration if you're already all-NVIDIA end to end.
  • Arista R3 — open telemetry and deep buffers, strong observability story, multi-vendor NICs.
  • Broadcom Tomahawk / SONiC — the most open path: white-box hardware, your choice of NOS, no single-vendor lock-in.

Each runs the same standards on the wire (BGP, RoCE v2, PFC, ECN), so the decision is about ops fit and existing relationships, not raw capability.


Cabling — the part that bites you

Cable type and length drive two things: what you pay per link and how much PFC headroom each port needs.

| Type | Distance | Cost per link | Use case |
|---|---|---|---|
| DAC (Direct Attach Copper) | up to 3 m | ~$50 | Within rack (server to ToR) |
| AOC (Active Optical Cable) | up to 30 m | ~$300 | End of row, rack-to-rack |
| Fiber + transceiver pair | 100+ m | ~$600 (incl. 2 × 400G optics) | Spine-to-leaf across the DC |

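Sizing math for a 256-GPU pod:

  • 32 servers × 8 NICs × 1 DAC to the leaf = 256 DAC cables in-rack (≈ $13K)
  • 8 leaves × 4 spines × 1 link each = 32 fiber links with optics between leaf and spine (≈ $19K)

Total cable cost for a 256-GPU pod: roughly $30K (a rounding error next to the GPUs, but worth tracking). Note that one link per leaf-spine pair gives each leaf only 4 × 400G of uplink against 32 × 400G of downlink. A strictly non-blocking fabric needs 32 uplinks per leaf (8 per spine, 256 links in total), which at ~$600 per link puts the fiber line nearer $150K. Size this line to the oversubscription you actually intend to run.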
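
The cable count is plain multiplication, so it is worth scripting once and rerunning whenever the topology changes. A rough sketch using the per-link prices from the table above; `cable_bom` and its `links_per_leaf_spine_pair` parameter are illustrative names, not part of any vendor tool.

```python
DAC_USD = 50      # per in-rack DAC cable (table above)
FIBER_USD = 600   # per fiber link, including two 400G optics (table above)


def cable_bom(servers: int = 32, nics_per_server: int = 8,
              leaves: int = 8, spines: int = 4,
              links_per_leaf_spine_pair: int = 1) -> dict:
    """Count server-to-leaf DACs and leaf-to-spine fiber links, then price them."""
    dac = servers * nics_per_server
    fiber = leaves * spines * links_per_leaf_spine_pair
    return {
        "dac": dac,
        "fiber": fiber,
        "cost_usd": dac * DAC_USD + fiber * FIBER_USD,
    }


if __name__ == "__main__":
    print(cable_bom())                             # one link per leaf-spine pair
    print(cable_bom(links_per_leaf_spine_pair=8))  # uplinks matched to 32 downlinks
```

With the defaults this gives 256 DACs and 32 fiber links for $32,000. Setting `links_per_leaf_spine_pair=8`, so each leaf has as many uplinks as downlinks, raises it to 256 fiber links and $166,400.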

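Cable length affects PFC headroom: after a switch sends a pause frame, data already in flight on the wire keeps arriving, and the longer the cable, the more bytes the headroom buffer has to absorb. Most modern switches detect cable length and size headroom automatically. Older switches require manual configuration, and getting it wrong causes silent drops. Buy cables in the lengths you actually need; mixed-length runs make debugging harder.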
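
To see why length matters, here is a deliberately simplified headroom estimate: bytes in flight over one round trip, plus one maximum-size frame in each direction. Everything in it is an assumption for illustration (about 5 ns of propagation delay per metre, a 9216-byte jumbo MTU, no allowance for PFC response time or port latency), so treat your switch vendor's formula as the authority.

```python
import math

PROPAGATION_NS_PER_M = 5.0  # assumption: signal travels at roughly 2/3 the speed of light


def pfc_headroom_bytes(cable_m: float, link_gbps: float = 400.0,
                       mtu_bytes: int = 9216) -> int:
    """Rough per-port headroom: round-trip in-flight bytes plus two MTU-sized frames."""
    round_trip_ns = 2 * cable_m * PROPAGATION_NS_PER_M
    in_flight_bytes = round_trip_ns * link_gbps / 8  # ns x Gb/s = bits
    return math.ceil(in_flight_bytes + 2 * mtu_bytes)


if __name__ == "__main__":
    for length_m in (3, 30, 100):
        print(f"{length_m:>3} m at 400G -> {pfc_headroom_bytes(length_m)} bytes")
```

Under these assumptions a 100 m run needs several times the headroom of a 3 m DAC, which is why a manually configured switch set up for short cables drops traffic on long ones.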


The supporting networks you'll forget about

It's not just the training fabric. A real cluster has three more networks running in parallel:

Storage network

Training data lives somewhere — NVMe-over-Fabrics, Lustre, GPFS, S3-compatible object store. For a 256-GPU pod:

  • Throughput needed: depends on training step time. Often 100-400 GB/s aggregate read.
  • Topology: usually a separate spine-leaf, or shared with training fabric but on a different priority class.
  • NIC: sometimes a separate storage NIC per server (often slower — 100-200 G), or shared on the training NIC with QoS isolation.
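
The aggregate figure turns into a per-server NIC requirement with one division. A small sketch, assuming reads are spread evenly across the pod's 32 servers (real jobs can be burstier); the function name is made up for illustration.

```python
def per_server_storage_gbps(aggregate_gbytes_per_s: float, servers: int = 32) -> float:
    """Convert an aggregate read rate in GB/s into Gb/s needed per server."""
    if servers <= 0:
        raise ValueError("need at least one server")
    return aggregate_gbytes_per_s * 8 / servers


if __name__ == "__main__":
    for aggregate in (100, 400):
        gbps = per_server_storage_gbps(aggregate)
        print(f"{aggregate} GB/s aggregate -> {gbps:.0f} Gb/s per server")
```

At 100–400 GB/s aggregate, each server needs roughly 25–100 Gb/s, which is why a 100–200 G storage NIC usually leaves comfortable headroom.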

Management network

How you SSH to servers, fetch logs, push configs.

  • Speed: 25-100 G is fine. Not performance-critical.
  • Connectivity: every server's eth0 (the in-band management interface).
  • Don't put RDMA traffic on this — it gets congested by everything else.
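
One way to keep RDMA off the management network is to tell NCCL which interfaces to use. The sketch below uses two real NCCL environment variables: `NCCL_SOCKET_IFNAME` (the IP interface for bootstrap and socket traffic) and `NCCL_IB_HCA` (the RDMA devices to use). The `mlx5_*` device names and the `nccl_env` helper are assumptions for illustration; check your own hosts for the actual names.

```python
import os


def nccl_env(rdma_devices: list[str], bootstrap_ifname: str = "eth0") -> dict[str, str]:
    """Build NCCL_* settings that pin RDMA traffic to the training NICs."""
    if not rdma_devices:
        raise ValueError("list at least one RDMA device")
    return {
        # TCP bootstrap and control sockets use the management interface.
        "NCCL_SOCKET_IFNAME": bootstrap_ifname,
        # RDMA data traffic is restricted to the listed training NICs.
        "NCCL_IB_HCA": ",".join(rdma_devices),
    }


if __name__ == "__main__":
    devices = [f"mlx5_{i}" for i in range(8)]  # hypothetical device names
    env = nccl_env(devices)
    os.environ.update(env)  # must be set before the training process initializes NCCL
    for key, value in env.items():
        print(f"{key}={value}")
```

Bootstrap traffic is light, so leaving it on eth0 is fine; the point is that bulk collective traffic never falls back to the management interface.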

Out-of-band (OOB) network

The "console" network — BMC / iDRAC / iLO. How you reboot a server when it stops responding to in-band.

  • Speed: 1 G is plenty.
  • Critical for ops: without it, you walk to the datacenter.

Budget all three when you plan the pod. Each needs roughly two dedicated switches at a minimum.


A worked example — what a 256-GPU pod actually costs

Order-of-magnitude only. Real prices depend on volume discounts, relationships, and current GPU supply.

| Line item | Approximate cost |
|---|---|
| 32 × HGX H100 servers (8 × H100, 8 × CX-7, dual Xeon, 2 TB RAM) | $7.5M–$10M |
| 8 × Spectrum-4 leaf switches (or Arista 7060X) | $400K |
| 4 × Spectrum-4 spine switches | $300K |
| Cables (DAC + AOC + fiber + optics) | $30K |
| Storage + management + OOB networks | $200K |
| Power + cooling + rack + cabling labor | $200K |
| Total infrastructure | ~$8M–$11M |

Most of the cost is the GPUs themselves (~$30K each × 256 ≈ $7.7M). The network is roughly 5–10% of the total. That's why "skimping on the network" makes no sense: cutting the network bill by 10% saves under 1% of the pod's cost, while losing 20% of GPU efficiency wastes roughly $1.5M of GPU spend. It is a terrible trade.
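
The budget argument is easy to check yourself. The sketch below sums the line items from the table above and compares a 10% network saving against a 20% GPU-efficiency loss. The dictionary keys and function names are illustrative, and "network" here means the training fabric only (leaves, spines, cables).

```python
# (low, high) estimates in USD, copied from the cost table.
LINE_ITEMS = {
    "servers": (7_500_000, 10_000_000),
    "leaf switches": (400_000, 400_000),
    "spine switches": (300_000, 300_000),
    "cables": (30_000, 30_000),
    "supporting networks": (200_000, 200_000),
    "power, cooling, racks, labor": (200_000, 200_000),
}
NETWORK_ITEMS = ("leaf switches", "spine switches", "cables")


def pod_total() -> tuple[int, int]:
    """Low and high totals across every line item."""
    low = sum(lo for lo, _ in LINE_ITEMS.values())
    high = sum(hi for _, hi in LINE_ITEMS.values())
    return low, high


def network_cost() -> int:
    """Training-fabric cost: leaves, spines, and cables."""
    return sum(LINE_ITEMS[name][0] for name in NETWORK_ITEMS)


def skimping_trade(network_cut: float = 0.10, efficiency_loss: float = 0.20,
                   gpu_usd: int = 30_000, gpus: int = 256) -> tuple[float, float]:
    """Dollars saved on the network vs. GPU dollars wasted by lost efficiency."""
    return network_cost() * network_cut, gpu_usd * gpus * efficiency_loss


if __name__ == "__main__":
    low, high = pod_total()
    print(f"total: ${low:,} to ${high:,}")
    print(f"network share: {network_cost() / high:.1%} to {network_cost() / low:.1%}")
    saved, wasted = skimping_trade()
    print(f"saved ${saved:,.0f} on the network, wasted ${wasted:,.0f} of GPU spend")
```

The line items sum to about $8.6M–$11.1M, which the table rounds to ~$8M–$11M. The training fabric is about 7–8% of that, or roughly 8–11% once you add the supporting networks.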


💡 What you should remember

| # | Concept | Why it matters |
|---|---|---|
| 1 | 🔢 Pods scale in powers of 2. | 64 / 256 / 1024 are the sweet spots. |
| 2 | 🖥️ Pick one server design and don't mix. | NCCL doesn't love heterogeneous clusters. |
| 3 | 🔁 Switch vendors are interchangeable at the protocol level | (BGP, RoCE v2, PFC, ECN are all standards). Pick on operational fit, not features. |
| 4 | 🔌 Cable length affects PFC headroom. | Modern switches auto-detect; older ones need manual config. |
| 5 | 🌐 You're really buying four networks: | training, storage, management, OOB. Plan for all of them. |
| 6 | 💰 The network is ~5–10% of pod cost. | Don't optimize it down at the expense of GPU efficiency. |

Next: Configure the Fabric → — switch-by-switch, the BGP underlay, QoS, PFC, ECN, and buffer config that makes the fabric lossless.