Inside a GPU Server
You've worked with servers your whole career, but an AI training server is built differently: 4–8 GPUs on a baseboard, a terabytes-per-second fabric connecting them to each other, and 1–8 RDMA NICs exposing them to the rest of the cluster. Here's what's in the box, what each piece does, and which vendors you'll see.
Inside a GPU server
A modern AI training server packages 4–8 GPUs on a baseboard (HGX / MGX / OAM), connects them to each other through a scale-up fabric, and exposes them to the rest of the cluster through one or more RDMA NICs.
GPUs — the compute
The current training-class GPUs:
- NVIDIA — H100 (Hopper, 2022), H200 (more HBM3e, 2024), B100 / B200 (Blackwell, 2024–25), GB200 (Blackwell + Grace CPU)
- AMD — MI300X (CDNA 3, 2023), MI325X (2024), MI350 / MI400 (roadmap)
- Intel — Gaudi 3 (2024)
- Google — TPU v5p, Trillium (cloud-only, you don't put these in a server you own)
NVIDIA is the dominant vendor in 2026 by a wide margin. AMD's MI300X is the credible second source; Microsoft, Meta, and Oracle have all bought it. Gaudi 3 trails both.
NVLink, NVSwitch — the scale-up fabric
Inside the box, every GPU talks to every other GPU at TB/s. NVLink is the wire; NVSwitch is the silicon that makes it a fabric (not just point-to-point links).
- NVLink (NVIDIA, proprietary) — 1.8 TB/s per GPU on B100 / B200
- UALink (AMD, Broadcom, Cisco, Google, HPE, Intel, Meta, Microsoft) — open alternative to NVLink; the v1.0 spec was published in 2025
- SUE (Broadcom Scale-Up Ethernet) — open, simpler than UET (Ultra Ethernet Transport); ≤1.6 Tbps, ~100 ns device latency
- ICI (Google) — TPU-only
A GB200 NVL72 puts 72 GPUs on a single NVLink Switch fabric — that's one logical machine over scale-up. Scale-out (RDMA) only takes over at the rack boundary.
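A quick back-of-the-envelope shows why that boundary matters. This is a sketch using the headline figures above, assuming one 800G NIC per GPU:

```python
# GB200 NVL72: bandwidth inside the scale-up domain vs. out to the rest of the cluster.
# NVLink figures are totals across both directions, so the NIC figure is doubled to match.
gpus = 72
nvlink_tb_per_gpu = 1.8                     # TB/s, NVLink 5
nic_tb_per_gpu    = 2 * 800 / 8 / 1000      # 800 Gbit/s each way -> 0.2 TB/s

print(f"scale-up  : {gpus * nvlink_tb_per_gpu:.1f} TB/s inside the rack")     # ~129.6
print(f"scale-out : {gpus * nic_tb_per_gpu:.1f} TB/s out of the rack")        # ~14.4
print(f"gap       : ~{nvlink_tb_per_gpu / nic_tb_per_gpu:.0f}x per GPU")      # ~9x
```

The order-of-magnitude gap is the whole point of the scale-up domain: collectives that stay inside it never touch the RDMA fabric.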
RDMA NICs — the scale-out fabric starts here
NICs connect the server to the rest of the cluster. Production AI servers typically have 1 NIC per GPU ("rail-optimized") — so 8 GPUs = 8 NICs.
- NVIDIA ConnectX-7 (400G), ConnectX-8 (800G, 2024)
- NVIDIA BlueField-3 — DPU (NIC + ARM cores + offload engines; runs the network stack on the card)
- Broadcom Thor (400G), Thor2 (800G)
- Intel E810 (200G), E2100 IPU (400G; co-developed with Google to run Falcon)
These run RoCE v2 by default — the transport this curriculum picked back in Section 1.
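To picture the one-NIC-per-GPU layout, here's a toy sketch; the `nic0`-style names are illustrative, and real systems derive the pairing from PCIe topology (NCCL picks the NIC nearest each GPU):

```python
# Toy rail-optimized server: GPU i sends its cross-server traffic out NIC i ("rail i").
# The gpu->nic pairing below is illustrative; real servers derive it from PCIe topology.
NUM_GPUS = 8

def nic_for_gpu(gpu: int) -> str:
    return f"nic{gpu}"          # one NIC per GPU, same index = same rail

for gpu in range(NUM_GPUS):
    print(f"gpu{gpu} -> {nic_for_gpu(gpu)} -> rail {gpu}")
```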
CPUs, memory, storage NICs
The non-GPU pieces, briefly:
- CPUs — dual AMD EPYC (Genoa, Bergamo, Turin) or Intel Xeon (Sapphire Rapids, Emerald Rapids, Granite Rapids)
- RAM — 1–4 TB per server
- Storage NICs — sometimes separate NICs (100G/200G) for the storage path (reading training data, writing checkpoints) to keep the training fabric clean. Often the same ConnectX silicon but a smaller speed grade.
Worked example — NVIDIA DGX H100
The reference server. 8× H100 GPUs (SXM5) on a baseboard, 4× NVSwitch chips wiring them in an all-to-all NVLink mesh, 8× ConnectX-7 NICs (one per GPU, rail-optimized), 2× Intel Xeon Platinum CPUs, 2 TB DDR5. Eight 400G ports out the back into the fabric.
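A quick sanity check on what those ports add up to (one direction only):

```python
# DGX H100 scale-out capacity: eight ConnectX-7 NICs at 400 Gbit/s each.
nics, gbit_per_nic = 8, 400
total_gbit = nics * gbit_per_nic
print(f"{total_gbit} Gbit/s into the fabric ({total_gbit // 8} GB/s), one 400G port per GPU")
# -> 3200 Gbit/s = 400 GB/s; the NVLink fabric inside moves 900 GB/s per GPU
```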
Most non-NVIDIA reference designs (Microsoft Maia, Meta MTIA hosts, OCP MGX) follow the same shape: 4 or 8 GPUs, 1 NIC per GPU, dual CPUs.
The rack — top-of-rack (ToR) switches
Servers don't talk to the fabric directly. Each rack has one or more ToR switches that aggregate the NICs.
Common ToR vendors for AI fabrics:
- NVIDIA Spectrum-4 — purpose-built AI-Ethernet ASIC, anchors the Spectrum-X reference design (Spectrum + ConnectX + BlueField, all NVIDIA)
- Arista 7060X / 7800 — workhorse AI-fabric switches (Meta, Oracle, multiple AI clouds)
- Cisco Nexus 9300 / 9400 — Cisco's AI infrastructure line
- Broadcom Tomahawk-based — Tomahawk 4/5 silicon in white-box switches from Edgecore and Celestica, and in Meta's open OCP designs (Wedge)
- Juniper QFX — less common in AI deployments but present in some clouds
Port speeds in 2026 are mostly 400G or 800G, with 1.6T on the near horizon. A single 64×400G ToR has enough radix for 32 server-facing ports plus 32 uplinks.
Oversubscription — most AI fabrics are 1:1 (uplink bandwidth = downlink bandwidth). Compare to traditional DC where 4:1 or higher is normal. AI fabrics can't tolerate the head-of-line blocking that oversubscription creates under collective bursts.
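A sketch of the radix math behind those numbers (the 64-port figure is from the text; the split itself is just arithmetic):

```python
# How a 64-port 400G ToR is carved up at different oversubscription ratios.
def tor_split(ports: int = 64, oversub: float = 1.0):
    """Return (server-facing ports, uplink ports) for a given oversubscription ratio."""
    down = int(ports * oversub / (oversub + 1))
    up = ports - down
    return down, up

for ratio in (1.0, 4.0):
    down, up = tor_split(64, ratio)
    print(f"{ratio:.0f}:1 -> {down} server ports, {up} uplinks")
# 1:1 -> 32 server ports, 32 uplinks: the 32 servers quoted above
# 4:1 -> 51 server ports, 13 uplinks: more servers per ToR, but collective bursts congest the uplinks
```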
The fabric — spine-leaf and rails
Beyond the rack, multiple ToRs aggregate up through spine switches. Two common patterns:
Spine-leaf (classic Clos) — every leaf connects to every spine. Each GPU NIC can reach any other GPU through one spine hop. Simple, well-understood, works for thousands of GPUs.
Rail-optimized — each GPU rail (GPU 0, GPU 1, … GPU 7) has its own dedicated leaf + spine pair. GPU 0 on every server connects to "rail 0 leaf," GPU 1 to "rail 1 leaf," and so on. Less cross-rail traffic, smaller blast radius per rail. The dominant pattern for 10K+ GPU clusters.
Pod sizing — a "pod" is a self-contained training fabric, typically 256–2048 GPUs. Larger jobs stitch pods together across super-spines (a third tier).
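Where the upper end of that range comes from: a non-blocking two-tier leaf-spine fabric built from radix-R switches tops out at R²/2 endpoints. A sketch, assuming identical switches at both tiers:

```python
# Max GPUs in a non-blocking two-tier (leaf-spine) pod built from radix-R switches.
# Each leaf: R/2 ports down to GPU NICs, R/2 ports up to spines.
# Each spine connects to every leaf, so the leaf count is capped at R.
def max_pod_gpus(radix: int) -> int:
    leaves = radix
    gpus_per_leaf = radix // 2
    return leaves * gpus_per_leaf

for radix in (32, 64, 128):
    print(f"{radix}-port switches -> up to {max_pod_gpus(radix)} GPUs per pod")
# 64-port 400G/800G switches give the familiar 2048-GPU pod; bigger jobs add a third tier
```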
We'll do topology + sizing in its own section. For now: a GPU server lives in a rack, the rack has ToRs, the ToRs aggregate through spines, and the GPUs you care about may be 1 hop or 3 hops away depending on which rail/pod they're on.
Vendor matrix at a glance
| Layer | NVIDIA-stack | Open / multi-vendor |
|---|---|---|
| GPU | H100, H200, B100, B200 | AMD MI300X / MI325X, Intel Gaudi 3 |
| Scale-up | NVLink + NVSwitch | UALink (2025+), SUE (Broadcom) |
| RDMA NIC | ConnectX-7 / 8, BlueField-3 | Broadcom Thor / Thor2, Intel E810 / E2100 |
| ToR | Spectrum-4 | Arista 7060X / 7800, Cisco Nexus 9300, Tomahawk-based (Edgecore, Celestica) |
| Reference design | DGX SuperPOD | OCP MGX, Microsoft Maia, Meta Grand Teton |
Most production AI clusters are a mix: NVIDIA GPUs + NVIDIA NICs + open switching (Arista or Tomahawk-based). The "all-NVIDIA" Spectrum-X stack is newer and growing share but isn't dominant yet.
Next: more sections incoming — RDMA & Verbs (how GPUs talk to each other), AI Fabric Architecture (the network shape around them), Building an AI Cluster (deployment). For now, head back to the curriculum index or revisit Transport & CC.