Inside a GPU Server
You've worked with servers your whole career, but an AI training server is built differently: 4–8 GPUs on a baseboard, a terabytes-per-second fabric connecting them to each other, and 1–8 RDMA NICs exposing them to the rest of the cluster. Here's what's in the box, what each piece does, and which vendors you'll see.
Inside a GPU server
A modern AI training server packages 4–8 GPUs on a baseboard (HGX / MGX / OAM), connects them to each other through a scale-up fabric, and exposes them to the rest of the cluster through one or more RDMA NICs.
GPUs — the compute
The current training-class GPUs:
- NVIDIA — H100 (Hopper, 2022), H200 (more HBM3e, 2024), B100 / B200 (Blackwell, 2024–25), GB200 (Blackwell + Grace CPU)
- AMD — MI300X (CDNA 3, 2023), MI325X (2024), MI350 / MI400 (roadmap)
- Intel — Gaudi 3 (2024)
- Google — TPU v5p, Trillium (cloud-only, you don't put these in a server you own)
NVIDIA is the dominant vendor in 2026 by a wide margin. AMD's MI300X is the credible second source; Microsoft, Meta, and Oracle have all bought it. Gaudi 3 trails both.
NVLink, NVSwitch — the scale-up fabric
Inside the box, every GPU talks to every other GPU at TB/s. NVLink is the wire; NVSwitch is the silicon that makes it a fabric (not just point-to-point links).
- NVLink (NVIDIA, proprietary) — 1.8 TB/s per GPU on B100 / B200
- UALink (AMD, Broadcom, Cisco, Google, HPE, Intel, Meta, Microsoft) — open alternative to NVLink; the v1.0 spec was published in 2025
- SUE (Broadcom Scale-Up Ethernet) — open, simpler than UET (Ultra Ethernet Transport); ≤1.6 Tbps, ~100 ns device latency
- ICI (Google) — TPU-only
A GB200 NVL72 puts 72 GPUs on a single NVLink Switch fabric — that's one logical machine over scale-up. Scale-out (RDMA) only takes over at the rack boundary.
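A quick back-of-the-envelope shows why that boundary matters. This is a sketch using the headline figures above, assuming one 800G NIC per GPU:

```python
# GB200 NVL72: bandwidth inside the scale-up domain vs. out to the rest of the cluster.
# NVLink figures are totals across both directions, so the NIC figure is doubled to match.
gpus = 72
nvlink_tb_per_gpu = 1.8                     # TB/s, NVLink 5
nic_tb_per_gpu    = 2 * 800 / 8 / 1000      # 800 Gbit/s each way -> 0.2 TB/s

print(f"scale-up  : {gpus * nvlink_tb_per_gpu:.1f} TB/s inside the rack")     # ~129.6
print(f"scale-out : {gpus * nic_tb_per_gpu:.1f} TB/s out of the rack")        # ~14.4
print(f"gap       : ~{nvlink_tb_per_gpu / nic_tb_per_gpu:.0f}x per GPU")      # ~9x
```

The order-of-magnitude gap is the whole point of the scale-up domain: collectives that stay inside it never touch the RDMA fabric.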
RDMA NICs — the scale-out fabric starts here
NICs connect the server to the rest of the cluster. Production AI servers typically have 1 NIC per GPU ("rail-optimized") — so 8 GPUs = 8 NICs.
- NVIDIA ConnectX-7 (400G), ConnectX-8 (800G, 2024)
- NVIDIA BlueField-3 — DPU (NIC + ARM cores + offload engines; runs the network stack on the card)
- Broadcom Thor (400G), Thor2 (800G)
- Intel E810 (200G), E2100 IPU (400G; co-developed with Google to run Falcon)
These run RoCE v2 by default — the transport this curriculum picked back in Section 1.
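To picture the one-NIC-per-GPU layout, here's a toy sketch; the `nic0`-style names are illustrative, and real systems derive the pairing from PCIe topology (NCCL picks the NIC nearest each GPU):

```python
# Toy rail-optimized server: GPU i sends its cross-server traffic out NIC i ("rail i").
# The gpu->nic pairing below is illustrative; real servers derive it from PCIe topology.
NUM_GPUS = 8

def nic_for_gpu(gpu: int) -> str:
    return f"nic{gpu}"          # one NIC per GPU, same index = same rail

for gpu in range(NUM_GPUS):
    print(f"gpu{gpu} -> {nic_for_gpu(gpu)} -> rail {gpu}")
```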
CPUs, memory, storage NICs
The non-GPU pieces, briefly:
- CPUs — dual AMD EPYC (Genoa, Bergamo, Turin) or Intel Xeon (Sapphire Rapids, Emerald Rapids, Granite Rapids)
- RAM — 1–4 TB per server
- Storage NICs — sometimes separate NICs (100G/200G) for the storage path (reading training data, writing checkpoints) to keep the training fabric clean. Often the same ConnectX silicon but a smaller speed grade.
Worked example — NVIDIA DGX H100
The reference server. 8× H100 GPUs (SXM5) on a baseboard, 4× NVSwitch chips wiring them in an all-to-all NVLink mesh, 8× ConnectX-7 NICs (one per GPU, rail-optimized), 2× Intel Xeon Platinum CPUs, 2 TB DDR5. Eight 400G ports out the back into the fabric.
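A quick sanity check on what those ports add up to (one direction only):

```python
# DGX H100 scale-out capacity: eight ConnectX-7 NICs at 400 Gbit/s each.
nics, gbit_per_nic = 8, 400
total_gbit = nics * gbit_per_nic
print(f"{total_gbit} Gbit/s into the fabric ({total_gbit // 8} GB/s), one 400G port per GPU")
# -> 3200 Gbit/s = 400 GB/s; the NVLink fabric inside moves 900 GB/s per GPU
```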
Most non-NVIDIA reference designs (Microsoft Maia, Meta MTIA hosts, OCP MGX) follow the same shape: 4 or 8 GPUs, 1 NIC per GPU, dual CPUs.
The rack — top-of-rack (ToR) switches
Servers don't talk to the fabric directly. Each rack has one or more ToR switches that aggregate the NICs.
Common ToR vendors for AI fabrics:
- NVIDIA Spectrum-4 — purpose-built AI-Ethernet ASIC, anchors the Spectrum-X reference design (Spectrum + ConnectX + BlueField, all NVIDIA)
- Arista 7060X / 7800 — workhorse AI-fabric switches (Meta, Oracle, multiple AI clouds)
- Cisco Nexus 9300 / 9400 — Cisco's AI infrastructure line
- Broadcom Tomahawk-based — Tomahawk 4/5 silicon in white-box switches from Edgecore and Celestica, and in Meta's open OCP designs (Wedge)
- Juniper QFX — less common in AI deployments but present in some clouds
Port speeds in 2026 are mostly 400G or 800G, with 1.6T on the near horizon. A single 64×400G ToR has enough radix for 32 server-facing ports plus 32 uplinks.
Oversubscription — most AI fabrics are 1:1 (uplink bandwidth = downlink bandwidth). Compare to traditional DC where 4:1 or higher is normal. AI fabrics can't tolerate the head-of-line blocking that oversubscription creates under collective bursts.
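A sketch of the radix math behind those numbers (the 64-port figure is from the text; the split itself is just arithmetic):

```python
# How a 64-port 400G ToR is carved up at different oversubscription ratios.
def tor_split(ports: int = 64, oversub: float = 1.0):
    """Return (server-facing ports, uplink ports) for a given oversubscription ratio."""
    down = int(ports * oversub / (oversub + 1))
    up = ports - down
    return down, up

for ratio in (1.0, 4.0):
    down, up = tor_split(64, ratio)
    print(f"{ratio:.0f}:1 -> {down} server ports, {up} uplinks")
# 1:1 -> 32 server ports, 32 uplinks: the 32 servers quoted above
# 4:1 -> 51 server ports, 13 uplinks: more servers per ToR, but collective bursts congest the uplinks
```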
The fabric — spine-leaf and rails
Beyond the rack, multiple ToRs aggregate up through spine switches. Two common patterns:
Spine-leaf (classic Clos) — every leaf connects to every spine. Each GPU NIC can reach any other GPU through one spine hop. Simple, well-understood, works for thousands of GPUs.
Rail-optimized — each GPU rail (GPU 0, GPU 1, … GPU 7) has its own dedicated leaf + spine pair. GPU 0 on every server connects to "rail 0 leaf," GPU 1 to "rail 1 leaf," and so on. Less cross-rail traffic, smaller blast radius per rail. The dominant pattern for 10K+ GPU clusters.
Pod sizing — a "pod" is a self-contained training fabric, typically 256–2048 GPUs. Larger jobs stitch pods together across super-spines (a third tier).
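Where the upper end of that range comes from: a non-blocking two-tier leaf-spine fabric built from radix-R switches tops out at R²/2 endpoints. A sketch, assuming identical switches at both tiers:

```python
# Max GPUs in a non-blocking two-tier (leaf-spine) pod built from radix-R switches.
# Each leaf: R/2 ports down to GPU NICs, R/2 ports up to spines.
# Each spine connects to every leaf, so the leaf count is capped at R.
def max_pod_gpus(radix: int) -> int:
    leaves = radix
    gpus_per_leaf = radix // 2
    return leaves * gpus_per_leaf

for radix in (32, 64, 128):
    print(f"{radix}-port switches -> up to {max_pod_gpus(radix)} GPUs per pod")
# 64-port 400G/800G switches give the familiar 2048-GPU pod; bigger jobs add a third tier
```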
We'll do topology + sizing in its own section. For now: a GPU server lives in a rack, the rack has ToRs, the ToRs aggregate through spines, and the GPUs you care about may be 1 hop or 3 hops away depending on which rail/pod they're on.
Vendor matrix at a glance
| Layer | NVIDIA-stack | Open / multi-vendor |
|---|---|---|
| GPU | H100, H200, B100, B200 | AMD MI300X / MI325X, Intel Gaudi 3 |
| Scale-up | NVLink + NVSwitch | UALink (2025+), SUE (Broadcom) |
| RDMA NIC | ConnectX-7 / 8, BlueField-3 | Broadcom Thor / Thor2, Intel E810 / E2100 |
| ToR | Spectrum-4 | Arista 7060X / 7800, Cisco Nexus 9300, Tomahawk-based (Edgecore, Celestica) |
| Reference design | DGX SuperPOD | OCP MGX, Microsoft Maia, Meta Grand Teton |
Most production AI clusters are a mix: NVIDIA GPUs + NVIDIA NICs + open switching (Arista or Tomahawk-based). The "all-NVIDIA" Spectrum-X stack is newer and growing share but isn't dominant yet.
Next: more sections incoming — RDMA & Verbs (how GPUs talk to each other), AI Fabric Architecture (the network shape around them), Building an AI Cluster (deployment). For now, head back to the curriculum index or revisit Transport & CC.