
Inside GPU Anatomy — Hardware

You know why GPUs replaced CPUs (the previous page). Now: open the box.

Two pages. This one is the hardware view — vendors, the dual-tray architecture, every component in the chassis, and how the rack/pod wraps around it. The next page is the same machine viewed from the shell — every command you'd run to verify, triage, or pre-flight a host.

After this page, you'll be able to
  1. Name the GPU vendors and their flagships — NVIDIA, AMD, Intel, Google — and know who's actually shipping in production.
  2. Walk every component of a DGX H100 — GPU, HBM, NVSwitch, ConnectX-7, PCIe, CPUs, cooling — and know what each one does to your fabric.
  3. Articulate the "scale-up vs scale-out" split — NVLink at TB/s inside the box, RoCE at 400 Gbps outside the box — and where the boundary lands.
  4. Recite the 1 NIC per GPU rule — rail-optimized topology lives or dies by this.

Hardware terms (NVLink, NVSwitch, SXM5, HGX, PCIe Gen5, rail-optimized) are introduced inline below. The full glossary lives one page back — From CPU to GPU.


1. The GPU landscape — who makes them

The vendor map in 2026:

NVIDIA

The default everyone copies

  • H100 (2022)
  • H200 (2024)
  • B100 · B200 (2024–25 · Blackwell)
  • GB200 (Blackwell + Grace)
  • B300 (Blackwell Ultra · FY27)

The reference. Any DGX-class server you'll see in production is NVIDIA. NVLink + ConnectX + Spectrum-X is the all-NVIDIA stack.

AMD

The CIO's hedge

  • MI300X (2023)
  • MI325X (2024)
  • MI350 (roadmap)
  • MI400 (roadmap)

Microsoft, Meta, and Oracle have all bought MI300X. Backs UALink as the open alternative to NVLink.

Intel

Niche in Intel-stack shops

  • Gaudi 3 (2024 · Habana)

Habana-acquired silicon. Niche in customers committed to Intel CPUs + accelerators. Limited training-cluster footprint.

Google

The vertical that doesn't sell silicon

  • TPU v5p (2023)
  • Trillium (2024 · TPU v6)

You don't put these in a server you own. ICI is the TPU equivalent of NVLink. Internal Google + GCP customers only.

What this means for a network engineer:

  • Any DGX-class server you'll see in production is NVIDIA H100/H200/B100/B200. AMD MI300X clusters exist but are rarer.
  • The NIC and switch story tracks the same way — NVIDIA ConnectX + Spectrum for the all-NVIDIA stack, Broadcom Thor + Tomahawk or Arista 7060X / 7800 for everything else.
  • Don't assume "NVIDIA GPU = NVIDIA fabric." Most production deployments are NVIDIA GPUs + NVIDIA NICs + open switching (Arista or Tomahawk). The pure Spectrum-X stack is newer and growing share but isn't dominant yet.

2. Inside one GPU server — the DGX H100 reference

The DGX H100 is the reference design. 8× H100 GPUs, 4× NVSwitch chips, 8× ConnectX-7 NICs, 2× Intel Xeon CPUs, 2 TB DDR5, eight 400G ports out the back. Most other reference designs (MGX, Microsoft Maia hosts, Meta Grand Teton) follow the same shape: 4 or 8 GPUs, 1 NIC per GPU, dual CPUs.

The official NVIDIA topology

NVIDIA's own diagram of the exact wiring. Two CPUs at the top, eight GPUs in the middle paired 1:1 with eight ConnectX-7 NICs, and four NVSwitch chips at the bottom forming the all-to-all NVLink mesh between every GPU:

DGX H100/H200 system topology — two CPUs at the top each connected through PCIe switches to four ConnectX-7 NICs and four H100/H200 GPUs (eight GPUs total). The eight GPUs connect to four NVSwitch chips at the bottom via NVLink, forming an all-to-all scale-up mesh. Each CPU also connects to ConnectX-7 network modules, NVMe storage, and 100 GbE management. The legend identifies ConnectX-7 (blue), ConnectX-7 Network Module (orange), NVMe (green), PCIe Switches (purple), NVSwitch (teal), PCIe (purple lines), 100 GbE (magenta), and CPU communication (light blue).

Source: NVIDIA DGX H100/H200 User Guide — System Topology. Diagram © NVIDIA Corporation, used here for educational purposes with attribution.

Component by component — the cluster builder's reference

Everything that goes into one DGX-class node. Use this table when you're sizing power, ordering parts, or sanity-checking what a vendor handed you.

| # | Component | Spec | Role · what to watch |
|---|---|---|---|
| 1 | 🎮 8× H100 SXM5 GPU | 132 SMs × 128 CUDA cores · 700 W cap · SXM5 module on the HGX baseboard | The compute. All 8 must enumerate. ECC errors = 0 at boot. Pinned at P0 during training (no power-state oscillation). |
| 2 | 🔥 HBM3: 8× 80 GB | ~3.35 TB/s per GPU · 640 GB total · co-packaged with the GPU die | Where the model and activations live. Per-GPU memory must read 81 559 MiB free before NCCL starts. |
| 3 | 🌀 4× NVSwitch (3rd-gen, NVLink 4) | Non-blocking crossbar · sits between the two GPU rows | The scale-up fabric. nvidia-smi topo -m must show NV18 between every GPU pair; that is the proof the mesh is whole. Internal only; never touches your Ethernet fabric. |
| 4 | 🔗 NVLink 4.0 | 18 links per GPU · 26.562 GB/s per link per direction · 900 GB/s bidirectional per GPU | The wires inside the GPU tray. Bypasses CPU entirely. The reason a GPU can read another GPU's HBM at memory-bus speeds. |
| 5 | 📡 8× ConnectX-7 RDMA NIC (back-end fabric) | 400 Gbps · RoCE v2 by default · 1 NIC per GPU (rail-optimized) | The back-end / scale-out fabric: the AllReduce path. 8 × 400 = 3.2 Tbps host fan-out. Each NIC sits on the same PCIe root as its paired GPU so GPUDirect can DMA straight into HBM. Lands on a dedicated Compute ToR, never shared with front-end. |
| 6 | 🔌 4× OSFP cage (rear panel) | Each OSFP carries 2× 400 G ConnectX-7 lanes | The physical ports out the back of the chassis. 4 cages × 2 = 8 NICs. Any port not at 400 G means a bad transceiver, dirty fibre, or switch-side speed mismatch: investigate. |
| 7 | 🧠 2× Intel Xeon Platinum 8480C | 56 cores each (112 total) · DDR5 + PCIe Gen5 host controller | Boot, scheduling, ingest, dataset preprocessing. NOT in the AllReduce data path. One CPU per NUMA half; Linux pins each GPU+NIC pair to its socket. |
| 8 | 💾 32× DDR5 DIMM (2 TB total) | 16 DIMMs per socket · DDR5-4800 typical | OS, framework runtime, data loader, checkpoint staging area. The model does not live here; that is HBM's job. |
| 9 | 🚌 PCIe Gen5 switches & lanes | ×16 = 128 GB/s bidirectional per link (~64 GB/s per direction) · 8 PCIe root complexes (one per GPU+NIC pair) | The CPU↔GPU + GPU↔NIC control substrate. lspci -tv must show GPU + paired NIC on the same root. If not, GPUDirect can't shortcut the CPU and AllReduce throughput collapses. |
| 10 | 💿 2× M.2 NVMe (OS / boot) | 1.92 TB each · RAID 1 · NVMe | OS, kernel, drivers, container runtime. Failure path = host evicted from the training pool. Check mdadm health on every boot. |
| 11 | 📀 8× U.2 NVMe (data cache) | 3.84 TB each · RAID 0 · ~30 TB total | Active dataset cache. RAID 0 = any drive loss wipes the whole cache. The read bandwidth feeds the training data loader at GPU pace. |
| 12 | 🌐 Storage NIC (front-end fabric) | 100–200 Gbps · ConnectX silicon at a lower speed grade · 1–2 ports | The front-end fabric for storage I/O. Reads training data, writes checkpoints. Kept off the RoCE back-end so storage bursts can't mix with AllReduce. Lands on a separate front-end ToR / storage VRF. |
| 13 | 🛜 2× dual-port management Ethernet (front-end) | 10–25 Gbps · in-band | The front-end fabric for orchestration: Kubernetes / Slurm, monitoring scrape, log shipping. Not in the RoCE data path. Same front-end ToR as the storage NIC, often a different VLAN. |
| 14 | 🪪 BMC: 1× GbE RJ45 (OOB) | Always-on · Redfish / IPMI · power & sensor control | The out-of-band lifeline. How you reboot a wedged host with nobody on-site. One per chassis; do not lose it on the cabling diagram. Dedicated OOB VLAN, never bridged into the data fabrics. |
| 15 | 6× 3.3 kW PSU | 19.8 kW total · 4+2 redundant · pulls from 2 separate PDUs | Power. Will not POST below 3 PSUs alive, a common first-boot surprise. Plan PDU capacity for sustained load + headroom. |
| 16 | 🌡️ Cooling | 10.2 kW max heat load · air-cooled on DGX H100 · liquid for dense racks | First-class fabric design constraint. B200 / B300 designs are liquid-cooled by default, so your rack PDU + CDU capacity caps the AI hosts per row long before the spine does. |

How to read this table: rows 1–6 are the GPU tray (the engine room). Rows 7–14 are the motherboard tray + management plane. Rows 15–16 are the chassis-level plant that determines how many of these nodes you can fit in a row.

If you're sizing a cluster, the load-bearing numbers are 3.2 Tbps host fan-out (row 5), 128 GB/s bidirectional PCIe Gen5 per GPU (row 9), and 10.2 kW per node (rows 15–16). Everything else is consequence.
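The arithmetic behind those sizing numbers can be sketched in a few lines of Python. This is a rough back-of-envelope helper, not a planning tool: the constants come from the table above, while the function names, the 40 kW rack budget, and the 20% headroom are hypothetical values chosen only for illustration.

```python
# Figures taken from the component table above.
NICS_PER_NODE = 8      # back-end ConnectX-7 NICs (row 5)
NIC_GBPS = 400         # line rate per back-end NIC (row 5)
PSU_KW = 3.3           # rating of one power supply (row 15)
NODE_KW = 10.2         # per-node power / heat figure (rows 15-16)


def host_fanout_tbps(nics=NICS_PER_NODE, gbps=NIC_GBPS):
    """Aggregate back-end bandwidth leaving one host, in Tbps."""
    return nics * gbps / 1000


def psu_capacity_kw(active_psus):
    """Nameplate capacity with the given number of live PSUs."""
    return round(active_psus * PSU_KW, 1)


def nodes_per_rack(rack_budget_kw, headroom=0.2, node_kw=NODE_KW):
    """Whole nodes that fit a rack power budget after reserving headroom.

    rack_budget_kw and headroom are assumptions you supply; they are
    not values from the DGX documentation.
    """
    usable_kw = rack_budget_kw * (1 - headroom)
    return int(usable_kw // node_kw)


if __name__ == "__main__":
    print("host fan-out (Tbps):", host_fanout_tbps())
    print("all 6 PSUs (kW):", psu_capacity_kw(6))
    print("4 PSUs, two lost (kW):", psu_capacity_kw(4))
    print("nodes in a hypothetical 40 kW rack:", nodes_per_rack(40))
```

With two PSUs lost, the remaining four still have more nameplate capacity than the 10.2 kW node figure, which is what 4+2 redundancy is meant to guarantee.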

The dual-tray block diagram

Same machine, viewed as a system block diagram. The 8U chassis splits into a GPU tray (the engine room) and a motherboard tray (the brain). The rear panel is grouped by fabric: power · back-end (the OSFP cages → 8 ConnectX-7s carrying RoCE) · front-end (storage NIC + mgmt Ethernet over TCP/IP) · out-of-band (BMC). NVLink stays inside the GPU tray; PCIe Gen5 crosses between trays; the OSFP cages hand the 8 back-end NICs out to the Compute ToR while the front-end NICs land on a separate switch entirely:

NVIDIA DGX H100 dual-tray block diagram. Top GPU tray (4U): 8 H100 SXM5 GPUs in a single row with HBM3 stacks, and 4 NVSwitches in a band beneath, forming the NVLink 4.0 all-to-all mesh at 900 GB/s per GPU — the scale-up fabric. PCIe Gen5 ×16 bus between trays carries 128 GB/s of North-South traffic. Bottom motherboard tray (4U): 2 Intel Xeon 8480C CPUs with a UPI link, 32 DDR5 DIMMs (16 per socket, 2 TB total), 2× M.2 NVMe drives in RAID 1 for OS, and 8× U.2 NVMe drives in RAID 0 for the dataset cache. Right-side rear panel, grouped by fabric: a Power section with 6× 3.3 kW PSUs in a 4+2 redundant configuration (minimum 3 active to boot). A Back-end Fabric section with 4× OSFP cages fanning out to 8× ConnectX-7 NICs at 400 Gbps each (3.2 Tbps host fan-out, RoCE v2, AllReduce path). A Front-end Fabric section with a Storage NIC (100/200 Gbps for dataset reads and checkpoint writes) and 2× dual-port management Ethernet (10/25 Gbps for Kubernetes/Slurm orchestration, log scrape, monitoring). An Out-of-Band section with the BMC GbE RJ45 (IPMI/Redfish, always-on, separate VLAN). A GPUDirect RDMA arrow connects the GPU tray directly to the back-end OSFP cages, bypassing the CPUs.
The 8U DGX H100 split into its two trays. Rear panel grouped by fabric: power · back-end (RoCE) · front-end (TCP/IP storage + mgmt) · OOB (BMC).

Three flows to read off this diagram:

  • East-West (GPU ↔ GPU): NVLink 4.0 at 900 GB/s. Stays inside the GPU tray. CPUs never touch this traffic. This is the scale-up fabric.
  • North-South (CPU ↔ GPU): PCIe Gen5 ×16 at 128 GB/s bidirectional (about 64 GB/s each way). Crosses between trays. Used for control, OS, scheduling, and dataset staging, not for the AllReduce data path.
  • Outbound (GPU → cluster): GPUDirect RDMA via ConnectX-7. The NIC writes straight to GPU HBM, bypassing the CPU entirely. This is the scale-out fabric — the part of the network you're going to spend the next 50 pages on.
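To compare the three flows fairly, it helps to put them in the same unit. The sketch below converts each figure to GB/s per direction, assuming the NVLink and PCIe numbers are bidirectional totals and the NIC figure is a per-direction line rate in bits. Protocol and encoding overhead are ignored, so treat the results as upper bounds; the variable and function names are illustrative only.

```python
# Headline figures from the flows listed above.
NVLINK_GBYTES_BIDIR = 900   # GB/s per GPU, both directions combined
PCIE_GBYTES_BIDIR = 128     # GB/s for Gen5 x16, both directions combined
NIC_GBITS = 400             # Gbps per direction (Ethernet line rate)


def per_direction(bidir_gbytes):
    """Split a bidirectional total evenly into one direction."""
    return bidir_gbytes / 2


def gbits_to_gbytes(gbits):
    """Convert gigabits per second to gigabytes per second."""
    return gbits / 8


nvlink_dir = per_direction(NVLINK_GBYTES_BIDIR)   # scale-up, per GPU
pcie_dir = per_direction(PCIE_GBYTES_BIDIR)       # CPU <-> GPU, per link
nic_dir = gbits_to_gbytes(NIC_GBITS)              # scale-out, per GPU

# How much faster the in-box path is than the path that leaves the box.
scale_up_vs_scale_out = nvlink_dir / nic_dir

if __name__ == "__main__":
    print(f"NVLink : {nvlink_dir:.0f} GB/s per direction")
    print(f"PCIe   : {pcie_dir:.0f} GB/s per direction")
    print(f"NIC    : {nic_dir:.0f} GB/s per direction")
    print(f"NVLink / NIC ratio: {scale_up_vs_scale_out:.0f}x")
```

The ratio is the practical meaning of the scale-up versus scale-out boundary: traffic that stays on NVLink has roughly an order of magnitude more bandwidth per GPU than traffic that has to cross the NIC.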

Want to see this same machine from the shell? Every component above prints something at the CLI. The CLI walkthrough — nvidia-smi, NVLink mesh, lspci, ibstat, ipmitool, dcgmi, and the pre-flight checks — lives on its own page: Inside GPU Anatomy — CLI Version →.


3. Three fabrics in one server — scale-up, back-end, front-end

A DGX-class server attaches to three completely separate networks at once. Most curriculum content fixates on the back-end (the RoCE fabric carrying AllReduce), but a cluster builder owns all three. They share a chassis. They do not share switches.

| Fabric | What it carries | Where it lives | Hardware in this box |
|---|---|---|---|
| Scale-up | GPU ↔ GPU exchanges at the chip level. AllReduce data movement inside the box. | Entirely internal to the chassis | NVLink 4.0 mesh through 4× NVSwitches · 900 GB/s per GPU |
| Back-end (a.k.a. scale-out or compute fabric) | The portion of the AllReduce that crosses between hosts. RoCE v2. Microsecond latency, zero tolerance for packet loss. | Out the rear panel, onto a dedicated spine-leaf fabric | 8× ConnectX-7 NICs · 400 Gbps each · 1 per GPU (rail-optimized) · 3.2 Tbps host fan-out |
| Front-end (the "everything else" fabric) | Storage I/O (dataset reads, checkpoint writes), Kubernetes / Slurm orchestration, log scrape, monitoring, BMC OOB. Standard TCP/IP. | Out the rear panel, onto a separate DC switch | Storage NICs (100/200 G · ConnectX silicon, lower grade) + 2× mgmt Ethernet (10/25 G) + 1× BMC RJ45 (1 G OOB) |

The rule: the back-end and front-end fabrics never share a switch, a VLAN, or an uplink. The reason is microseconds — a storage burst or a Prometheus scrape mixed onto the RoCE fabric eats headroom that AllReduce was counting on, and tail latency blows up. Vendors enforce this with separate ToRs. The diagram makes that visible:

Three fabrics view of an 8-GPU AI training server. Top: 8 H100 GPUs connect upward into the NVLink mesh through 4 NVSwitches — the scale-up fabric at 900 GB/s per GPU, staying entirely inside the box. Middle-left: each GPU pairs via PCIe Gen5 with one of 8 ConnectX-7 back-end NICs at 400 Gbps each — the back-end / scale-out fabric carrying RoCE v2 AllReduce traffic. Middle-right: a smaller set of front-end NICs (storage at 100 or 200 Gbps, in-band management Ethernet at 10 or 25 Gbps, and the BMC out-of-band RJ45) carries TCP/IP traffic for dataset reads, checkpoint writes, Kubernetes or Slurm orchestration, and out-of-band lights-out access. Outside the dashed server boundary: two separate switches — a Compute ToR aggregating the 8 back-end NICs at 3.2 Tbps host fan-out, and a much smaller Front-end ToR carrying the storage and management traffic. The two fabrics never share switches or VLANs.
Three fabrics, one server. Scale-up stays in the box. Back-end carries the AllReduce. Front-end carries everything else. The Compute ToR and the Front-end ToR are separate switches — never the same VLAN, never the same uplink.

How to read the diagram:

  • Scale-up (amber) — NVLink mesh band at the top. 8 GPUs talk through 4 NVSwitches at 900 GB/s each. No CPU, no Ethernet, no switch outside the chassis. This traffic literally cannot leave the box.
  • Back-end (teal) — 8 ConnectX-7 NICs, one per GPU, paired via PCIe Gen5. Lines drop to a wide Compute ToR that aggregates 3.2 Tbps from this single host. This is where rail-optimized spine-leaf design starts. The RoCE v2 fabric the rest of this curriculum is about.
  • Front-end (indigo) — A smaller set of NICs on the right: storage, mgmt Ethernet, BMC. Drops to a separate, smaller Front-end ToR. Standard TCP/IP. Different switches, different VRF, often different cabling color.
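The "one NIC per GPU, on the same PCIe root" pairing can be checked mechanically once you have an inventory of the host. The sketch below assumes a hypothetical inventory format (a list of dicts with `gpu`, `nic`, `gpu_root`, and `nic_root` keys) that you would build yourself from whatever tooling you use; the device names and root labels are placeholders, not real command output.

```python
from collections import Counter


def check_rail_pairing(pairs, expected_gpus=8):
    """Return a list of human-readable problems; empty means healthy.

    `pairs` is a hypothetical inventory: one dict per GPU naming its
    back-end NIC and the PCIe root each device hangs off.
    """
    problems = []

    if len(pairs) != expected_gpus:
        problems.append(f"expected {expected_gpus} GPUs, found {len(pairs)}")

    # Rule 1: every GPU gets its own NIC, so no NIC may appear twice.
    nic_use = Counter(p["nic"] for p in pairs)
    for nic, count in sorted(nic_use.items()):
        if count > 1:
            problems.append(f"{nic} is shared by {count} GPUs")

    # Rule 2: GPU and NIC must sit under the same PCIe root.
    for p in pairs:
        if p["gpu_root"] != p["nic_root"]:
            problems.append(
                f"GPU {p['gpu']} and {p['nic']} are on different PCIe roots"
            )

    return problems


# Placeholder inventory: 8 GPUs, each paired with its own NIC and root.
healthy = [
    {"gpu": i, "nic": f"nic{i}", "gpu_root": f"root{i}", "nic_root": f"root{i}"}
    for i in range(8)
]

# Same host, but GPU 3's NIC has been cabled under the wrong root.
broken = [dict(p) for p in healthy]
broken[3]["nic_root"] = "root4"

if __name__ == "__main__":
    print(check_rail_pairing(healthy))
    print(check_rail_pairing(broken))
```

Returning a list of problems rather than a boolean keeps the check useful in a pre-flight script, where you want to log every mismatch before pulling a host from the pool.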

The most common cluster-build mistake is co-mingling the two fabrics on the same leaf switch "to save ports." It works in the lab. It collapses under real load. Two physical fabrics, always.
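A cabling plan can be linted for this mistake before anything is racked. The sketch below assumes a hypothetical plan format: one dict per cable, with a `switch` name and a `fabric` label of either "back-end" or "front-end". The switch and port names are made up for the example.

```python
def shared_switches(links):
    """Return the switches that carry both back-end and front-end links.

    `links` is a hypothetical cabling plan: each entry names the switch
    a host port lands on and which fabric that port belongs to.
    """
    fabrics_by_switch = {}
    for link in links:
        fabrics_by_switch.setdefault(link["switch"], set()).add(link["fabric"])

    return sorted(
        switch
        for switch, fabrics in fabrics_by_switch.items()
        if {"back-end", "front-end"} <= fabrics
    )


# Example plan with one violation: the storage NIC was landed on a
# compute leaf "to save ports".
plan = [
    {"port": "cx7-0", "fabric": "back-end", "switch": "compute-tor-1"},
    {"port": "cx7-1", "fabric": "back-end", "switch": "compute-tor-1"},
    {"port": "mgmt-0", "fabric": "front-end", "switch": "frontend-tor-1"},
    {"port": "storage-0", "fabric": "front-end", "switch": "compute-tor-1"},
]

if __name__ == "__main__":
    print("switches mixing fabrics:", shared_switches(plan))
```

An empty result means no switch in the plan serves both fabrics. The same grouping approach extends to VLANs or uplinks if the plan records them.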


4. Beyond the box — rack and pod

A GPU server lives in a rack. Each rack has one or more ToR switches (NVIDIA Spectrum-4, Arista 7060X/7800, Cisco Nexus 9300, Tomahawk-based whiteboxes) aggregating the NICs. Beyond the rack, ToRs connect up through spine switches in either a classic spine-leaf Clos or a rail-optimized pattern where each GPU rail has its own dedicated leaf+spine pair (the dominant pattern for 10K+ GPU clusters).

A pod is a self-contained training fabric, typically 256–2,048 GPUs. Larger jobs stitch pods together across super-spines — a third tier. Topology + sizing is covered in depth in AI Fabric Architecture.
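To get a feel for where pod sizes like these come from, here is a rough sizing sketch for a rail-optimized leaf layer. It rests on assumptions that are not stated in this page: 64-port 400G leaf switches, half the ports facing hosts and half facing spines (no oversubscription), and one leaf per rail in each group of nodes. Real designs vary, so read the output as an illustration of the arithmetic only.

```python
import math


def rail_pod_size(gpus, gpus_per_node=8, leaf_ports=64):
    """Estimate leaf-layer size for a rail-optimized pod.

    Assumptions (hypothetical, adjust to your hardware):
      - one back-end NIC per GPU, so gpus_per_node rails
      - each leaf serves a single rail
      - half of each leaf's ports face hosts, half face spines
    """
    down_ports = leaf_ports // 2                 # host-facing ports per leaf
    up_ports = leaf_ports - down_ports           # spine-facing ports per leaf

    nodes = math.ceil(gpus / gpus_per_node)
    # One group = one leaf per rail, serving up to `down_ports` nodes.
    leaf_groups = math.ceil(nodes / down_ports)
    leaves = leaf_groups * gpus_per_node

    return {
        "nodes": nodes,
        "leaf_groups": leaf_groups,
        "leaves": leaves,
        "uplinks": leaves * up_ports,
    }


if __name__ == "__main__":
    for size in (256, 1024, 2048):
        print(size, rail_pod_size(size))
```

Under these assumptions a single group of eight leaves covers 256 GPUs, and 2,048 GPUs need 64 leaves, which is as many leaves as one 64-port spine can reach with a single link each. Past that point a third tier is needed.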


💡 What you should remember

🏭 NVIDIA dominant, AMD second: H100 / H200 / B100 / B200 on top. MI300X is the credible second source.
📦 Two trays, one server: GPU tray (8 GPUs + NVSwitches + HBM) sits above the motherboard tray (CPUs, NICs, BMC, PSUs). Know which tray every component lives on.
🌀 NVLink stays inside; RDMA goes outside: 900 GB/s NVLink mesh between GPUs in the box; 400 Gbps RoCE NIC per GPU to the fabric.
🧵 Three fabrics, one server: Scale-up (NVLink, in-box) · back-end (RoCE v2 from the 8 ConnectX-7s, the AllReduce path) · front-end (storage + mgmt + BMC over TCP/IP). Back-end and front-end use separate switches and never share a VLAN.
🎯 1 NIC per GPU is the rule: Back-end rail-optimized topology lives or dies by this. 8 GPUs = 8 ConnectX-7 NICs, each on the matching NUMA half. Front-end NICs are 1–2 separate cards, on a different ToR.
🪪 BMC is the lifeline; PSUs vote: OOB BMC lets you recover a dead node without a truck roll. 4+2 PSU redundancy: you need at least 3 of 6 PSUs alive or the node won't boot.
🔌 CPUs aren't in the data path: They boot, schedule, and stage. The AllReduce never goes through the kernel; RDMA + GPUDirect bypass them.
🔥 One DGX H100 = 10.2 kW: Per server. Liquid cooling becomes mandatory at B200 scale. Power and cooling are now first-class fabric design constraints.

Next: Inside GPU Anatomy — CLI Version → — the same machine viewed from the shell. nvidia-smi, NVLink mesh, lspci, ibstat, ipmitool, dcgmi — every component, one command at a time.