Inside GPU Anatomy — CLI Version

The previous page showed what's in the DGX H100 box. This page is what every one of those components looks like from the shell.

Seven commands, seven layers — from the GPU array down to the BMC. Every output below is what you'd see on a real DGX-class host. This is the install-time and triage-time toolkit. Bookmark it.

After this page, you'll be able to:

Enumerate the box — list the 8 H100s, the 4 NVSwitches, the 8 ConnectX-7s, the 32 DIMMs, the 10 NVMe drives, and the 6 PSUs — each from its native CLI tool.
Read nvidia-smi topo -m as the rail map NCCL uses to lay down AllReduce rings.
Verify GPUDirect-RDMA prerequisites — same PCIe root for GPU + paired NIC, same NUMA node, HugePages allocated, ulimit -l unlimited.
Pre-flight a host before training touches it — DCGM diag, firmware uniformity check, BMC reachability, power telemetry, NUMA affinity sweep.
Triage a sick node from the shell alone — without a vendor support call.

The hardware terms (NVLink, NVSwitch, ConnectX-7, SXM5, HGX, BMC, PCIe Gen5) were introduced one page back. The glossary lives there — From CPU to GPU.

1. The GPU array

The first thing you do on any AI host. nvidia-smi enumerates every accelerator the kernel sees, what driver and CUDA version it's running, the HBM usage per GPU, and the power draw.

MODULE gpu-server · LAB 1. Watch the recording: every command, every counter, every output.

What to spot:

Eight rows — one per H100. PCIe BDFs 1B / 43 / 51 / 9B / C2 / CA / DA / E2 are the H100s — same address pattern you'll see in lspci later.
81 559 MiB per GPU: that's 80 GB of HBM3 stacked on the GPU package, right next to the die. The model and the activations live here, not in DDR.
700 W cap, ~120 W idle — SXM5 H100 is a 700 W part. Liquid-cooled servers run closer to the cap under training load; air-cooled DGX H100 caps around 600–650 W under sustained AllReduce.
P0 performance state — GPUs always pinned at max clocks for training. No power-state oscillation; that's a CPU server pattern.

If nvidia-smi returns fewer than 8 GPUs, you have a missing device, usually a PCIe re-train glitch or a bad SXM seat. That's a dead host for training; NCCL will not silently route around a missing rank.
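The count check is easy to script. Below is a minimal sketch, assuming the inventory comes from `nvidia-smi --query-gpu=index,pci.bus_id,memory.total --format=csv,noheader,nounits`. The function name, the memory threshold, and the sample rows are illustrative, not output captured from the lab.

```python
EXPECTED_GPUS = 8        # one row per H100 on this class of host
MIN_HBM_MIB = 80_000     # assumption: anything well below ~81,559 MiB is suspect


def check_gpu_inventory(csv_text, expected=EXPECTED_GPUS, min_mib=MIN_HBM_MIB):
    """Return a list of problems found in the CSV rows (empty list = healthy)."""
    problems = []
    rows = [line.split(",") for line in csv_text.strip().splitlines() if line.strip()]
    if len(rows) != expected:
        problems.append(f"expected {expected} GPUs, found {len(rows)}")
    for row in rows:
        index, bus_id, mem_mib = (field.strip() for field in row)
        if int(mem_mib) < min_mib:
            problems.append(f"GPU {index} ({bus_id}) reports only {mem_mib} MiB")
    return problems


# Hypothetical sample: seven rows, i.e. one GPU has fallen off the bus.
sample = "\n".join(
    f"{i}, 00000000:{bus}:00.0, 81559"
    for i, bus in enumerate(["1B", "43", "51", "9B", "C2", "CA", "DA"])
)
print(check_gpu_inventory(sample))   # ['expected 8 GPUs, found 7']
```

An empty list is the only result that should let the host move on to the next check.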

2. The NVLink scale-up mesh

The 8 GPUs in the box talk to each other over NVLink — a separate fabric from your Ethernet, hidden inside the chassis. Two commands show it:

MODULE gpu-server · LAB 2. Watch the recording: every command, every counter, every output.

What to spot:

18 links × 26.562 GB/s = ~478 GB/s per direction, or ~956 GB/s of raw signaling in both directions per GPU. The "900 GB/s NVLink" you see in NVIDIA marketing is the nominal form of the same number (18 links × 50 GB/s bidirectional). It's per-GPU aggregate, not per-link.
nvidia-smi topo -m is the rail map. Every NV18 cell means "GPU X reaches GPU Y across 18 NVLinks" — a full all-to-all mesh through the 4 NVSwitches. No GPU pair is more than one switch hop away.
PIX between a GPU and its paired NIC = same PCIe switch. That's the rail-optimized pairing: GPU 0 ↔ NIC 0, GPU 1 ↔ NIC 1, etc.
NODE vs SYS for GPU↔NIC = same NUMA half vs across the UPI link. NCCL prefers PIX > NODE > SYS and will not cross SYS for the AllReduce ring if it can help it.
CPU affinity column splits 0-55 vs 56-111 — that's the 2-socket split. GPUs 0–3 belong to socket 0; GPUs 4–7 belong to socket 1.
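The NVLink arithmetic from the first bullet, worked through as a quick sketch. The 26.562 GB/s per-link figure is the one quoted above; the 50 GB/s nominal bidirectional rate per link is the figure behind NVIDIA's 900 GB/s headline. Variable names are illustrative.

```python
LINKS_PER_GPU = 18
RAW_GB_PER_S_PER_DIRECTION = 26.562   # per link, as reported on this host
NOMINAL_GB_PER_S_BIDIR = 50           # per link, the headline NVLink figure

one_way = LINKS_PER_GPU * RAW_GB_PER_S_PER_DIRECTION
both_ways = 2 * one_way
nominal = LINKS_PER_GPU * NOMINAL_GB_PER_S_BIDIR

print(f"raw, one direction:   {one_way:.1f} GB/s")    # 478.1
print(f"raw, both directions: {both_ways:.1f} GB/s")  # 956.2
print(f"nominal headline:     {nominal} GB/s")        # 900
```

Either way, the number describes one GPU's aggregate across all 18 links, not a single link.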

This matrix is the input NCCL reads to lay down rings and trees. If the matrix is wrong (BIOS misconfig, a missing PCIe switch), NCCL builds a slow topology and AllReduce throughput collapses. Always check this before declaring a host healthy.
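One way to turn that check into a gate is to rank each GPU-to-NIC cell and flag anything farther away than PIX. This is a sketch only: the ranking follows the legend that `nvidia-smi topo -m` prints under the matrix, while the function name and the input dictionary (which you would fill in from the matrix yourself) are assumptions.

```python
# Closest to farthest, following the legend printed by `nvidia-smi topo -m`.
PATH_RANK = {"PIX": 0, "PXB": 1, "PHB": 2, "NODE": 3, "SYS": 4}


def misplaced_rails(gpu_to_nic_path):
    """Return {gpu: label} for every GPU whose paired NIC is farther away than PIX."""
    return {
        gpu: label
        for gpu, label in gpu_to_nic_path.items()
        if PATH_RANK[label] > PATH_RANK["PIX"]
    }


# Hypothetical reading of the matrix: GPU5 has lost its local pairing.
paths = {f"GPU{i}": "PIX" for i in range(8)}
paths["GPU5"] = "SYS"
print(misplaced_rails(paths))   # {'GPU5': 'SYS'}
```

An empty dictionary means every rail pair sits behind the same PCIe switch, which is the layout the rest of this page assumes.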

3. The ConnectX-7 RDMA NICs (back-end fabric)

These are the back-end NICs — the 8 ConnectX-7s, one per GPU, carrying the RoCE v2 AllReduce traffic. They are distinct from the front-end NICs (storage NIC + 2× mgmt Ethernet, see the hardware page) which live on a separate switch and never carry AllReduce. lspci finds the back-end NICs; ibstat shows the RDMA-stack view.

MODULE gpu-server · LAB 3. Watch the recording: every command, every counter, every output.

What to spot:

8 lines of "MT2910 Family [ConnectX-7]" — the BDFs (1B / 43 / 51 / 9B / C2 / CA / DA / E2) match the GPU PCIe roots. That's not a coincidence — each NIC is wired to the same PCIe switch as the GPU it serves, so GPUDirect RDMA can DMA straight from NIC to HBM with zero CPU touch.
mlx5_0 … mlx5_7 — the Linux RDMA stack names. NCCL and PyTorch find NICs by these names, not by eth*.
State: Active Physical state: LinkUp Rate: 400 — every port up, every port at 400 Gbps. Any port not at 400 is a fabric problem — bad transceiver, dirty fiber, switch-side speed mismatch. Same drill as 100G troubleshooting, just higher stakes.
Link layer: Ethernet — these NICs are running RoCE v2, not InfiniBand. Same silicon, different transport. The full curriculum is about this transport.
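The same three fields ibstat prints are also exposed as text files under `/sys/class/infiniband/<device>/ports/<port>/` (`state`, `phys_state`, `rate`), which makes the port check scriptable. A minimal sketch, assuming that sysfs layout; the function names and the sample strings are illustrative.

```python
from pathlib import Path


def parse_rate_gbps(rate_text):
    """'400 Gb/sec (4X NDR)' -> 400.0"""
    return float(rate_text.split()[0])


def port_is_healthy(state_text, phys_text, rate_text, expected_gbps=400):
    """True when the port is ACTIVE, LinkUp, and running at the expected rate."""
    return (
        state_text.strip().endswith("ACTIVE")
        and phys_text.strip().endswith("LinkUp")
        and parse_rate_gbps(rate_text) >= expected_gbps
    )


def read_port(device, port=1, root="/sys/class/infiniband"):
    """Read the three sysfs attributes for one port (needs the RDMA stack loaded)."""
    base = Path(root) / device / "ports" / str(port)
    return tuple((base / name).read_text() for name in ("state", "phys_state", "rate"))


# Illustrative strings in the sysfs format, not captured from the lab host.
print(port_is_healthy("4: ACTIVE\n", "5: LinkUp\n", "400 Gb/sec (4X NDR)\n"))   # True
print(port_is_healthy("4: ACTIVE\n", "5: LinkUp\n", "200 Gb/sec (2X NDR)\n"))   # False
```

On a live host, `port_is_healthy(*read_port("mlx5_0"))` would run the check against the first NIC.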

The fan-out: 8 NICs × 400 Gbps = 3.2 Tbps out of every training host. A 1,024-GPU cluster is 128 of these servers = ~410 Tbps of host-side bandwidth. That's the number your spine has to absorb.
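The fan-out arithmetic, as a small sketch you can re-run for other cluster sizes. Function names are illustrative; the defaults are the 8 × 400 Gbps figures from this page.

```python
def host_egress_tbps(nics=8, gbps_per_nic=400):
    """Back-end bandwidth leaving one training host, in Tbps."""
    return nics * gbps_per_nic / 1000


def cluster_egress_tbps(total_gpus, gpus_per_host=8):
    """Return (host_count, aggregate host-side bandwidth in Tbps)."""
    hosts, remainder = divmod(total_gpus, gpus_per_host)
    if remainder:
        raise ValueError("GPU count must be a whole number of hosts")
    return hosts, hosts * host_egress_tbps()


print(host_egress_tbps())          # 3.2
print(cluster_egress_tbps(1024))   # (128, 409.6)
```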

4. PCIe & NUMA layout

The view of how the PCIe tree and the two CPU sockets carve the box into halves.

MODULE gpu-server · LAB 4. Watch the recording: every command, every counter, every output.

What to spot:

Eight PCIe root complexes (0000:e0, 0000:d0, 0000:c0, 0000:90, 0000:50, 0000:40, 0000:10, plus internal) — each root anchors one or two GPU+NIC pairs. This is the rail. Cut any one root and you've taken out one rail of the cluster.
GPU + NIC on the same root — 01.0 is the NIC, 01.1 is the GPU, on the same bridge. That's the GPUDirect path: NIC writes to GPU HBM without ever touching CPU memory.
available: 2 nodes — two CPU sockets, two NUMA nodes. Node 0 has cores 0–55 + 112–167 (hyperthreads); node 1 has 56–111 + 168–223. That's 56 cores per socket × 2 threads.
node distances: 10 to self, 21 cross-socket. The 21 is what SYS means in the nvidia-smi topo -m matrix: a cross-socket UPI hop. It's why GPUs 0–3 should never speak directly to GPUs 4–7 over PCIe: that path goes through the UPI link and adds latency. NVLink bypasses this entirely.
2 TB total DDR5 — a lot of RAM, but none of it is in the training data path. It holds the OS, the framework runtime, the dataset preprocessing buffer, and the checkpoint stage area. The model lives in HBM.

This view is the closest you can get to the physical wiring without opening the chassis. Confirms the topology diagram one page back is real: 2 sockets × 4 GPUs × (1 NIC + 1 PCIe bridge each).
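The core ranges are worth checking by machine rather than by eye. The kernel publishes each node's CPUs in the same range notation at `/sys/devices/system/node/node<N>/cpulist`. A minimal sketch that expands the two ranges quoted above; the function name is illustrative.

```python
def expand_cpulist(cpulist):
    """Expand a kernel CPU list such as '0-55,112-167' into a sorted list of CPU ids."""
    cpus = []
    for chunk in cpulist.strip().split(","):
        first, _, last = chunk.partition("-")
        cpus.extend(range(int(first), int(last or first) + 1))
    return sorted(cpus)


node0 = expand_cpulist("0-55,112-167")
node1 = expand_cpulist("56-111,168-223")
print(len(node0), len(node1))          # 112 112
print(set(node0).isdisjoint(node1))    # True
```

112 logical CPUs per node is the 56 cores × 2 threads from the numactl output, and the two sets never overlap.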

5. Memory, storage & firmware

The last three pieces a cluster builder touches at install time — DIMMs, NVMe, and NIC firmware. Network engineers don't think about these day to day, but every fresh rack starts with a triage pass on all three. Mixed DIMM speeds throttle the whole socket, a missing RAID array silently bricks the host on first failure, and firmware drift across the 8 ConnectX-7s is the single most common cause of a node that "passes burn-in but won't AllReduce."

MODULE gpu-server · LAB 5. Watch the recording: every command, every counter, every output.

What to spot:

32 × 64 GB DDR5-4800, all Micron — one uniform line in the uniq -c output. That's 2 TB total, all DIMMs at the same speed. If you see two speed lines (e.g. some at 4800, some at 4400), the BIOS has down-clocked the whole memory subsystem to the slowest stick. Same failure pattern as a mixed-speed SFP in a port channel.
2× Samsung 1.7T M.2 + 8× KIOXIA 3.5T U.2 — boot drives and data drives are different vendors and different form factors on purpose. TRAN=nvme on all ten — no SATA, no rotational anywhere in the box.
md0 = RAID 1 over the two M.2s; md1 = RAID 0 over the eight U.2s — boot survives a single drive loss, dataset cache does not. Any U.2 fault evicts the host from the training pool until the array is rebuilt.
ConnectX-7 firmware 28.39.2048 on mlx5_0 — and you want this exact same string on mlx5_1 through mlx5_7. mlxfwmanager --query walks all 8 NICs; eyeball the FW Version column for drift.
"No matching image found" — counterintuitive but healthy. It means the tool's bundled image is not newer than what's already flashed. A real problem reads "update required" or shows a higher PSID version.

The "so what" for cluster building: pin NIC firmware in your install playbooks and re-verify on every boot. A single ConnectX-7 that drifted a minor version after an RMA swap will train fine on a single host and quietly degrade AllReduce across the whole pod — RoCE's silent killer. And remember the RAID 0 trap on md1: it's a performance choice, not a durability one, and your runbook for "U.2 SMART warning" should be drain-the-host, not wait-and-see.
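A firmware pin can be enforced with a few lines. This sketch assumes you have already collected one version string per NIC, for example from the FW Version column or from `/sys/class/infiniband/<device>/fw_ver`. The function name is illustrative, and the drifted version string is made up for the example.

```python
PINNED_FW = "28.39.2048"   # the version shown on mlx5_0 above


def firmware_drift(fw_by_device, pinned=PINNED_FW):
    """Return {device: version} for every NIC that is not on the pinned firmware."""
    return {dev: fw for dev, fw in fw_by_device.items() if fw != pinned}


fleet = {f"mlx5_{i}": PINNED_FW for i in range(8)}
fleet["mlx5_3"] = "28.38.1002"   # hypothetical post-RMA card on older firmware
print(firmware_drift(fleet))     # {'mlx5_3': '28.38.1002'}
```

Comparing against an explicit pin, rather than against the majority on the host, also catches the case where all 8 NICs drifted together.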

6. BMC, power & sensors

The BMC (Baseboard Management Controller) is the always-on chip on the motherboard that lives even when the host is "off". It is the iLO/iDRAC/Redfish equivalent on this Supermicro-built DGX-class host. It's how you reboot a wedged host with nobody on-site, and it's the only thing reporting power and thermals when the OS has gone silent. ipmitool is the lowest-common-denominator way to talk to it.

MODULE gpu-server · LAB 6. Watch the recording: every command, every counter, every output.

What to spot:

BMC IP 10.42.18.204, static, MAC 3c:ec:ef:9d:14:7a — this is the OOB lifeline. Lives on a dedicated management subnet, completely off the RoCE data fabric. If you can't reach this address, you can't recover the host remotely; treat it as a P2 the moment it stops responding.
Instantaneous 6248 W, avg 6201 W, range 5912–6814 W — chassis under sustained training load. The 19.8 kW nameplate (6× 3.3 kW PSU) is the ceiling; ~6.2 kW is in-spec for 8× H100 at AllReduce. Spikes to 6.8 kW are the GPU power-state transients you size PDU headroom for.
6 PSUs at ~1040 W each — load is sharing evenly across the 4+2 redundant bank. Uneven sharing (one PSU at 1500 W, another at 600 W) means a failing module or a flipped PDU breaker — investigate before it cascades.
Inlet 22 °C / exhaust 39 °C, GPU 42–61 °C, fans ~6700–6800 RPM — thermal envelope healthy. 17 °C delta-T across the chassis is normal for air-cooled H100 under load; if inlet creeps past 27 °C you're fighting the row CRAC, not the server.
All 6 PSUs reporting ok: redundancy intact. Loss of 1 PSU = warning (still 5 alive, headroom thin). Loss of 2 = critical (4 PSUs = 13.2 kW, still above the load draw, but the 4+2 bank has no spare left). Loss of 3 = host dies mid-training, the run is gone, and NCCL will not silently route around it.

For cluster building, three rules fall out of this output. BMC IPs belong on a dedicated OOB VLAN/subnet with its own switches and its own uplink — never bridged into the data fabric, never reachable from tenant networks. Power telemetry from dcmi power reading is what feeds rack-level capacity planning — sum the chassis averages, add 15% headroom for transients, and that's your PDU sizing. And loss of PSU redundancy is a P1 ticket the moment it happens, not the moment the host dies — because the next PSU loss takes the host with it, and you lose the training run on the way down.
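The PSU ladder and the PDU sizing rule both reduce to a few lines of arithmetic. A sketch under the 4+2 assumption stated above; the function names and the four-host rack are hypothetical.

```python
def psu_redundancy(ok_psus, required=4):
    """Classify a 4+2 bank by how many PSUs still report ok."""
    spare = ok_psus - required
    if spare >= 2:
        return "ok"
    if spare == 1:
        return "warning"
    if spare == 0:
        return "critical"
    return "down"


def pdu_budget_watts(chassis_avg_watts, headroom=0.15):
    """Sum the chassis averages and add headroom for transients."""
    return sum(chassis_avg_watts) * (1 + headroom)


print([psu_redundancy(n) for n in (6, 5, 4, 3)])   # ['ok', 'warning', 'critical', 'down']

rack = [6201] * 4   # hypothetical rack: four hosts at the average quoted above
print(round(pdu_budget_watts(rack)))               # 28525
```

Alert on "warning", not on "down": by the time the classifier returns "down" the training run is already gone.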

7. Pre-flight diagnostics

Before a host joins a training cluster, run the same kind of pre-turn-up checks you'd run on a new uplink — except instead of show interface and LLDP neighbor, you're verifying the GPU subsystem, the hugepage pool, the locked-memory limit, and the NUMA wiring. These are the gates between "the hardware powered on" and "this host can absorb a 400G AllReduce ring without quietly tanking the whole job."

MODULE gpu-server · LAB 7. Watch the recording: every command, every counter, every output.

What to spot:

dcgmi diag -r 2 is the canonical pre-flight. Every line — Denylist, NVML, CUDA, Permissions, Persistence Mode, Env, Page Retirement, Graphics Procs, Inforom, PCIe, GPU Memory, Diagnostic — must read Pass. A single Fail = do not put this host into rotation. This is your show interface for the GPU plane.
HugePages must be pre-allocated. HugePages_Total: 8192 × 2 MB = 16 GB of 2 MB hugepages locked into the pool. NCCL and the RDMA stack pin from here. Zero hugepages = RDMA pin falls back to 4 KB pages and effective bandwidth collapses — often by 30–50% — with nothing in dmesg to tell you why.
ulimit -l unlimited is the locked-memory limit. RDMA registers memory regions that must be pinned and non-swappable. Anything other than unlimited and you'll see RDMA reg_mr failed the instant a real job starts — usually long after the host has been declared healthy.
NUMA affinity script confirms the §4 wiring promise: GPU 0 + mlx5_0 on NUMA 0, GPU 4 + mlx5_4 on NUMA 1, and so on for all 8 pairs. If any GPU/NIC pair lands on different NUMA nodes, GPUDirect traffic crosses the UPI link and throughput drops 20–40% — silently, with no link-layer error.
These 4 checks together = "install-clean." Skipping any one of them is how a fabric mysteriously underperforms with no visible error to point at. The DCGM pass is hardware sanity; hugepages + ulimit -l are kernel sanity; the NUMA script is topology sanity. Miss one, debug for days.
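The two kernel-sanity checks are easy to automate. This sketch parses text in the `/proc/meminfo` format and inspects a locked-memory limit as returned by Python's `resource.getrlimit`. The function names and the sample text are illustrative.

```python
import resource


def hugepage_pool_gib(meminfo_text):
    """Size of the pre-allocated hugepage pool in GiB, from /proc/meminfo text."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, value = line.partition(":")
        fields[key.strip()] = value.split()
    total_pages = int(fields["HugePages_Total"][0])
    page_kib = int(fields["Hugepagesize"][0])   # the kernel reports this in kB
    return total_pages * page_kib / (1024 * 1024)


def memlock_is_unlimited(limits):
    """limits is a (soft, hard) pair; the soft limit is what `ulimit -l` shows."""
    soft, _hard = limits
    return soft == resource.RLIM_INFINITY


# Illustrative excerpt matching the numbers above.
sample = """HugePages_Total:    8192
HugePages_Free:     8192
Hugepagesize:       2048 kB
"""
print(hugepage_pool_gib(sample))   # 16.0
```

On a live host you would pass in the contents of `/proc/meminfo` and call `memlock_is_unlimited(resource.getrlimit(resource.RLIMIT_MEMLOCK))` from the same user and shell context that launches the training job, since limits are inherited per process.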

Wire all four into your provisioning playbook — Ansible, Foreman, MAAS, whatever you use to stamp out hosts. The cost is ~30 seconds at install time; the cost of skipping is debug days later, usually after a customer training run has already burned a weekend of GPU hours. The same install gate is the right place to bake in nvidia-fabricmanager health (systemctl is-active nvidia-fabricmanager) and a mlxlink transceiver-health sweep across all 8 ports — by the time the host is offered to the scheduler, every link, every NIC, every GPU, every pinning rule has been proven, not assumed.
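A sketch of what that install gate could look like once each check has been reduced to a pass/fail value. The NUMA node numbers would come from `/sys/bus/pci/devices/<BDF>/numa_node` for each GPU and NIC. All names here are hypothetical, and the miswired rail is invented for the example.

```python
def numa_mismatches(pairs):
    """pairs maps a rail name to (gpu_numa_node, nic_numa_node)."""
    return [rail for rail, (gpu_node, nic_node) in pairs.items() if gpu_node != nic_node]


def install_gate(checks):
    """checks maps a check name to True/False; returns (admit, failed_names)."""
    failed = [name for name, passed in checks.items() if not passed]
    return (not failed, failed)


# GPUs 0-3 on node 0, GPUs 4-7 on node 1, with one hypothetical miswired rail.
pairs = {f"gpu{i}+mlx5_{i}": (i // 4, i // 4) for i in range(8)}
pairs["gpu6+mlx5_6"] = (1, 0)

checks = {
    "dcgm_diag": True,
    "hugepages": True,
    "memlock_unlimited": True,
    "numa_affinity": not numa_mismatches(pairs),
}
print(install_gate(checks))   # (False, ['numa_affinity'])
```

Returning the failed check names, not just a boolean, gives the provisioning log something to point at when a host is rejected.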

💡 What you should remember


🔍	`nvidia-smi topo -m` is the rail map	NCCL reads this matrix to build AllReduce rings. NV18 between every GPU pair is the all-to-all proof. `PIX` cells mark GPU↔NIC rail pairs.
📍	NUMA pinning is load-bearing	GPU, NIC, and CPU cores must share a NUMA node. Cross-socket over UPI silently steals 20–40% of AllReduce bandwidth.
🧬	Firmware drift kills clusters	`mlxfwmanager --query` across all 8 NICs must show one FW version. One drifted server tanks the whole job and leaves nothing in `dmesg`.
💾	RAID 0 on `md1` is a performance choice, not durability	Any U.2 SMART warning = drain the host, don't wait. The dataset cache has no resilience.
🪪	BMC IP is the OOB lifeline	Dedicated mgmt subnet, never on the data fabric. `ipmitool dcmi power reading` feeds rack-level PDU sizing — add 15% headroom for transients.
⚡	PSUs vote: min 4 of 6 to stay up	4+2 redundancy. Loss of 1 = warning, loss of 2 = critical, loss of 3 = host dies. NCCL will not silently route around it.
🚦	Pre-flight or pay later	DCGM diag, HugePages, `ulimit -l unlimited`, NUMA affinity — gate every node on these 4 checks. 30 seconds at install vs days of debug later.

Next: AI Fabric Architecture → — the network around these servers. Spine-leaf with AI twists, rail-optimized topology, hash polarization, and what stays the same vs what changes from a traditional DC CLOS.
