Debugging Tools You'll Use
Most production AI cluster debugging is "run a command, read the output." This page is the inventory — the ten or so commands you'll run all the time, what good output looks like, and what bad output points at.
The top 10, in rough order of how often you'll use them
| Command | What it does | When you'll use it |
|---|---|---|
| ip | Interfaces, addresses, routes, namespaces | Constantly |
| ethtool | NIC details, link state, driver info, statistics | When a NIC misbehaves |
| rdma | RDMA device state, links, GIDs | RDMA-specific issues |
| ibstat / ibv_devinfo | Detailed RDMA device info | Verifying RDMA setup |
| lspci | Hardware enumeration | "Is the NIC even there?" |
| numactl | NUMA topology + binding | NUMA mismatch debugging |
| dmesg | Kernel ring buffer | Driver / hardware errors |
| ss | Socket state | TCP / management traffic |
| nvidia-smi | GPU state + topology | GPU-side issues |
| perftest (ib_write_bw, etc.) | RDMA microbenchmarks | Validating fabric throughput |
The rest of this page is what each looks like in practice.
ip — interfaces, addresses, routes, namespaces
This is the everyday command. Memorize these forms:
ip -br link show # one line per interface (super useful)
ip -br addr show # interfaces + IPs
ip -d link show enp1s0 # detailed (driver, NUMA, etc.)
ip route show # routing table
ip rule show # routing rules (which table for which traffic)
ip neigh show # ARP / neighbor table
ip netns list # network namespaces
ip netns exec <ns> ip addr # run inside a namespace
What good looks like: all expected interfaces UP, sensible IPs, expected routes.
What bad looks like:
- Interface DOWN that should be UP → check cable, optic, switch port
- IP missing → config didn't apply or got reverted
- Wrong route → routing rule sending traffic the wrong way (multi-rail issue)
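If you want a script-friendly version of the first two checks, a one-liner along these lines works (lo is excluded because its state reads UNKNOWN; interface names are whatever your hosts use):

# Print any interface that isn't UP
ip -br link show | awk '$2 != "UP" && $1 != "lo" {print "NOT UP:", $0}'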
ethtool — the NIC's CLI
ethtool is to a Linux NIC what show interface is to a switch port.
ethtool enp1s0 # link state, speed, duplex, autoneg
ethtool -i enp1s0 # driver name, version, firmware
ethtool -S enp1s0 # statistics — hundreds of counters
ethtool -S enp1s0 | grep -i drop # discards / drops
ethtool -S enp1s0 | grep -i prio # per-priority counters (PFC, DCQCN)
ethtool -k enp1s0 # offload features
ethtool --show-coalesce enp1s0 # interrupt coalescing config
The counters that matter for RDMA hosts:
ethtool -S enp1s0 | grep -E "(rx_prio3_pause|tx_prio3_pause|out_of_buffer|out_of_sequence|np_cnp_sent|rp_cnp_handled)"
| Counter | What it means |
|---|---|
| rx_prio3_pause | PFC pause frames received on the RoCE priority |
| tx_prio3_pause | PFC pause frames sent (= you're congesting your upstream) |
| out_of_buffer | Receive buffer overflow → the "lossless" fabric isn't actually lossless |
| out_of_sequence | Packets arriving out of order → adaptive routing or multipath reordering |
| np_cnp_sent | CNPs sent by the notification point (you're echoing ECN marks back to senders) |
| rp_cnp_handled | CNPs handled by the reaction point (DCQCN dialed your send rate back) |
rx_prio3_pause and tx_prio3_pause should be near zero in steady state. np_cnp_sent and rp_cnp_handled should be non-zero under load (DCQCN doing its job).
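Absolute values matter less than rates, so compare two snapshots a few seconds apart. A minimal sketch (enp1s0 is the example interface from above; exact counter names vary slightly between driver versions):

# Snapshot the interesting counters, wait, snapshot again, diff
IF=enp1s0
ethtool -S $IF | grep -E "prio3_pause|out_of_buffer|out_of_sequence|cnp" > /tmp/ctrs.before
sleep 10
ethtool -S $IF | grep -E "prio3_pause|out_of_buffer|out_of_sequence|cnp" > /tmp/ctrs.after
diff /tmp/ctrs.before /tmp/ctrs.after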
rdma — RDMA device state
The rdma command from iproute2-rdma is the modern RDMA-specific equivalent of ip.
rdma link # list all RDMA links + state
rdma resource show # resources in use (QPs, CQs, MRs)
rdma resource show qp link mlx5_0 # QP detail on one device
rdma statistic show # device-level stats
rdma system show # global config
What good looks like:
$ rdma link
link mlx5_0/1 state ACTIVE physical_state LINK_UP netdev enp1s0np0
link mlx5_1/1 state ACTIVE physical_state LINK_UP netdev enp2s0np0
... (one per NIC)
All ACTIVE / LINK_UP. If any are DOWN or POLLING, something's wrong with that port.
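On an 8-NIC host it's easier to print only the problems; if this one-liner outputs nothing, every RDMA link is healthy:

# Show only RDMA links that are NOT ACTIVE / LINK_UP
rdma link | grep -Ev "state ACTIVE physical_state LINK_UP"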
ibstat / ibv_devinfo — the verbose detail
For when rdma link isn't enough:
ibstat # summary per device + port
ibv_devinfo # detailed per device + port
ibv_devinfo -v # very detailed
$ ibstat mlx5_0
CA 'mlx5_0'
        CA type: MT4129
        Number of ports: 1
        Firmware version: 28.42.1000
        Hardware version: 0
        Node GUID: 0xa088c200015b1f80
        System image GUID: 0xa088c200015b1f80
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 400
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x00010000
                Port GUID: 0xa288c2fffe5b1f80
                Link layer: Ethernet
Things to verify:
- State: Active (not Init or Down)
- Rate: 400 (expected speed; if you see 200 or 100, link is degraded)
- Link layer: Ethernet (for RoCE; would be InfiniBand for IB)
- Firmware version: matches across all NICs in the cluster (mixed firmware causes subtle bugs)
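For the firmware check, one option is to read the version from sysfs per device (the path below is what mlx5 exposes; other drivers may differ) and diff the output across hosts:

# Firmware version for every RDMA device on this host
for dev in /sys/class/infiniband/*; do
    echo "$(basename $dev): $(cat $dev/fw_ver)"
done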
lspci — hardware enumeration
Before chasing software issues, confirm the hardware is even there:
lspci | grep -i -E "mellanox|connect|nvidia|broadcom|intel.*eth"
lspci -tv # PCIe tree
lspci -vv -s 81:00.0 # super-detailed for one device
For a healthy 8-NIC server, you should see 8 physical functions (PFs) plus however many virtual functions (VFs) you've configured:
81:00.0 Ethernet controller: Mellanox Technologies MT2910 Family [ConnectX-7]
81:00.1 Ethernet controller: Mellanox Technologies MT2910 Family [ConnectX-7]
81:00.2 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
...
If a NIC is missing from lspci, it's either dead, seated in the wrong slot, or the BIOS isn't enabling those PCIe lanes.
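Presence isn't the whole story: a NIC that trained at reduced PCIe width or speed shows up fine in lspci but caps your throughput. Compare the negotiated link against the capability (bus address is the example one from above; reading the Express capability block usually needs root):

# LnkCap = what the device can do, LnkSta = what it actually negotiated
lspci -vv -s 81:00.0 | grep -E "LnkCap:|LnkSta:"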
numactl — NUMA awareness
numactl --hardware # show NUMA layout
numactl --show # show current process binding
cat /sys/class/net/enp1s0/device/numa_node # which NUMA the NIC is on
nvidia-smi topo -m # GPU↔NIC NUMA mapping (gold)
nvidia-smi topo -m output legend:
- PIX = traverses at most a single PCIe bridge (same PCIe switch, fastest)
- PHB = traverses a PCIe host bridge (the CPU), same NUMA node
- NODE = traverses the interconnect between PCIe host bridges within one NUMA node
- SYS = traverses the inter-socket interconnect between NUMA nodes (slowest, and a sign of bad GPU↔NIC placement)
Pair GPUs with NICs that show PIX or PHB. Anything else costs throughput.
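To dump the NIC→NUMA mapping for every interface in one shot (a value of -1 means the platform didn't report a node), something like:

# NUMA node for each physical network interface
for f in /sys/class/net/*/device/numa_node; do
    echo "$(echo $f | cut -d/ -f5): $(cat $f)"
done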
dmesg — kernel log
The kernel ring buffer. Where driver errors, OOMs, hardware errors, and "weird things" show up.
dmesg | tail -100
dmesg -T | grep -i mlx # NIC driver messages with timestamps
dmesg -T | grep -i pcie # PCIe issues
dmesg -T | grep -i iommu # IOMMU messages
If a NIC link flapped, the mlx5 module crashed, or a PCIe link error occurred, dmesg has the record.
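Link flaps are the classic thing to grep for; the exact message text varies by driver version, so treat this as a sketch:

# Link up/down transitions reported by NIC drivers
dmesg -T | grep -iE "link (is )?(up|down)"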
ss — TCP/socket state
For the eth0 management path and any TCP traffic (storage, control plane, NCCL bootstrap):
ss -tnlp # listening TCP sockets + process
ss -tn state established # established connections
ss -tn dst 10.5.0.10 # connections to a specific peer
ss -tin # with TCP info (cwnd, rtt, etc.)
ss -s # summary
ss replaced netstat years ago. Faster, better output, modern flags.
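One thing worth checking on the management and storage paths is retransmissions; ss -i prints a retrans field per connection once segments have been retransmitted (output layout varies by kernel version, so this is only a sketch):

# Established connections that report retransmitted segments
ss -tin state established | grep -B1 "retrans:"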
nvidia-smi — GPU state
Even though it's a GPU command, network engineers run it constantly because GPU state often explains apparent network issues.
nvidia-smi # status of all GPUs
nvidia-smi topo -m # NUMA topology (GPU↔NIC affinity)
nvidia-smi nvlink -s # NVLink status
nvidia-smi dmon # streaming per-second telemetry
nvidia-smi --query-gpu=index,name,temperature.gpu,utilization.gpu,memory.used --format=csv
If a GPU is throttling thermally, the training job slows down — looks like a network problem until you check nvidia-smi.
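To confirm or rule out throttling directly rather than inferring it from temperature, nvidia-smi can report the active throttle reasons (field availability varies by driver version):

nvidia-smi --query-gpu=index,clocks_throttle_reasons.hw_thermal_slowdown,clocks_throttle_reasons.sw_thermal_slowdown,clocks_throttle_reasons.hw_power_brake_slowdown --format=csv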
perftest — the RDMA microbenchmark
Not strictly a debug tool but indispensable. ib_write_bw, ib_read_bw, ib_send_bw measure throughput between two RDMA-capable hosts.
# Server side
ib_write_bw -d mlx5_0
# Client side (other host)
ib_write_bw -d mlx5_0 <server-ip>
Expected on a 400 G NIC with healthy fabric: ~370 Gbps for 1 MB+ messages. If you see less, there's a problem somewhere — drivers, QoS, cables, or fabric.
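In practice you'll usually pin the benchmark to the NUMA node that owns the NIC and ask for Gbit/s output; a sketch using standard perftest flags (NUMA node 1 and the 1 MB message size are just examples):

# Server side
numactl --cpunodebind=1 --membind=1 ib_write_bw -d mlx5_0 --report_gbits -s 1048576

# Client side (other host)
numactl --cpunodebind=1 --membind=1 ib_write_bw -d mlx5_0 --report_gbits -s 1048576 <server-ip>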
A debug workflow
When something's wrong, this is roughly the sequence:
# 1. Is the hardware there?
lspci | grep -i mellanox
# 2. Is the driver loaded?
lsmod | grep mlx5
# 3. Is the link up?
ip -br link show
rdma link
# 4. Are there errors?
dmesg -T | tail -50
ethtool -S enp1s0 | grep -i error
# 5. Are PFC / DCQCN active?
ethtool -S enp1s0 | grep -E "(prio3_pause|cnp)"
# 6. Can RDMA actually move bytes?
ib_write_bw <peer>
If step 6 hits expected throughput, the host is healthy. If anything fails earlier, you've found the layer.
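If you run this sequence often, it's worth a tiny wrapper. A sketch that assumes mlx5 NICs and takes the RoCE interface and an optional peer as arguments (names and checks are illustrative, not a definitive health check):

#!/bin/bash
# Quick host health pass: hardware -> driver -> link -> errors -> QoS -> throughput
IF=${1:-enp1s0}
PEER=$2

lspci | grep -qi mellanox && echo "OK: NIC present" || echo "FAIL: no NIC in lspci"
lsmod | grep -q mlx5_core && echo "OK: driver loaded" || echo "FAIL: mlx5 not loaded"
ip -br link show "$IF" | grep -qw UP && echo "OK: $IF is UP" || echo "FAIL: $IF not UP"
dmesg -T | tail -50 | grep -iq error && echo "WARN: recent kernel errors" || echo "OK: no recent kernel errors"
# Non-zero pause / CNP counters are worth a look, not automatically fatal
ethtool -S "$IF" | grep -E "prio3_pause|cnp" | grep -v ": 0$"
[ -n "$PEER" ] && ib_write_bw -d mlx5_0 --report_gbits "$PEER"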
What you should remember
- The big 10: ip, ethtool, rdma, ibstat, lspci, numactl, dmesg, ss, nvidia-smi, perftest. Memorize them.
- Counters tell the story. ethtool -S produces hundreds; you only care about ~5 (PFC, drops, CNP, out-of-sequence).
- nvidia-smi topo -m is the source of truth for GPU↔NIC affinity. Get this right at deploy time.
- perftest is the proof-positive. If ib_write_bw doesn't hit ~95% of line rate, something is wrong.
- Debug bottom-up. Hardware → driver → link → protocol → application. Each layer's tools are different.
You're done with the Linux section. Head to Kubernetes for Network Engineers → for the k8s side, or jump to Cluster Build Guide for the practical build steps.