Debugging Tools You'll Use
Most production AI cluster debugging is "run a command, read the output." This page is the inventory — the ten or so commands you'll run all the time, what good output looks like, and what bad output points at.
- Reach for the right tool fast: `ip`, `ethtool`, `rdma`, `ibstat`, `lspci`, `numactl`, `dmesg`, `ss`, `nvidia-smi`, `perftest`. Know which layer each one interrogates.
- Read the RoCE counters that matter: pull `rx_prio3_pause`/`tx_prio3_pause`, `out_of_buffer`, `out_of_sequence`, `np_cnp_sent`, `rp_cnp_handled` from `ethtool -S` and say what each one means.
- Verify the RDMA link end-to-end: confirm `rdma link` is ACTIVE/LINK_UP, read `ibstat` for State/Rate/Link-layer/firmware, and map GPU↔NIC affinity with `nvidia-smi topo -m` (PIX/PHB/NODE/SYS).
- Run the bottom-up workflow: hardware (`lspci`) → driver (`lsmod`) → link (`rdma link`) → protocol (PFC/DCQCN counters) → proof (`ib_write_bw` at ~370 Gbps on 400G).
Walk through a real debug session
A complete 6-step debug runs from "is the hardware there?" to "does RDMA actually move bytes?".
Each step rules out one layer. This is the exact rhythm you'll fall into when on-call wakes you up at 3 AM.
The top 10, in rough order of how often you'll use them
| Command | What it does | When you'll use it |
|---|---|---|
| `ip` | Interfaces, addresses, routes, namespaces | Constantly |
| `ethtool` | NIC details, link state, driver info, statistics | When a NIC misbehaves |
| `rdma` | RDMA device state, links, GIDs | RDMA-specific issues |
| `ibstat` / `ibv_devinfo` | Detailed RDMA device info | Verifying RDMA setup |
| `lspci` | Hardware enumeration | "Is the NIC even there?" |
| `numactl` | NUMA topology + binding | NUMA mismatch debugging |
| `dmesg` | Kernel ring buffer | Driver / hardware errors |
| `ss` | Socket state | TCP / management traffic |
| `nvidia-smi` | GPU state + topology | GPU-side issues |
| `perftest` (`ib_write_bw`, etc.) | RDMA microbenchmarks | Validating fabric throughput |
The rest of this page is what each looks like in practice.
ip — interfaces, addresses, routes, namespaces
This is the everyday command. Memorize these forms:
ip -br link show # one line per interface (super useful)
ip -br addr show # interfaces + IPs
ip -d link show enp1s0 # detailed (link type, MTU limits, queue counts; use ethtool -i for the driver)
ip route show # routing table
ip rule show # routing rules (which table for which traffic)
ip neigh show # ARP / neighbor table
ip netns list # network namespaces
ip netns exec <ns> ip addr # run inside a namespace
What good looks like: all expected interfaces UP, sensible IPs, expected routes.
What bad looks like:
- Interface DOWN that should be UP → check cable, optic, switch port
- IP missing → config didn't apply or got reverted
- Wrong route → routing rule sending traffic the wrong way (multi-rail issue)
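On a host with many interfaces, the "interface DOWN that should be UP" check is easy to script. The filter below is a minimal sketch: `down_links` is a hypothetical helper name, and it assumes the usual `ip -br link show` layout of name, state, MAC, flags.

```shell
# down_links: read `ip -br link show` output on stdin and print
# "NAME STATE" for every interface whose state is not UP.
# Loopback is skipped because it normally reports UNKNOWN.
down_links() {
    awk '$1 != "lo" && $2 != "UP" { print $1 " " $2 }'
}

# Usage on a live host:
#   ip -br link show | down_links
```

Empty output means every non-loopback interface reports UP. Interfaces that are intentionally down will show up too, so treat the list as a prompt to look, not as a verdict.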
ethtool — the NIC's CLI
ethtool is to a Linux NIC what show interface is to a switch port.
ethtool enp1s0 # link state, speed, duplex, autoneg
ethtool -i enp1s0 # driver name, version, firmware
ethtool -S enp1s0 # statistics — hundreds of counters
ethtool -S enp1s0 | grep -i drop # discards / drops
ethtool -S enp1s0 | grep -i prio # per-priority counters (traffic and PFC pause per priority)
ethtool -k enp1s0 # offload features
ethtool --show-coalesce enp1s0 # interrupt coalescing config
The counters that matter for RDMA hosts:
ethtool -S enp1s0 | grep -E "(rx_prio3_pause|tx_prio3_pause|out_of_buffer|out_of_sequence|np_cnp_sent|rp_cnp_handled)"
| Counter | What it means |
|---|---|
| `rx_prio3_pause` | PFC pauses received on the RoCE priority (priority 3 here) |
| `tx_prio3_pause` | PFC pauses sent (= this host is back-pressuring its upstream switch) |
| `out_of_buffer` | Receive buffer overflow: packets dropped because no receive buffer was available → the receiver can't keep up; should stay at zero |
| `out_of_sequence` | Packets arriving out of order → packet loss, adaptive routing, or multipath |
| `np_cnp_sent` | CNPs sent while acting as Notification Point (you're echoing back ECN marks) |
| `rp_cnp_handled` | CNPs handled while acting as Reaction Point (DCQCN dialed back the send rate) |
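rx_prio3_pause and tx_prio3_pause should stay near zero (or at least stop climbing) in steady state. np_cnp_sent and rp_cnp_handled should be non-zero under load (DCQCN doing its job). If the grep returns nothing for the CNP or out_of_sequence counters, your driver may expose them elsewhere: on mlx5 they typically live under /sys/class/infiniband/<dev>/ports/<port>/hw_counters/ and also show up in rdma statistic show.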
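These counters are cumulative since driver load, so the change over an interval usually tells you more than the absolute value. A rough way to diff two snapshots is sketched below. `counter_delta` is a hypothetical helper name, and it assumes the standard `ethtool -S` layout of indented `name: value` lines.

```shell
# counter_delta BEFORE_FILE AFTER_FILE
# Each file holds the output of `ethtool -S <iface>`.
# Prints "name delta" for every counter whose value changed.
counter_delta() {
    awk -F': *' '
        { gsub(/^[ \t]+/, "", $1) }            # strip the ethtool indentation
        NR == FNR { before[$1] = $2; next }    # first file: remember each value
        ($1 in before) && ($2 != before[$1]) {
            printf "%s %.0f\n", $1, $2 - before[$1]
        }
    ' "$1" "$2"
}

# Usage on a live host:
#   ethtool -S enp1s0 > /tmp/before; sleep 10; ethtool -S enp1s0 > /tmp/after
#   counter_delta /tmp/before /tmp/after | grep -E "pause|out_of_buffer"
```

A pause counter that grows by thousands in ten seconds is a very different story from one that has sat at the same value for a week.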
rdma — RDMA device state
The rdma command from iproute2-rdma is the modern RDMA-specific equivalent of ip.
rdma link # list all RDMA links + state
rdma resource show # resources in use (QPs, CQs, MRs)
rdma resource show qp link mlx5_0 # QP detail on one device
rdma statistic show # device-level stats
rdma system show # global config
What good looks like:
$ rdma link
link mlx5_0/1 state ACTIVE physical_state LINK_UP netdev enp1s0np0
link mlx5_1/1 state ACTIVE physical_state LINK_UP netdev enp2s0np0
... (one per NIC)
All ACTIVE / LINK_UP. If any are DOWN or POLLING, something's wrong with that port.
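With eight or more ports per host, a filter beats eyeballing. The sketch below assumes the `rdma link` line format shown above (keyword followed by value). `bad_rdma_links` is a hypothetical helper name.

```shell
# bad_rdma_links: read `rdma link` output on stdin and print
# "DEV/PORT STATE PHYSICAL_STATE" for every link that is not
# both ACTIVE and LINK_UP.
bad_rdma_links() {
    awk '$1 == "link" {
        state = ""; phys = ""
        for (i = 2; i < NF; i++) {
            if ($i == "state") state = $(i + 1)
            if ($i == "physical_state") phys = $(i + 1)
        }
        if (state != "ACTIVE" || phys != "LINK_UP") print $2, state, phys
    }'
}

# Usage on a live host:
#   rdma link | bad_rdma_links
```

No output means every port the kernel knows about is healthy. It says nothing about a port that is missing entirely, which is what lspci is for.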
ibstat / ibv_devinfo — the verbose detail
For when rdma link isn't enough:
ibstat # summary per device + port
ibv_devinfo # detailed per device + port
ibv_devinfo -v # very detailed
$ ibstat mlx5_0
CA 'mlx5_0'
CA type: MT4129
Number of ports: 1
Firmware version: 28.42.1000
Hardware version: 0
Node GUID: 0xa088c200015b1f80
System image GUID: 0xa088c200015b1f80
Port 1:
State: Active
Physical state: LinkUp
Rate: 400
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x00010000
Port GUID: 0xa288c2fffe5b1f80
Link layer: Ethernet
Things to verify:
- State: Active (not Init or Down)
- Rate: 400 (expected speed; if you see 200 or 100, link is degraded)
- Link layer: Ethernet (for RoCE; would be InfiniBand for IB)
- Firmware version: matches across all NICs in the cluster (mixed firmware causes subtle bugs)
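The firmware check is the one people skip because it is tedious by hand. A minimal sketch for one host is below. `firmware_versions` is a hypothetical helper name, and it assumes each device block in `ibstat` contains a `Firmware version:` line as in the sample above. For the whole cluster you would run it on every node and compare the results.

```shell
# firmware_versions: read `ibstat` output on stdin and print each
# distinct firmware version followed by the number of devices using it.
firmware_versions() {
    awk -F': *' '/Firmware version/ { count[$2]++ }
        END { for (v in count) print v, count[v] }' | sort
}

# Usage on a live host:
#   ibstat | firmware_versions
```

One output line means the host is uniform. Two or more lines means mixed firmware on the same box.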
lspci — hardware enumeration
Before chasing software issues, confirm the hardware is even there:
lspci | grep -i -E "mellanox|connect|nvidia|broadcom|intel.*eth"
lspci -tv # PCIe tree
lspci -vv -s 81:00.0 # super-detailed for one device
For a healthy 8-NIC server, you should see 8 PFs + however-many VFs:
81:00.0 Ethernet controller: Mellanox Technologies MT2910 Family [ConnectX-7]
81:00.1 Ethernet controller: Mellanox Technologies MT2910 Family [ConnectX-7]
81:00.2 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
...
If a NIC is missing from lspci, the card is dead, it's in the wrong slot, or the BIOS isn't enabling that lane.
numactl — NUMA awareness
numactl --hardware # show NUMA layout
numactl --show # show current process binding
cat /sys/class/net/enp1s0/device/numa_node # which NUMA the NIC is on
nvidia-smi topo -m # GPU↔NIC NUMA mapping (gold)
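nvidia-smi topo -m output legend, from closest to farthest:
- PIX = crosses at most a single PCIe bridge (same PCIe switch, fastest)
- PXB = crosses multiple PCIe bridges without reaching the host bridge
- PHB = crosses a PCIe host bridge (typically the CPU)
- NODE = crosses the interconnect between PCIe host bridges within one NUMA node
- SYS = crosses the SMP interconnect between NUMA nodes (e.g., QPI/UPI), slowest
Pair each GPU with a NIC that shows PIX or PXB, with PHB as the fallback. NODE and SYS cost throughput.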
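To see the NIC side of the pairing for every interface at once, you can walk sysfs. This is a sketch: `list_nic_numa` is a hypothetical helper name, and the optional argument (a sysfs root other than `/sys`) exists only to make the function easy to try against a test directory.

```shell
# list_nic_numa [SYSFS_ROOT]
# Prints "IFACE NUMA_NODE" for every network interface backed by a
# device that reports a numa_node. A value of -1 means the platform
# did not report NUMA locality for that device.
list_nic_numa() {
    local root="${1:-/sys}" f iface
    for f in "$root"/class/net/*/device/numa_node; do
        [ -e "$f" ] || continue          # glob matched nothing
        iface="${f#"$root"/class/net/}"  # drop the leading path
        iface="${iface%%/*}"             # keep only the interface name
        printf '%s %s\n' "$iface" "$(cat "$f")"
    done
}

# Usage on a live host:
#   list_nic_numa
```

Compare the result with the CPU and memory layout from numactl --hardware when deciding where to pin processes.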
dmesg — kernel log
The kernel ring buffer. Where driver errors, OOMs, hardware errors, and "weird things" show up.
dmesg | tail -100
dmesg -T | grep -i mlx # NIC driver messages with timestamps
dmesg -T | grep -i pcie # PCIe issues
dmesg -T | grep -i iommu # IOMMU messages
If a NIC link flapped, the mlx5 module crashed, or a PCIe link error happened, dmesg has the record.
ss — TCP/socket state
For the eth0 management path and any TCP traffic (storage, control plane, NCCL bootstrap):
ss -tnlp # listening TCP sockets + process
ss -tn state established # established connections
ss -tn dst 10.5.0.10 # connections to a specific peer
ss -tin # with TCP info (cwnd, rtt, etc.)
ss -s # summary
ss replaced netstat years ago. Faster, better output, modern flags.
nvidia-smi — GPU state
Even though it's a GPU command, network engineers run it constantly because GPU state often explains apparent network issues.
nvidia-smi # status of all GPUs
nvidia-smi topo -m # NUMA topology (GPU↔NIC affinity)
nvidia-smi nvlink -s # NVLink status
nvidia-smi dmon # streaming per-second telemetry
nvidia-smi --query-gpu=index,name,temperature.gpu,utilization.gpu,memory.used --format=csv
If a GPU is throttling thermally, the training job slows down — looks like a network problem until you check nvidia-smi.
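A quick way to spot hot GPUs is to filter the CSV query output. The sketch below is an illustration only: `hot_gpus` is a hypothetical helper name, and the 85 °C default is an assumed placeholder, not a vendor threshold. Check the slowdown temperature for your GPU model.

```shell
# hot_gpus [MAX_TEMP_C]
# Reads "index, temperature" CSV lines on stdin and prints the GPUs
# whose temperature exceeds MAX_TEMP_C (default 85, an assumed value).
hot_gpus() {
    awk -F', *' -v max="${1:-85}" '$2 + 0 > max { print "GPU " $1 ": " $2 " C" }'
}

# Usage on a live host:
#   nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader,nounits | hot_gpus 85
```

If this prints anything while a job is "slow on the network", look at cooling before you look at the fabric.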
perftest — the RDMA microbenchmark
Not strictly a debug tool but indispensable. ib_write_bw, ib_read_bw, ib_send_bw measure throughput between two RDMA-capable hosts.
# Server side
ib_write_bw -d mlx5_0
# Client side (other host)
ib_write_bw -d mlx5_0 <server-ip>
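Expected on a 400 G NIC with a healthy fabric: ~370 Gbps for 1 MB+ messages. perftest reports MB/sec by default, so add --report_gbits on both sides to read the result in Gbps, and use -s to set the message size. If you see less, there's a problem somewhere: drivers, QoS, cables, or fabric.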
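The arithmetic behind that target is simple: 370 out of 400 is 92.5% of line rate. A small sketch for turning a measured number into a pass/fail is below. Both function names are hypothetical, and the 90% default threshold is an assumption chosen to sit just under the ~370 Gbps figure.

```shell
# pct_of_line_rate MEASURED_GBPS LINE_RATE_GBPS
# Prints the measured throughput as a percentage of line rate.
pct_of_line_rate() {
    awk -v m="$1" -v l="$2" 'BEGIN { printf "%.1f\n", 100 * m / l }'
}

# bw_ok MEASURED_GBPS LINE_RATE_GBPS [MIN_PCT]
# Exit status 0 if the measurement reaches MIN_PCT of line rate
# (default 90, an assumed threshold), 1 otherwise.
bw_ok() {
    awk -v m="$1" -v l="$2" -v p="${3:-90}" 'BEGIN { exit !(100 * m / l >= p) }'
}

# Usage:
#   pct_of_line_rate 370 400     # prints 92.5
#   bw_ok 370 400 && echo healthy
```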
A debug workflow
When something's wrong, this is roughly the sequence:
# 1. Is the hardware there?
lspci | grep -i mellanox
# 2. Is the driver loaded?
lsmod | grep mlx5
# 3. Is the link up?
ip -br link show
rdma link
# 4. Are there errors?
dmesg -T | tail -50
ethtool -S enp1s0 | grep -i err # matches both *_err and *_errors counters
# 5. Are PFC / DCQCN active?
ethtool -S enp1s0 | grep -E "(prio3_pause|cnp)"
# 6. Can RDMA actually move bytes?
ib_write_bw -d mlx5_0 <peer> # with ib_write_bw -d mlx5_0 already running on the peer
If step 6 hits expected throughput, the host is healthy. If anything fails earlier, you've found the layer.
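The "stop at the first failing layer" rhythm can be captured in a few lines. This is a sketch, not a finished health check: `run_layers` is a hypothetical helper name, and the example checks in the usage comment are deliberately crude (for instance, the link check passes if any one port is ACTIVE).

```shell
# run_layers NAME CMD [NAME CMD ...]
# Runs each check in order with sh -c, prints ok/FAIL per layer,
# and stops at the first failure so the broken layer is obvious.
run_layers() {
    while [ "$#" -ge 2 ]; do
        if sh -c "$2" >/dev/null 2>&1; then
            echo "ok: $1"
        else
            echo "FAIL: $1"
            return 1
        fi
        shift 2
    done
}

# Usage on a live host:
#   run_layers \
#     hardware 'lspci | grep -qi mellanox' \
#     driver   'lsmod | grep -q mlx5' \
#     link     'rdma link | grep -q "state ACTIVE"'
```

The throughput test is left out on purpose, since it needs a cooperating peer. Run it by hand once the earlier layers pass.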
💡 What you should remember
| # | | Concept | Why it matters |
|---|---|---|---|
| 1 | 🛠️ | The big 10 | `ip`, `ethtool`, `rdma`, `ibstat`, `lspci`, `numactl`, `dmesg`, `ss`, `nvidia-smi`, `perftest`. Memorize them. |
| 2 | 📊 | Counters tell the story | `ethtool -S` produces hundreds; you only care about a handful (PFC, drops, CNP, out-of-sequence). |
| 3 | 🔌 | `nvidia-smi topo -m` | The source of truth for GPU↔NIC affinity. Get this right at deploy time. |
| 4 | ⚡ | perftest is the proof-positive | If `ib_write_bw` doesn't reach roughly 90% of line rate (~370 Gbps on 400G), something is wrong. |
| 5 | 🔬 | Debug bottom-up | Hardware → driver → link → protocol → application. Each layer's tools are different. |
You're done with the Linux section. Head to Kubernetes for Network Engineers → for the k8s side, or jump to Cluster Build Guide for the practical build steps.