Debugging Tools You'll Use
Most production AI cluster debugging is "run a command, read the output." This page is the inventory — the ten or so commands you'll run all the time, what good output looks like, and what bad output points at.
The top 10, in rough order of how often you'll use them
| Command | What it does | When you'll use it |
|---|---|---|
| ip | Interfaces, addresses, routes, namespaces | Constantly |
| ethtool | NIC details, link state, driver info, statistics | When a NIC misbehaves |
| rdma | RDMA device state, links, GIDs | RDMA-specific issues |
| ibstat / ibv_devinfo | Detailed RDMA device info | Verifying RDMA setup |
| lspci | Hardware enumeration | "Is the NIC even there?" |
| numactl | NUMA topology + binding | NUMA mismatch debugging |
| dmesg | Kernel ring buffer | Driver / hardware errors |
| ss | Socket state | TCP / management traffic |
| nvidia-smi | GPU state + topology | GPU-side issues |
| perftest (ib_write_bw, etc.) | RDMA microbenchmarks | Validating fabric throughput |
The rest of this page is what each looks like in practice.
ip — interfaces, addresses, routes, namespaces
This is the everyday command. Memorize these forms:
ip -br link show # one line per interface (super useful)
ip -br addr show # interfaces + IPs
ip -d link show enp1s0 # detailed (driver, NUMA, etc.)
ip route show # routing table
ip rule show # routing rules (which table for which traffic)
ip neigh show # ARP / neighbor table
ip netns list # network namespaces
ip netns exec <ns> ip addr # run inside a namespace
What good looks like: all expected interfaces UP, sensible IPs, expected routes.
What bad looks like:
- Interface DOWN that should be UP → check cable, optic, switch port
- IP missing → config didn't apply or got reverted
- Wrong route → routing rule sending traffic the wrong way (multi-rail issue)
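If you want a script-friendly version of the first two checks, a one-liner along these lines works (lo is excluded because its state reads UNKNOWN; interface names are whatever your hosts use):

# Print any interface that isn't UP
ip -br link show | awk '$2 != "UP" && $1 != "lo" {print "NOT UP:", $0}'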
ethtool — the NIC's CLI
ethtool is to a Linux NIC what show interface is to a switch port.
ethtool enp1s0 # link state, speed, duplex, autoneg
ethtool -i enp1s0 # driver name, version, firmware
ethtool -S enp1s0 # statistics — hundreds of counters
ethtool -S enp1s0 | grep -i drop # discards / drops
ethtool -S enp1s0 | grep -i prio # per-priority counters (PFC, DCQCN)
ethtool -k enp1s0 # offload features
ethtool --show-coalesce enp1s0 # interrupt coalescing config
The counters that matter for RDMA hosts:
ethtool -S enp1s0 | grep -E "(rx_prio3_pause|tx_prio3_pause|out_of_buffer|out_of_sequence|np_cnp_sent|rp_cnp_handled)"
| Counter | What it means |
|---|---|
| rx_prio3_pause | PFC pause frames received on the RoCE priority |
| tx_prio3_pause | PFC pause frames sent (= you're congesting your upstream) |
| out_of_buffer | Receive buffer overflow → the "lossless" fabric isn't actually lossless |
| out_of_sequence | Packets arriving out of order → adaptive routing or multipath reordering |
| np_cnp_sent | CNPs sent by the notification point (you're echoing ECN marks back to senders) |
| rp_cnp_handled | CNPs handled by the reaction point (DCQCN dialed your send rate back) |
rx_prio3_pause and tx_prio3_pause should be near zero in steady state. np_cnp_sent and rp_cnp_handled should be non-zero under load (DCQCN doing its job).
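Absolute values matter less than rates, so compare two snapshots a few seconds apart. A minimal sketch (enp1s0 is the example interface from above; exact counter names vary slightly between driver versions):

# Snapshot the interesting counters, wait, snapshot again, diff
IF=enp1s0
ethtool -S $IF | grep -E "prio3_pause|out_of_buffer|out_of_sequence|cnp" > /tmp/ctrs.before
sleep 10
ethtool -S $IF | grep -E "prio3_pause|out_of_buffer|out_of_sequence|cnp" > /tmp/ctrs.after
diff /tmp/ctrs.before /tmp/ctrs.after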
rdma — RDMA device state
The rdma command from iproute2-rdma is the modern RDMA-specific equivalent of ip.
rdma link # list all RDMA links + state
rdma resource show # resources in use (QPs, CQs, MRs)
rdma resource show qp link mlx5_0 # QP detail on one device
rdma statistic show # device-level stats
rdma system show # global config
What good looks like:
$ rdma link
link mlx5_0/1 state ACTIVE physical_state LINK_UP netdev enp1s0np0
link mlx5_1/1 state ACTIVE physical_state LINK_UP netdev enp2s0np0
... (one per NIC)
All ACTIVE / LINK_UP. If any are DOWN or POLLING, something's wrong with that port.
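On an 8-NIC host it's easier to print only the problems; if this one-liner outputs nothing, every RDMA link is healthy:

# Show only RDMA links that are NOT ACTIVE / LINK_UP
rdma link | grep -Ev "state ACTIVE physical_state LINK_UP"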
ibstat / ibv_devinfo — the verbose detail
For when rdma link isn't enough:
ibstat # summary per device + port
ibv_devinfo # detailed per device + port
ibv_devinfo -v # very detailed
$ ibstat mlx5_0
CA 'mlx5_0'
        CA type: MT4129
        Number of ports: 1
        Firmware version: 28.42.1000
        Hardware version: 0
        Node GUID: 0xa088c200015b1f80
        System image GUID: 0xa088c200015b1f80
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 400
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x00010000
                Port GUID: 0xa288c2fffe5b1f80
                Link layer: Ethernet
Things to verify:
- State: Active (not Init or Down)
- Rate: 400 (expected speed; if you see 200 or 100, link is degraded)
- Link layer: Ethernet (for RoCE; would be InfiniBand for IB)
- Firmware version: matches across all NICs in the cluster (mixed firmware causes subtle bugs)
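For the firmware check, one option is to read the version from sysfs per device (the path below is what mlx5 exposes; other drivers may differ) and diff the output across hosts:

# Firmware version for every RDMA device on this host
for dev in /sys/class/infiniband/*; do
    echo "$(basename $dev): $(cat $dev/fw_ver)"
done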
lspci — hardware enumeration
Before chasing software issues, confirm the hardware is even there:
lspci | grep -i -E "mellanox|connect|nvidia|broadcom|intel.*eth"
lspci -tv # PCIe tree
lspci -vv -s 81:00.0 # super-detailed for one device
For a healthy 8-NIC server, you should see 8 physical functions (PFs) plus however many virtual functions (VFs) you've configured:
81:00.0 Ethernet controller: Mellanox Technologies MT2910 Family [ConnectX-7]
81:00.1 Ethernet controller: Mellanox Technologies MT2910 Family [ConnectX-7]
81:00.2 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
...
If a NIC is missing from lspci, it's either dead, seated in the wrong slot, or the BIOS isn't enabling those PCIe lanes.
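Presence isn't the whole story: a NIC that trained at reduced PCIe width or speed shows up fine in lspci but caps your throughput. Compare the negotiated link against the capability (bus address is the example one from above; reading the Express capability block usually needs root):

# LnkCap = what the device can do, LnkSta = what it actually negotiated
lspci -vv -s 81:00.0 | grep -E "LnkCap:|LnkSta:"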
numactl — NUMA awareness
numactl --hardware # show NUMA layout
numactl --show # show current process binding
cat /sys/class/net/enp1s0/device/numa_node # which NUMA the NIC is on
nvidia-smi topo -m # GPU↔NIC NUMA mapping (gold)
nvidia-smi topo -m output legend:
- PIX = traverses at most a single PCIe bridge (same PCIe switch, fastest)
- PHB = traverses a PCIe host bridge (the CPU), same NUMA node
- NODE = traverses the interconnect between PCIe host bridges within one NUMA node
- SYS = traverses the inter-socket interconnect between NUMA nodes (slowest, and a sign of bad GPU↔NIC placement)
Pair GPUs with NICs that show PIX or PHB. Anything else costs throughput.
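To dump the NIC→NUMA mapping for every interface in one shot (a value of -1 means the platform didn't report a node), something like:

# NUMA node for each physical network interface
for f in /sys/class/net/*/device/numa_node; do
    echo "$(echo $f | cut -d/ -f5): $(cat $f)"
done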
dmesg — kernel log
The kernel ring buffer. Where driver errors, OOMs, hardware errors, and "weird things" show up.
dmesg | tail -100
dmesg -T | grep -i mlx # NIC driver messages with timestamps
dmesg -T | grep -i pcie # PCIe issues
dmesg -T | grep -i iommu # IOMMU messages
If a NIC link flapped, the mlx5 module crashed, or a PCIe link error occurred, dmesg has the record.
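Link flaps are the classic thing to grep for; the exact message text varies by driver version, so treat this as a sketch:

# Link up/down transitions reported by NIC drivers
dmesg -T | grep -iE "link (is )?(up|down)"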
ss — TCP/socket state
For the eth0 management path and any TCP traffic (storage, control plane, NCCL bootstrap):
ss -tnlp # listening TCP sockets + process
ss -tn state established # established connections
ss -tn dst 10.5.0.10 # connections to a specific peer
ss -tin # with TCP info (cwnd, rtt, etc.)
ss -s # summary
ss replaced netstat years ago. Faster, better output, modern flags.
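One thing worth checking on the management and storage paths is retransmissions; ss -i prints a retrans field per connection once segments have been retransmitted (output layout varies by kernel version, so this is only a sketch):

# Established connections that report retransmitted segments
ss -tin state established | grep -B1 "retrans:"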
nvidia-smi — GPU state
Even though it's a GPU command, network engineers run it constantly because GPU state often explains apparent network issues.
nvidia-smi # status of all GPUs
nvidia-smi topo -m # NUMA topology (GPU↔NIC affinity)
nvidia-smi nvlink -s # NVLink status
nvidia-smi dmon # streaming per-second telemetry
nvidia-smi --query-gpu=index,name,temperature.gpu,utilization.gpu,memory.used --format=csv
If a GPU is throttling thermally, the training job slows down — looks like a network problem until you check nvidia-smi.
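To confirm or rule out throttling directly rather than inferring it from temperature, nvidia-smi can report the active throttle reasons (field availability varies by driver version):

nvidia-smi --query-gpu=index,clocks_throttle_reasons.hw_thermal_slowdown,clocks_throttle_reasons.sw_thermal_slowdown,clocks_throttle_reasons.hw_power_brake_slowdown --format=csv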
perftest — the RDMA microbenchmark
Not strictly a debug tool but indispensable. ib_write_bw, ib_read_bw, ib_send_bw measure throughput between two RDMA-capable hosts.
# Server side
ib_write_bw -d mlx5_0
# Client side (other host)
ib_write_bw -d mlx5_0 <server-ip>
Expected on a 400 G NIC with healthy fabric: ~370 Gbps for 1 MB+ messages. If you see less, there's a problem somewhere — drivers, QoS, cables, or fabric.
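In practice you'll usually pin the benchmark to the NUMA node that owns the NIC and ask for Gbit/s output; a sketch using standard perftest flags (NUMA node 1 and the 1 MB message size are just examples):

# Server side
numactl --cpunodebind=1 --membind=1 ib_write_bw -d mlx5_0 --report_gbits -s 1048576

# Client side (other host)
numactl --cpunodebind=1 --membind=1 ib_write_bw -d mlx5_0 --report_gbits -s 1048576 <server-ip>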
A debug workflow
When something's wrong, this is roughly the sequence:
# 1. Is the hardware there?
lspci | grep -i mellanox
# 2. Is the driver loaded?
lsmod | grep mlx5
# 3. Is the link up?
ip -br link show
rdma link
# 4. Are there errors?
dmesg -T | tail -50
ethtool -S enp1s0 | grep -i error
# 5. Are PFC / DCQCN active?
ethtool -S enp1s0 | grep -E "(prio3_pause|cnp)"
# 6. Can RDMA actually move bytes?
ib_write_bw <peer>
If step 6 hits expected throughput, the host is healthy. If anything fails earlier, you've found the layer.
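If you run this sequence often, it's worth a tiny wrapper. A sketch that assumes mlx5 NICs and takes the RoCE interface and an optional peer as arguments (names and checks are illustrative, not a definitive health check):

#!/bin/bash
# Quick host health pass: hardware -> driver -> link -> errors -> QoS -> throughput
IF=${1:-enp1s0}
PEER=$2

lspci | grep -qi mellanox && echo "OK: NIC present" || echo "FAIL: no NIC in lspci"
lsmod | grep -q mlx5_core && echo "OK: driver loaded" || echo "FAIL: mlx5 not loaded"
ip -br link show "$IF" | grep -qw UP && echo "OK: $IF is UP" || echo "FAIL: $IF not UP"
dmesg -T | tail -50 | grep -iq error && echo "WARN: recent kernel errors" || echo "OK: no recent kernel errors"
# Non-zero pause / CNP counters are worth a look, not automatically fatal
ethtool -S "$IF" | grep -E "prio3_pause|cnp" | grep -v ": 0$"
[ -n "$PEER" ] && ib_write_bw -d mlx5_0 --report_gbits "$PEER"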
What you should remember
- The big 10: ip, ethtool, rdma, ibstat, lspci, numactl, dmesg, ss, nvidia-smi, perftest. Memorize them.
- Counters tell the story. ethtool -S produces hundreds; you only care about ~5 (PFC, drops, CNP, out-of-sequence).
- nvidia-smi topo -m is the source of truth for GPU↔NIC affinity. Get this right at deploy time.
- perftest is the proof-positive. If ib_write_bw doesn't hit ~95% of line rate, something is wrong.
- Debug bottom-up. Hardware → driver → link → protocol → application. Each layer's tools are different.
You're done with the Linux section. Head to Kubernetes for Network Engineers → for the k8s side, or jump to Cluster Build Guide for the practical build steps.