
RoCE v2 Operator Cheatsheet

Single-page reference card. Open it in a second tab while you work on a real RoCE host. No narrative — every section is a table of commands or a code block you can paste.

Your job here is to stop guessing and start typing. Each section answers one operational question: "what's installed", "how is the NIC configured", "what's the QoS chain look like", "is anything dropping". When you find an unfamiliar host, walk the sections top-to-bottom and you'll have a full picture in 5 minutes.

The 2-minute health check (section 15 below) runs in this order: hostname → RDMA devices → port states → PFC config → source rules → nvidia_peermem → link-down counters. If any line goes red, you know exactly where to start.

1. Quick box identity

You want to know...    Command
Hostname               hostname
OS + kernel            uname -r ; cat /etc/os-release | grep -E '^(NAME|VERSION)='
CPU model              lscpu | grep 'Model name'
Sockets + NUMA         lscpu | grep -E 'Socket|NUMA'
RAM                    free -h
Uptime                 uptime

2. NIC inventory

Task                               Command
List Mellanox/NVIDIA PCI devices   lspci | grep -i mellanox
List RDMA devices                  ibv_devices
RDMA link state                    rdma link show
IB-style port state + rate         ibstat | grep -E 'CA |State:|Rate:|Port '
Full RDMA device dump              ibv_devinfo -d mlx5_0 (or any device)
Netdev list                        ip -br link show
Per-NIC link + MTU                 ip link show dev enp1s0f0

vendor_part_id decode (ConnectX family):

4119 = CX-5
4125 = CX-6 Dx
4129 = CX-7
4131 = CX-8
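To avoid reading the decode table by eye, a small shell helper can do the lookup. This is a minimal sketch: the function name decode_part_id is hypothetical, and it knows only the four IDs listed above.

```shell
# Map a vendor_part_id (as printed by ibv_devinfo) to a ConnectX family name.
# decode_part_id is a hypothetical helper; it covers only the IDs in the table above.
decode_part_id() {
    case "$1" in
        4119) echo "CX-5" ;;
        4125) echo "CX-6 Dx" ;;
        4129) echo "CX-7" ;;
        4131) echo "CX-8" ;;
        *)    echo "unknown ($1)" ;;
    esac
}

# Example on a host with rdma-core installed:
#   decode_part_id "$(ibv_devinfo -d mlx5_0 | awk '/vendor_part_id/ {print $2}')"
```

Anything outside the table prints as unknown rather than being guessed.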

3. PCIe + NUMA + GPU topology

Task                     Command
NUMA topology            numactl -H
NUMA node per NIC        cat /sys/class/net/enp1s0f0/device/numa_node
Local CPUs per NIC       cat /sys/class/net/enp1s0f0/device/local_cpulist
PCIe link state per NIC  sudo lspci -vv -s 03:00.0 | grep -E 'LnkCap|LnkSta'
Full PCIe topology       lstopo --no-io --no-icaches
GPU inventory            nvidia-smi -L
GPU↔NIC matrix           nvidia-smi topo -m

PCIe speed table:

Gen3 = 8 GT/s, x16 = 128 Gbps
Gen4 = 16 GT/s, x16 = 256 Gbps
Gen5 = 32 GT/s, x16 = 512 Gbps ← H100 hosts
Gen6 = 64 GT/s, x16 = 1024 Gbps ← B200 / B300 hosts
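The table is just per-lane rate times lane count, so the same arithmetic works for a link that trained narrower than x16. A minimal sketch, assuming raw signalling rates only; pcie_raw_gbps is a hypothetical helper, and usable throughput is lower once encoding and protocol overhead are counted.

```shell
# Raw PCIe bandwidth in Gbps = per-lane rate (GT/s) x lane count.
# pcie_raw_gbps is a hypothetical helper: pcie_raw_gbps <generation> <lanes>
pcie_raw_gbps() {
    gen="$1"; lanes="$2"
    case "$gen" in
        3) gts=8 ;;
        4) gts=16 ;;
        5) gts=32 ;;
        6) gts=64 ;;
        *) echo "unsupported generation: $gen" >&2; return 1 ;;
    esac
    echo $(( gts * lanes ))
}
```

For example, pcie_raw_gbps 5 8 gives 256, which is why a Gen5 NIC that shows x8 in LnkSta cannot fill a 400G port.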

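GPU↔NIC topology legend (nvidia-smi topo -m), closest first:

PIX = at most one PCIe bridge between the devices (ideal for GPUDirect RDMA)
PXB = multiple PCIe bridges, without crossing a PCIe Host Bridge
PHB = path crosses a PCIe Host Bridge (typically the CPU)
NODE = same NUMA node, but crosses the interconnect between PCIe Host Bridges
SYS = cross-NUMA via UPI/Infinity Fabric; NEVER for RDMA
NVx = NVLink between GPUs only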

4. ibv_devinfo — every field decoded

Field                    Meaning
hca_id                   RDMA device name (e.g. mlx5_0)
transport: InfiniBand    Verbs API model; always this on Mellanox/NVIDIA, even for RoCE
link_layer: Ethernet     The line that confirms RoCE (vs InfiniBand for true IB)
fw_ver                   NIC firmware version
node_guid                Globally unique NIC ID (64-bit)
vendor_part_id           NIC chip ID (see decode table above)
phys_port_cnt            Physical ports on this NIC
state: PORT_ACTIVE (4)   Link is up; (1) = down
active_mtu: 4096 (5)     RDMA-level MTU (NOT Ethernet MTU)
sm_lid: 0                IB-only; always 0 on RoCE

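MTU gotcha: there are always two.

  • RDMA MTU 4096 = max RDMA payload per packet (NIC silicon; one of 256, 512, 1024, 2048, 4096)
  • Ethernet MTU 9000 = jumbo frame size (ip link)
  • Both must align: the Ethernet MTU has to be large enough to carry the RDMA MTU plus the RoCE headers. With a 1500-byte Ethernet MTU, active_mtu falls back to 1024. A mismatch along the path = drops at line rate.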
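The relationship between the two MTUs can be sketched as "largest RDMA MTU that still fits after headers". This is a hedged sketch: rdma_mtu_for_netdev is a hypothetical helper, and the 96-byte header allowance in ROCE_OVERHEAD is an assumption, not a number from this page. Compare the result against the active_mtu that ibv_devinfo actually reports.

```shell
# Largest RDMA MTU that fits in a given Ethernet MTU.
# ROCE_OVERHEAD is an ASSUMED header allowance in bytes; override it if your stack differs.
ROCE_OVERHEAD="${ROCE_OVERHEAD:-96}"

# rdma_mtu_for_netdev is a hypothetical helper: rdma_mtu_for_netdev <ethernet_mtu>
rdma_mtu_for_netdev() {
    budget=$(( $1 - ROCE_OVERHEAD ))
    for m in 4096 2048 1024 512 256; do
        if [ "$budget" -ge "$m" ]; then
            echo "$m"
            return 0
        fi
    done
    echo "Ethernet MTU $1 is too small for any RDMA MTU" >&2
    return 1
}

# Example on a live host:
#   rdma_mtu_for_netdev "$(cat /sys/class/net/enp1s0f0/mtu)"
```

If the printed value is lower than the active_mtu you expected, raise the Ethernet MTU on the NIC and on every switch port in the path.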

5. Driver + firmware stack

Task                                   Command
Loaded mlx5 modules                    lsmod | grep -E 'mlx5|ib_|rdma'
mlx5_core info                         modinfo mlx5_core | head -10
Driver + firmware per NIC              ethtool -i enp1s0f0
OFED flavor (vendor OFED installed?)   ofed_info -s 2>&1
OFED packages                          rpm -qa | grep -iE 'mlnx|mellanox|rdma|ibverbs' (or dpkg -l)
Module file path                       modinfo -F filename mlx5_core
/dev/infiniband entry points           ls -la /dev/infiniband/
/sys/class/infiniband devices          ls /sys/class/infiniband/

Path tells flavor:

/lib/modules/.../updates/... → vendor OFED override (DOCA / MLNX_OFED)
/lib/modules/.../kernel/... → inbox OFED (distro default)
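The two path rules above can be applied mechanically. A minimal sketch: ofed_flavor is a hypothetical helper that checks only those two patterns, so a module installed anywhere else is reported as unknown.

```shell
# Classify the mlx5_core module path as vendor OFED or inbox.
# ofed_flavor is a hypothetical helper: ofed_flavor <module_file_path>
ofed_flavor() {
    case "$1" in
        */updates/*) echo "vendor" ;;
        */kernel/*)  echo "inbox" ;;
        *)           echo "unknown" ;;
    esac
}

# Example on a live host:
#   ofed_flavor "$(modinfo -F filename mlx5_core)"
```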

The three mlx5 modules:

  • mlx5_core — PCI device owner (foundation)
  • mlx5_ib — RDMA personality (Verbs API)
  • mlx5_en — Ethernet personality (netdev)

GPUDirect bridge module:

lsmod | grep nvidia_peermem # must show loaded for GPUDirect RDMA

6. SR-IOV (PFs and VFs)

Task                 Command
Firmware max VFs     cat /sys/class/net/enp1s0f0/device/sriov_totalvfs
Active VFs now       cat /sys/class/net/enp1s0f0/device/sriov_numvfs
Spawn N VFs          echo N | sudo tee /sys/class/net/enp1s0f0/device/sriov_numvfs
Destroy all VFs      echo 0 | sudo tee /sys/class/net/enp1s0f0/device/sriov_numvfs
Firmware query       sudo mlxconfig -d /sys/bus/pci/devices/0000:03:00.0 query | grep -iE 'SRIOV_EN|NUM_OF_VFS|NUM_OF_PF'
List VFs in lspci    lspci -D | grep '0000:03:00\.'
VF state from PF     ip link show dev enp1s0f0
Assign VF MAC        sudo ip link set dev enp1s0f0 vf 0 mac 02:00:00:01:00:00
Assign VF IP         sudo ip addr add 10.0.0.10/27 dev enp1s0f0v0
IOMMU groups         for bdf in 03:00.0 03:00.1; do echo "$bdf: $(readlink /sys/bus/pci/devices/0000:$bdf/iommu_group | xargs basename)"; done

The lifecycle:

firmware enable → reboot → runtime spawn (echo N) → assign MAC/IP → workload binds

Inside a Kubernetes pod: VFs show as mlx5_0..N (or whatever the CNI / Multus names them). Standard pattern is one pod = one VF per PF.


7. GID table

Task                             Command
List all GIDs (vendor utility)   show_gids mlx5_0
Raw GID dump (any kernel)        for i in $(seq 0 7); do echo "[$i] $(cat /sys/class/infiniband/mlx5_0/ports/1/gids/$i)"; done
GID type per slot                cat /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/types/3

Typical GID slot layout (PF only):

[0] fe80::.... IB / RoCE v1 (legacy, don't use)
[1] fe80::.... RoCE v2 (IPv6 link-local)
[2] ::ffff:X.X.X.X IB / RoCE v1 (legacy)
[3] ::ffff:X.X.X.X RoCE v2 ← NCCL_IB_GID_INDEX=3

With VFs, GIDs continue in 4-slot blocks (v1+v2 × IPv6+IPv4).
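When show_gids is not installed, the same slot/type view can be assembled from sysfs. A minimal sketch, assuming the sysfs layout used in the table above; list_gids is a hypothetical helper that takes the port directory as its argument and skips empty (all-zero) slots.

```shell
# Print index, GID, and type for every populated GID slot under a port directory.
# list_gids is a hypothetical helper: list_gids /sys/class/infiniband/mlx5_0/ports/1
list_gids() {
    port_dir="$1"
    for f in "$port_dir"/gids/*; do
        [ -e "$f" ] || continue
        idx=$(basename "$f")
        gid=$(cat "$f" 2>/dev/null) || continue
        # Unused slots read as all zeros; skip them.
        if [ "$gid" = "0000:0000:0000:0000:0000:0000:0000:0000" ]; then
            continue
        fi
        type=$(cat "$port_dir/gid_attrs/types/$idx" 2>/dev/null || echo "?")
        echo "[$idx] $gid $type"
    done
}
```

Slots come out in glob order (0, 1, 10, 11, 2, ...), so pipe through sort if you want them numerically ordered. Raw sysfs prints the GID as full hex rather than the ::ffff:X.X.X.X form.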


8. Lossless host config (PFC + ECN + DSCP)

Task                                                      Command
Full QoS summary                                          sudo mlnx_qos -i enp1s0f0
Trust mode                                                cat /sys/class/net/enp1s0f0/qos/trust
Set trust DSCP                                            sudo mlnx_qos -i enp1s0f0 --trust dscp
DSCP → priority map                                       sudo mlnx_qos -i enp1s0f0 --dscp2prio set,26,3
Priority → TC map                                         sudo mlnx_qos -i enp1s0f0 --prio_tc 0,0,0,3,0,0,0,0
Enable PFC on priority 3                                  sudo mlnx_qos -i enp1s0f0 --pfc 0,0,0,1,0,0,0,0
Enable ECN NP+RP on prio 3                                echo 1 | sudo tee /sys/class/net/enp1s0f0/ecn/roce_np/enable/3 /sys/class/net/enp1s0f0/ecn/roce_rp/enable/3
Set ring sizes (max)                                      sudo ethtool -G enp1s0f0 rx 8192 tx 8192
Show ring sizes                                           ethtool -g enp1s0f0
Set egress traffic class for RDMA (DSCP 26 + ECT = 106)   echo 106 | sudo tee /sys/class/infiniband/mlx5_0/tc/1/traffic_class
DSCP capture (tos is printed only with -v)                sudo tcpdump -i enp1s0f0 -nn -e -v -c 5 | grep tos

The canonical chain — memorize this:

DSCP 26 → Priority 3 → Traffic Class 3 → Lossless Queue (PFC + ECN)

NCCL_IB_TC=106 math: the TOS byte is DSCP shifted left 2 bits, so DSCP 26 × 4 = 104. Adding 2 sets the two ECN bits to 0b10 (ECT(0), ECN-capable), which gives 106. TC=0 in NCCL = lossy queue = IBV_WC_RETRY_EXC_ERR under load.
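The same arithmetic as a one-line helper, in case your fabric uses a DSCP other than 26. A minimal sketch; dscp_to_tc is a hypothetical name, and the ECN field defaults to 2 (binary 10) to match the math above.

```shell
# Traffic-class / TOS byte = (DSCP << 2) | ECN field.
# dscp_to_tc is a hypothetical helper: dscp_to_tc <dscp> [ecn_bits]
dscp_to_tc() {
    dscp="$1"; ecn="${2:-2}"
    echo $(( (dscp << 2) | ecn ))
}
```

Calling dscp_to_tc 26 prints 106, the value used for NCCL_IB_TC on this page; dscp_to_tc 26 0 prints the bare 104.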

Typical switch-side buffer thresholds (priority 3, lossless queue):

Kmin = 1 MB (ECN marking starts)
Kmax = 5 MB (ECN marking at max probability ~10%)
XOFF = ~8 MB (PFC PAUSE fires — last-resort safety net)

Exact numbers depend on switch silicon, port speed, and cable RTT.


9. RDMA counters — where they live, what they mean

Path 1: RoCE-specific (/sys/class/infiniband/<dev>/ports/1/hw_counters/)

Counter                                                   What going up means
np_ecn_marked_roce_packets                                RX packets with CE bit set
np_cnp_sent                                               CNPs this NIC generated (acting as Notification Point)
rp_cnp_handled                                            CNPs reacted to (DCQCN engaged, slowing down)
rp_cnp_ignored                                            CNPs ignored; any non-zero = tuning bug
out_of_sequence                                           Reordering observed (adaptive routing / multipath)
packet_seq_err                                            Drops detected
req_rnr_retries_exceeded                                  QP died: Receiver Not Ready retry exhausted
req_transport_retries_exceeded                            QP died: transport retry exhausted
local_ack_timeout_err                                     ACK timer expired and the request was retried; the QP dies only once retries are exhausted
roce_slow_restart                                         DCQCN slow-restart events
out_of_buffer                                             App not posting receive WRs fast enough
rx_read_requests, rx_write_requests, rx_atomic_requests   Traffic stats

Path 2: Standard IB counters (/sys/class/infiniband/<dev>/ports/1/counters/)

Counter                               What going up means
port_xmit_data                        Data sent, in 4-byte words (multiply by 4 for bytes)
port_rcv_data                         Data received, in 4-byte words (multiply by 4 for bytes)
port_xmit_packets, port_rcv_packets   Packet counts
port_xmit_discards                    TX drops; check ring sizes
port_xmit_wait                        TX wait cycles; proxy for PFC pause time
link_downed                           Link flap count (>0 = bad optic/cable)
symbol_error                          Physical-layer errors
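Because port_xmit_data counts 4-byte words, turning two samples into a line-rate figure takes one multiplication. A minimal sketch; xmit_gbps is a hypothetical helper, and it assumes the counter did not wrap or reset between the two samples.

```shell
# Convert two port_xmit_data samples into Gbps.
# The counter is in 4-byte words, so bits = delta * 4 * 8 = delta * 32.
# xmit_gbps is a hypothetical helper: xmit_gbps <before> <after> <seconds>
xmit_gbps() {
    awk -v a="$1" -v b="$2" -v s="$3" \
        'BEGIN { printf "%.1f\n", (b - a) * 32 / s / 1e9 }'
}

# Example on a live host (10-second window):
#   c=/sys/class/infiniband/mlx5_0/ports/1/counters/port_xmit_data
#   a=$(cat "$c"); sleep 10; b=$(cat "$c"); xmit_gbps "$a" "$b" 10
```

The same helper works for port_rcv_data, since it uses the same units.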

Pre/post benchmark counter diff

The cleanest way to see what a single workload caused:

mkdir -p /tmp/pre /tmp/post
ctr=/sys/class/infiniband/mlx5_0/ports/1/hw_counters

# Snapshot before
for f in "$ctr"/*; do
    cp "$f" "/tmp/pre/$(basename "$f")"
done

# ... run benchmark ...

# Snapshot after
for f in "$ctr"/*; do
    cp "$f" "/tmp/post/$(basename "$f")"
done

# Print only the counters that increased
for f in /tmp/post/*; do
    name=$(basename "$f")
    pre=$(cat "/tmp/pre/$name")
    post=$(cat "$f")
    delta=$((post - pre))
    if [ "$delta" -gt 0 ]; then
        echo "$name: $delta"
    fi
done

10. Multi-rail Linux routing

Task                               Command
List ip rules                      ip rule show
Show table                         ip route show table 101
Test routing decision              ip route get 10.0.1.14 from 10.0.0.10
Add per-NIC table route            sudo ip route add 10.0.0.0/16 via 10.0.0.1 dev enp1s0f0 table 101
Add source-based rule              sudo ip rule add from 10.0.0.10 lookup 101 priority 10010
Check ARP sysctls                  for nic in all enp1s0f0 enp1s0f1; do echo "$nic: $(sysctl -n net.ipv4.conf.$nic.arp_ignore net.ipv4.conf.$nic.arp_announce net.ipv4.conf.$nic.rp_filter)"; done
Set multi-rail sysctls (use all)   sudo sysctl -w net.ipv4.conf.all.arp_ignore=2 net.ipv4.conf.all.arp_announce=1 net.ipv4.conf.all.rp_filter=2 net.ipv4.conf.all.accept_local=1
Source-bound ping (per rail)       ping -c 2 -I 10.0.0.10 10.0.1.14
Source-bound TCP test              nc -s 10.0.0.10 -zv 10.0.1.14 22
Capture ARP per NIC                sudo tcpdump -nn -e -i enp1s0f0 arp -c 10

The four pieces of multi-rail correctness — all four must be in place:

  1. Per-NIC routing tables (101 / 102 / 103 / 104)
  2. Source-based rules (from <ip> lookup <table>)
  3. ARP tuning (arp_ignore=2, arp_announce=1)
  4. Loose RPF (rp_filter=2)
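Pieces 1 and 2 repeat once per rail, so it helps to generate them instead of typing them four times. A minimal sketch that only prints the commands for review; rail_cmds is a hypothetical helper, and the numbering (table 101 + index, priority 10010 + index) simply follows the examples on this page.

```shell
# Print (do not run) the route and rule commands for one rail.
# rail_cmds is a hypothetical helper:
#   rail_cmds <rail_index> <nic> <local_ip> <gateway> <subnet>
rail_cmds() {
    idx="$1"; nic="$2"; src="$3"; gw="$4"; subnet="$5"
    table=$(( 101 + idx ))
    prio=$(( 10010 + idx ))
    echo "ip route add $subnet via $gw dev $nic table $table"
    echo "ip rule add from $src lookup $table priority $prio"
}

# Example: review the output, then pipe it to "sudo sh" once it looks right.
#   rail_cmds 0 enp1s0f0 10.0.0.10 10.0.0.1 10.0.0.0/16
```

Printing instead of executing keeps the sketch safe to paste on a production host; the addresses in the example are the placeholder values used elsewhere in this section.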

Sysctl all override: for arp_ignore, arp_announce, and rp_filter, the effective value = max(all, per-iface), so setting them on all is enough.
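The max rule is easy to check per interface. A minimal sketch; effective_conf is a hypothetical helper that takes the two values as arguments, so it can be fed from sysctl on a live host.

```shell
# Effective value of a max-combined sysctl on one interface:
# the larger of conf/all and conf/<iface>.
# effective_conf is a hypothetical helper: effective_conf <all_value> <iface_value>
effective_conf() {
    if [ "$1" -ge "$2" ]; then
        echo "$1"
    else
        echo "$2"
    fi
}

# Example on a live host:
#   effective_conf "$(sysctl -n net.ipv4.conf.all.rp_filter)" \
#                  "$(sysctl -n net.ipv4.conf.enp1s0f0.rp_filter)"
```

Note the consequence: a per-interface value higher than the one on all still wins, so a stray rp_filter=2 or arp_ignore=8 on one NIC is not overridden by all.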

Why source binding matters: without source rules, the kernel picks the egress NIC from the main table by destination longest-prefix match alone, which is usually the wrong rail on a multi-rail box. Apps must pin their source IP or device explicitly (--bind_source_ip, -I, SO_BINDTODEVICE, etc.).


11. perftest benchmarking

Task                                                                      Command
Install perftest                                                          sudo apt install perftest / sudo dnf install perftest
Server (single rail)                                                      nohup ib_send_bw -d mlx5_0 -R -F -D 60 > /tmp/s.log 2>&1 &
Client (peer host)                                                        ib_send_bw -d mlx5_0 -R -F -D 60 <server_ip>
Throughput (WRITE, faster; server runs the same flags, no <server_ip>)    ib_write_bw -d mlx5_0 -R -b -F -x 3 -q 16 -s 131072 -t 512 -D 30 --report_gbits --bind_source_ip=10.0.0.10 <server_ip>
Latency (small-msg ping-pong)                                             ib_send_lat -d mlx5_0 -R -F <server_ip>
GPUDirect (CUDA-built perftest)                                           ib_write_bw --use_cuda=0 -R -b -F -x 3 ... <server_ip>

Critical flag reference:

Flag               What it does                                                When you need it
-R                 Use RDMA-CM for connection setup (bypasses TCP routing)     Always on multi-rail hosts
-b                 Bidirectional                                               Production-realistic
-x 3               RoCEv2 IPv4 GID index                                       Always for RoCE v2
--bind_source_ip   Pin source IP on this rail                                  Required on multi-rail w/o main-table fallback
-q 16              16 QPs per pair                                             Optimal at 400G+
-s 131072          128 KB messages                                             NCCL-shaped traffic
-t 512             TX depth                                                    Deeper PCIe pipelining
-F                 Don't fail on CPU freq scaling warning                      Convenience
-D 60              60-second test duration                                     Stable averages

Expected results (rough — single-pair, healthy fabric):

Test                                    Per-NIC                Aggregate (per host)
H100, single-pair, single-rail          ~385 Gbps (96% wire)   -
H100, NCCL allreduce, 4-node            -                      ~195 GB/s busbw
B200/B300, single-PF alone              ~392 Gbps (98% wire)   -
B200/B300, all PFs concurrent (bidir)   350–375 Gbps each      ~5000 Gbps aggregate
B200/B300, NCCL NVLSTree allreduce      -                      ~880 GB/s busbw
Cross-rack ICMP latency                 0.13–0.5 ms            -
Cross-rack RDMA latency                 3–5 µs                 -

Bandwidth way below expected? Check in this order:

  1. PCIe link state (LnkSta) — Gen5/6 x16?
  2. NUMA pinning — taskset to the NIC's local CPUs?
  3. GID index — -x 3 for RoCEv2 IPv4?
  4. Ring sizes — ethtool -g, set 8192/8192
  5. PFC or congestion? Check port_xmit_wait (PFC pause) and np_cnp_sent (ECN congestion)
  6. Source binding — --bind_source_ip on multi-rail?

12. NCCL environment variables

Variable                                  Production value                                                    Why
NCCL_IB_HCA                               ^ib,^mlx5_1:1 (SR-IOV) or mlx5_0,mlx5_1,mlx5_2,mlx5_3 (hostnet)     Pin to RDMA devices
NCCL_IB_GID_INDEX                         3                                                                   RoCEv2 IPv4 GID
NCCL_IB_TC                                106                                                                 DSCP 26 + ECN ECT bits = lossless queue
NCCL_IB_SL                                5                                                                   Service Level mapped to TC 106
NCCL_CROSS_NIC                            0                                                                   Each rail handles its own traffic
NCCL_IB_QPS_PER_CONNECTION                16                                                                  Spread QPs for full bandwidth
NCCL_MIN_NCHANNELS / NCCL_MAX_NCHANNELS   16                                                                  Match QP count
NCCL_IB_PCI_RELAXED_ORDERING              1                                                                   H100 perf (~10% gain)
NCCL_IB_TIMEOUT                           22                                                                  Longer than default for PFC tolerance
NCCL_IB_RETRY_CNT                         12                                                                  Transport retry count (not the RNR retry); the verbs retry_cnt field is 3 bits wide, so confirm that values above 7 take effect
NCCL_NET_GDR_LEVEL                        PHB                                                                 Max GPU↔NIC distance for GPUDirect RDMA
NCCL_SOCKET_IFNAME                        bond0                                                               Bootstrap on control plane (NOT the RDMA NICs)
NCCL_DEBUG                                INFO (first run) / unset (prod)                                     Verbose logging
NCCL_IB_ADAPTIVE_ROUTING                  1                                                                   Adaptive-routing-capable fabrics
NCCL_IB_SPLIT_DATA_ON_QPS                 1                                                                   Multi-QP load distribution

Pitfall — bootstrap interface: NCCL_SOCKET_IFNAME must point at a non-RDMA interface (typically a TCP bond). If you accidentally point it at an RDMA NIC, bootstrap will work but you'll lose throughput because some bookkeeping traffic competes with the data plane.


13. Pre-flight checklist for a new RoCE host

Run all these on a freshly-imaged host before declaring it ready for production workloads:

Hardware

□ nvidia-smi -L → all 8 GPUs visible
□ ibv_devices → all backend NICs visible
□ ibstat | grep 'State:' → all backend NICs State: Active
□ lspci -vv ... LnkSta → Gen5 x16 (H100) or Gen6 x16 (B200/B300)

Drivers

□ lsmod | grep mlx5_core → loaded
□ lsmod | grep mlx5_ib → loaded
□ lsmod | grep nvidia_peermem → loaded (GPUDirect RDMA ready)
□ ethtool -i <nic> → driver + firmware same on all NICs
□ ofed_info -s → OFED version printed (if vendor OFED used)

Lossless config

□ mlnx_qos -i <nic> | grep 'trust state' → dscp
□ mlnx_qos -i <nic> | grep -A2 'PFC config' → enabled 0 0 0 1 0 0 0 0
□ cat /sys/class/net/<nic>/ecn/roce_np/enable/3 → 1
□ cat /sys/class/net/<nic>/ecn/roce_rp/enable/3 → 1
□ ethtool -g <nic> → 8192/8192

Multi-rail routing

□ ip rule show | grep '10010\|10011\|10012\|10013' → 4 rules present
□ ip route show table 101 | grep <subnet> → table populated
□ sysctl net.ipv4.conf.all.arp_ignore → 2
□ sysctl net.ipv4.conf.all.arp_announce → 1
□ sysctl net.ipv4.conf.all.rp_filter → 2
□ ip route get <peer_ip> from <local_ip> → via correct gateway, correct dev

Counter baseline

□ All hw_counters and counters at 0 (no traffic yet)
□ link_downed = 0 (no flaps since boot)
□ symbol_error = 0 (clean physical layer)

Cross-host (with peer)

□ ping -c 3 -I <local_rail_ip> <peer_rail_ip> → 0% loss
□ Repeat for all rails
□ ib_send_bw -d <nic> -R -F -D 10 <peer> → line rate

14. Troubleshooting recipes

"RDMA bandwidth way below expected"

  1. lspci -vv ... LnkSta → confirm Gen5/Gen6 x16
  2. numactl --hardware and match process to the NIC's NUMA node
  3. Counter pre/post diff → look for np_cnp_sent (congestion) or port_xmit_wait (PFC)
  4. Try -q 16 -s 131072 -t 512 instead of defaults
  5. Verify NCCL_IB_GID_INDEX=3

"ib_send_bw hangs or connection refused on multi-rail host"

Source binding is required. Use --bind_source_ip=<local_rail_ip> and -R. This is by design — control plane and data plane are routed separately on multi-rail hosts.

"Pod stuck Pending — nvidia.com/roce_hca_*: 0"

  1. Check SR-IOV operator status: kubectl get sriovnetworknodestates -n sriov-network-operator → syncStatus: Succeeded?
  2. Confirm node feature labels for RDMA / transport type
  3. Confirm VF count: cat /sys/class/net/<nic>/device/sriov_numvfs → greater than 0

"IBV_WC_RETRY_EXC_ERR after a few minutes of running"

Two common causes:

  1. TC=0 lossy queue — set NCCL_IB_TC=106 and NCCL_IB_SL=5
  2. ARP ambiguity between co-located VFs on the same subnet — fix with arp_announce=1, arp_ignore=2

"Cross-rack latency above 10 µs"

  1. Confirm cross-rack traffic actually traverses the spine (TTL decode: 64 - TTL = hops)
  2. Check NUMA pinning of the process
  3. CPU governor → performance (sudo cpupower frequency-set -g performance)
  4. Check for PFC pause events on intermediate switches
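The TTL decode in step 1 is a single subtraction. A minimal sketch; ttl_hops is a hypothetical helper, and the default initial TTL of 64 assumes the sender is a Linux host that has not changed it.

```shell
# Router hops = initial TTL - observed TTL.
# ttl_hops is a hypothetical helper: ttl_hops <observed_ttl> [initial_ttl]
# The initial TTL defaults to 64, which is an assumption about the sending host.
ttl_hops() {
    echo $(( ${2:-64} - $1 ))
}

# Example on a live host:
#   ttl_hops "$(ping -c 1 -I <local_rail_ip> <peer_rail_ip> | grep -o 'ttl=[0-9]*' | cut -d= -f2)"
```

A result of 0 means the peer is on the same L2 segment, so the traffic never reached a routed leaf or spine.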

"All NICs show identical config but one rail is slow"

Check switch-side per-port counters. Most likely:

  1. Bad optic / cable on that rail
  2. PFC firing on that switch port specifically
  3. ECMP hash polarization sending too much traffic over that path

"Counters frozen at 0 during a benchmark"

  1. nvidia-smi dmon -s t → if rxpci/txpci show 0 MB/s during a GPUDirect run, that's correct (data bypasses host PCIe)
  2. Check port_xmit_data on the actual sending NIC, not the GPU
  3. Confirm the workload is sending — tcpdump on the NIC

"rp_cnp_ignored > 0"

DCQCN tuning bug. NIC is receiving CNPs but not reacting. Check:

  • cat /sys/class/net/<nic>/ecn/roce_rp/enable/3 == 1
  • Firmware version supports DCQCN on this priority

"Job hangs partway through with no clear error"

Likely PFC deadlock or a buffer-credit issue at the switch. From the host side:

  1. Pull port_xmit_wait deltas across all rails — large skew = one rail is being paused much more
  2. Check switch-side PFC counters at the leaf
  3. If one rail is stuck paused, drain the misbehaving downstream consumer

15. The two-minute health check

When you SSH to a host and want the fastest possible "is this box healthy?" — paste this:

echo "=== HOSTNAME ===" ; hostname
echo "=== NICs ===" ; ibv_devices
echo "=== Link states ===" ; ibstat | grep -E "CA |State:"
echo "=== PFC + ECN ===" ; sudo mlnx_qos -i enp1s0f0 | grep -A2 "PFC config"
echo "=== Source rules ===" ; ip rule show | grep -E "100[0-9][0-9]"
echo "=== nvidia_peermem ===" ; lsmod | grep nvidia_peermem
echo "=== Link errors (should be 0) ===" ; \
for n in mlx5_0 mlx5_1 mlx5_2 mlx5_3; do \
    echo "$n link_downed: $(cat /sys/class/infiniband/$n/ports/1/counters/link_downed 2>/dev/null)"; \
done

Green = all NICs State: Active, PFC enabled on priority 3, source rules present, nvidia_peermem loaded, link_downed=0 across the board.

If anything is red, walk back through the corresponding section above before running workloads.


What you should bookmark

If you remember nothing else, these are the snippets worth pinning:

  1. Confirm RoCE is on the wire

    ibv_devinfo -d mlx5_0 | grep -E 'link_layer|active_mtu|state'
  2. The two-minute health check (section 15 above) — paste it on every fresh host.

  3. NCCL production env block

    export NCCL_IB_GID_INDEX=3
    export NCCL_IB_TC=106
    export NCCL_IB_SL=5
    export NCCL_IB_QPS_PER_CONNECTION=16
    export NCCL_CROSS_NIC=0
    export NCCL_SOCKET_IFNAME=bond0
  4. Counter pre/post diff (section 9) — the cleanest way to attribute symptoms to a workload.

  5. Lossless config one-liner

    sudo mlnx_qos -i enp1s0f0 --trust dscp --pfc 0,0,0,1,0,0,0,0 --prio_tc 0,0,0,3,0,0,0,0
    echo 1 | sudo tee /sys/class/net/enp1s0f0/ecn/roce_np/enable/3 \
    /sys/class/net/enp1s0f0/ecn/roce_rp/enable/3
    sudo ethtool -G enp1s0f0 rx 8192 tx 8192
  6. Source-based routing for one rail

    sudo ip route add 10.0.0.0/16 via 10.0.0.1 dev enp1s0f0 table 101
    sudo ip rule add from 10.0.0.10 lookup 101 priority 10010
    sudo sysctl -w net.ipv4.conf.all.arp_ignore=2 \
    net.ipv4.conf.all.arp_announce=1 \
    net.ipv4.conf.all.rp_filter=2
  7. Throughput sanity test

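    # server (same test and flags as the client, minus the peer address)
    ib_write_bw -d mlx5_0 -R -b -F -x 3 -q 16 -s 131072 -t 512 -D 30 --report_gbits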
    # client
    ib_write_bw -d mlx5_0 -R -b -F -x 3 -q 16 -s 131072 -t 512 -D 30 \
    --report_gbits --bind_source_ip=10.0.0.10 <server_ip>
  8. The canonical lossless chain

    DSCP 26 → Priority 3 → TC 3 → Lossless Queue (PFC + ECN)
  9. Counters that mean "stop, investigate now"

    rp_cnp_ignored > 0 → DCQCN tuning bug
    req_rnr_retries_exceeded > 0 → QP died
    req_transport_retries_exceeded > 0 → QP died
    link_downed > 0 → bad optic/cable
  10. What confirms RoCE v2 specifically (vs IB or RoCE v1):

    • link_layer: Ethernet in ibv_devinfo
    • GID slot 3 = ::ffff:<your_ipv4> with type RoCE v2
    • tcpdump 'udp port 4791' captures traffic at line rate during a benchmark

That's the whole single-page reference. Walk the sections top-to-bottom on an unfamiliar host and you'll have a full operational picture in 5 minutes. Walk the troubleshooting recipes when something's wrong and you'll usually find the answer before you have to escalate.