RoCE v2 Operator Cheatsheet

Single-page reference card. Open it in a second tab while you work on a real RoCE host. No narrative — every section is a table of commands or a code block you can paste.

Your job here is to stop guessing and start typing. Each section answers one operational question: "what's installed", "how is the NIC configured", "what does the QoS chain look like", "is anything dropping". When you land on an unfamiliar host, walk the sections top to bottom and you'll have a full picture in about 5 minutes.

The 2-minute health check (section 15 below) runs in this order: hostname → RDMA devices → port states → PFC config → source rules → nvidia_peermem → link-down counters. If any line goes red, you know exactly where to start. A recorded run showing every command, counter, and output is in module production-operations, Lab 3.

This is a reference card, not a lesson

The why behind lossless config, RDMA counters, and multi-rail routing lives in Phase 5 — §12.4 Host-Side Lossless and §12.5 Multi-Rail Source Routing. This page assumes you've read them and just need the commands.

1. Quick box identity

You want to know...	Command
Hostname	`hostname`
OS + kernel	`uname -r ; cat /etc/os-release \| grep -E '^(NAME\|VERSION)='`
CPU model	`lscpu \| grep 'Model name'`
Sockets + NUMA	`lscpu \| grep -E 'Socket\|NUMA'`
RAM	`free -h`
Uptime	`uptime`

2. NIC inventory

Task	Command
List RDMA NIC PCI devices (any vendor)	`lspci \| grep -iE 'mellanox\|broadcom\|intel'`
List RDMA devices	`ibv_devices`
RDMA link state	`rdma link show`
IB-style port state + rate	`ibstat \| grep -E 'CA \|State:\|Rate:\|Port '`
Full RDMA device dump	`ibv_devinfo -d mlx5_0` (or any device)
Netdev list	`ip -br link show`
Per-NIC link + MTU	`ip link show dev enp1s0f0`

vendor_part_id decode (ConnectX family):

4119 = CX-5     4125 = CX-6 Dx    4129 = CX-7    4131 = CX-8

3. PCIe + NUMA + GPU topology

Task	Command
NUMA topology	`numactl -H`
NUMA node per NIC	`cat /sys/class/net/enp1s0f0/device/numa_node`
Local CPUs per NIC	`cat /sys/class/net/enp1s0f0/device/local_cpulist`
PCIe link state per NIC	`sudo lspci -vv -s 03:00.0 \| grep -E 'LnkCap\|LnkSta'`
Full PCIe topology	`lstopo --no-io --no-icaches`
GPU inventory	`nvidia-smi -L` (AMD: `rocm-smi -L`)
GPU↔NIC matrix	`nvidia-smi topo -m` (AMD: `rocm-smi --showtopo`)

PCIe speed table (raw signalling rate per direction, before encoding overhead):

Gen3 =  8 GT/s,   x16 = 128 Gbps
Gen4 = 16 GT/s,   x16 = 256 Gbps
Gen5 = 32 GT/s,   x16 = 512 Gbps    ← H100 hosts
Gen6 = 64 GT/s,   x16 = 1024 Gbps   ← B200 / B300 hosts
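
The figures in this table are the per-lane rate in GT/s multiplied by the lane count. As a rough sketch of that arithmetic, the helper below parses a `LnkSta` line in the shape `lspci -vv` prints and returns the raw figure. The function name `lnksta_raw_gbps` is hypothetical, and the result is a raw number, not usable throughput.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn an lspci "LnkSta" line into raw Gbps (GT/s x lanes).
# Raw figure only; real throughput is lower after encoding and protocol overhead.
lnksta_raw_gbps() {
  local line="$1" speed width
  # Speed looks like "32GT/s" (or "2.5GT/s" on Gen1); width looks like "x16".
  speed=$(printf '%s\n' "$line" | sed -n 's/.*Speed \([0-9.]*\)GT\/s.*/\1/p')
  width=$(printf '%s\n' "$line" | sed -n 's/.*Width x\([0-9]*\).*/\1/p')
  [ -n "$speed" ] && [ -n "$width" ] || return 1
  awk -v s="$speed" -v w="$width" 'BEGIN { printf "%d\n", s * w }'
}

lnksta_raw_gbps "LnkSta: Speed 32GT/s, Width x16"   # Gen5 x16
```

A NIC that trained at a lower speed or narrower width than its `LnkCap` will show up here as a smaller number than the table row for its generation.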

GPU↔NIC topology legend (nvidia-smi topo -m):

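PIX  = at most one PCIe bridge between them   (ideal for GPUDirect RDMA)
PXB  = multiple PCIe bridges, no host bridge crossed
PHB  = path crosses a PCIe Host Bridge (typically the CPU)
NODE = path crosses PCIe plus the interconnect between host bridges inside one NUMA node
SYS  = cross-NUMA via UPI/Infinity Fabric: NEVER for RDMA
NVx  = NVLink between GPUs only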

4. `ibv_devinfo` — every field decoded

Field	Meaning
`hca_id`	RDMA device name (e.g. `mlx5_0`)
`transport: InfiniBand`	Verbs API model — always this on Mellanox/NVIDIA, even for RoCE
`link_layer: Ethernet`	The line that confirms RoCE (vs `InfiniBand` for true IB)
`fw_ver`	NIC firmware version
`node_guid`	Globally unique NIC ID (64-bit)
`vendor_part_id`	NIC chip ID (see decode table above)
`phys_port_cnt`	Physical ports on this NIC
`state: PORT_ACTIVE (4)`	Link is up; `(1)` = down
`active_mtu: 4096 (5)`	RDMA-level MTU (NOT Ethernet MTU)
`sm_lid: 0`	IB-only; always 0 on RoCE

MTU gotcha — there are always two:

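RDMA MTU 4096 = max RDMA payload per packet (NIC silicon; one of 256/512/1024/2048/4096)
Ethernet MTU 9000 = jumbo frame size (ip link)
The RDMA MTU plus the RoCE headers must fit inside the Ethernet MTU on every hop. A too-small Ethernet MTU forces a lower active_mtu; a mismatch along the path shows up as drops at line rate.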
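
To make the relationship between the two MTUs concrete, here is a minimal sketch that picks the largest RDMA MTU that still fits inside a given Ethernet MTU. The helper name `rdma_mtu_for` is hypothetical, and the default headroom of 96 bytes reserved for RoCE headers is an assumption; pass a different second argument if your stack reserves a different amount.

```shell
#!/usr/bin/env bash
# Hypothetical helper: largest RDMA MTU (256..4096) that fits inside an Ethernet MTU.
# The headroom default (96 bytes for headers) is an assumption, not a measured value.
rdma_mtu_for() {
  local eth_mtu="$1" headroom="${2:-96}" best=0 m
  for m in 256 512 1024 2048 4096; do
    if [ $((m + headroom)) -le "$eth_mtu" ]; then best=$m; fi
  done
  [ "$best" -gt 0 ] || return 1
  echo "$best"
}

rdma_mtu_for 9000   # jumbo frames
rdma_mtu_for 1500   # default Ethernet MTU
```

Compare the result with `active_mtu` from `ibv_devinfo`; if the NIC reports less than you expect, check the netdev MTU first.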

5. Driver + firmware stack

Task	Command
Loaded mlx5 modules	`lsmod \| grep -E 'mlx5\|ib_\|rdma'`
`mlx5_core` info	`modinfo mlx5_core \| head -10`
Driver + firmware per NIC	`ethtool -i enp1s0f0`
OFED flavor (vendor OFED installed?)	`ofed_info -s 2>&1`
OFED packages	`rpm -qa \| grep -iE 'mlnx\|mellanox\|rdma\|ibverbs'` (or `dpkg -l`)
Module file path	`modinfo -F filename mlx5_core`
`/dev/infiniband` entry points	`ls -la /dev/infiniband/`
`/sys/class/infiniband` devices	`ls /sys/class/infiniband/`

Path tells flavor:

/lib/modules/.../updates/...  → vendor OFED override (DOCA / MLNX_OFED)
/lib/modules/.../kernel/...   → inbox OFED (distro default)

The three mlx5 modules:

mlx5_core — PCI device owner (foundation)
mlx5_ib — RDMA personality (Verbs API)
mlx5_en — Ethernet personality (netdev)

GPUDirect bridge module:

lsmod | grep nvidia_peermem   # must show loaded for GPUDirect RDMA

6. SR-IOV (PFs and VFs)

Task	Command
Firmware max VFs	`cat /sys/class/net/enp1s0f0/device/sriov_totalvfs`
Active VFs now	`cat /sys/class/net/enp1s0f0/device/sriov_numvfs`
Spawn N VFs	`echo N \| sudo tee /sys/class/net/enp1s0f0/device/sriov_numvfs`
Destroy all VFs	`echo 0 \| sudo tee /sys/class/net/enp1s0f0/device/sriov_numvfs`
Firmware query	`sudo mlxconfig -d /sys/bus/pci/devices/0000:03:00.0 query \| grep -iE 'SRIOV_EN\|NUM_OF_VFS\|NUM_OF_PF'`
List VFs in lspci	`lspci -D \| grep '0000:03:00\.'`
VF state from PF	`ip link show dev enp1s0f0`
Assign VF MAC	`sudo ip link set dev enp1s0f0 vf 0 mac 02:00:00:01:00:00`
Assign VF IP	`sudo ip addr add 10.0.0.10/27 dev enp1s0f0v0`
IOMMU groups	`for bdf in 03:00.0 03:00.1; do echo "$bdf: $(readlink /sys/bus/pci/devices/0000:$bdf/iommu_group \| xargs basename)"; done`

The lifecycle:

firmware enable  →  reboot  →  runtime spawn (echo N)  →  assign MAC/IP  →  workload binds

Inside a Kubernetes pod: VFs show as mlx5_0..N (or whatever the CNI / Multus names them). Standard pattern is one pod = one VF per PF.

7. GID table

Task	Command
List all GIDs (vendor utility)	`show_gids mlx5_0`
Raw GID dump (any kernel)	`for i in $(seq 0 7); do echo "[$i] $(cat /sys/class/infiniband/mlx5_0/ports/1/gids/$i)"; done`
GID type per slot	`cat /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/types/3`

Typical GID slot layout (PF only):

[0]  fe80::....         IB / RoCE v1   (legacy, don't use)
[1]  fe80::....         RoCE v2        (IPv6 link-local)
[2]  ::ffff:X.X.X.X     IB / RoCE v1   (legacy)
[3]  ::ffff:X.X.X.X     RoCE v2        ← NCCL_IB_GID_INDEX=3

With VFs, GIDs continue in 4-slot blocks (v1+v2 × IPv6+IPv4).
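
Because the slot layout shifts once VFs or extra addresses exist, it can be safer to look the index up than to assume 3. The sketch below walks the same sysfs files used in the table above and prints the first slot whose type is `RoCE v2` and whose GID is IPv4-mapped. The helper name `find_rocev2_ipv4_gid` is hypothetical, and the match on the fully expanded `0000:...:ffff:` prefix assumes sysfs prints GIDs as eight 4-digit groups.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the lowest GID index that is RoCE v2 + IPv4-mapped.
# $1 = port directory, e.g. /sys/class/infiniband/mlx5_0/ports/1
find_rocev2_ipv4_gid() {
  local port_dir="$1" idx type gid
  for idx in $(ls "$port_dir/gid_attrs/types" 2>/dev/null | sort -n); do
    # Reading an unpopulated slot can fail; skip those.
    type=$(cat "$port_dir/gid_attrs/types/$idx" 2>/dev/null) || continue
    gid=$(cat "$port_dir/gids/$idx" 2>/dev/null) || continue
    [ "$type" = "RoCE v2" ] || continue
    case "$gid" in
      0000:0000:0000:0000:0000:ffff:*) echo "$idx"; return 0 ;;
    esac
  done
  return 1
}

# Example (on a real host):
#   export NCCL_IB_GID_INDEX=$(find_rocev2_ipv4_gid /sys/class/infiniband/mlx5_0/ports/1)
```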

8. Lossless host config (PFC + ECN + DSCP)

Commands only — the DSCP → priority → TC chain, the trust-mode trap, and the counter-verified walk-through are taught in §12.4 Host-Side Lossless.

Task	Command
Full QoS summary	`sudo mlnx_qos -i enp1s0f0`
Trust mode	`cat /sys/class/net/enp1s0f0/qos/trust`
Set trust DSCP	`sudo mlnx_qos -i enp1s0f0 --trust dscp`
DSCP → priority map	`sudo mlnx_qos -i enp1s0f0 --dscp2prio set,26,3`
Priority → TC map	`sudo mlnx_qos -i enp1s0f0 --prio_tc 0,0,0,3,0,0,0,0`
Enable PFC on priority 3	`sudo mlnx_qos -i enp1s0f0 --pfc 0,0,0,1,0,0,0,0`
Enable ECN NP+RP on prio 3	`echo 1 \| sudo tee /sys/class/net/enp1s0f0/ecn/roce_np/enable/3 /sys/class/net/enp1s0f0/ecn/roce_rp/enable/3`
Set ring sizes (max)	`sudo ethtool -G enp1s0f0 rx 8192 tx 8192`
Show ring sizes	`ethtool -g enp1s0f0`
Set egress ToS for RDMA-CM QPs (DSCP 26 → ToS 106)	`echo 106 \| sudo tee /sys/class/infiniband/mlx5_0/tc/1/traffic_class`
DSCP capture (`-v` is needed to print the tos field)	`sudo tcpdump -i enp1s0f0 -nn -e -v -c 5 \| grep tos`

The canonical chain — memorize this:

DSCP 26  →  Priority 3  →  Traffic Class 3  →  Lossless Queue (PFC + ECN)

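NCCL_IB_TC=106 math: the TOS byte is the DSCP shifted left 2 bits, so DSCP 26 × 4 = 104; setting the ECT bits (binary 10 = 2) in the low two ECN bits gives 106. Leaving NCCL_IB_TC at 0 puts traffic in the lossy queue, which surfaces as IBV_WC_RETRY_EXC_ERR under load.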
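
The same arithmetic as a small shell sketch, useful when a site uses a DSCP other than 26. Both function names are hypothetical; the second argument to `dscp_to_tos` is the 2-bit ECN field (2 for the ECT value used above, 0 for none).

```shell
#!/usr/bin/env bash
# TOS byte = (DSCP << 2) | ECN bits. Helper names are illustrative only.
dscp_to_tos() { echo $(( ($1 << 2) | ${2:-0} )); }
tos_to_dscp() { echo $(( $1 >> 2 )); }

dscp_to_tos 26 2   # DSCP 26 with the ECT bits set
tos_to_dscp 106    # back to the DSCP value
```

Decoding a `tos 0x6a` field from a packet capture works the same way: 0x6a is 106, and `tos_to_dscp 106` gives 26.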

Typical switch-side buffer thresholds (priority 3, lossless queue):

Kmin  =  1 MB    (ECN marking starts)
Kmax  =  5 MB    (ECN marking at max probability ~10%)
XOFF  =  ~8 MB   (PFC PAUSE fires — last-resort safety net)

Exact numbers depend on switch silicon, port speed, and cable RTT.

9. RDMA counters — where they live, what they mean

The full counter reference card with per-counter interpretation is in §12.4 Host-Side Lossless. Quick lookup below.

Path 1: RoCE-specific (`/sys/class/infiniband/<dev>/ports/1/hw_counters/`)

Counter	What going up means
`np_ecn_marked_roce_packets`	RX packets with CE bit set
`np_cnp_sent`	CNPs this NIC generated (acting as Notification Point)
`rp_cnp_handled`	CNPs reacted to (DCQCN engaged, slowing down)
`rp_cnp_ignored`	CNPs ignored — any non-zero = tuning bug
`out_of_sequence`	Reordering observed (adaptive routing / multipath)
`packet_seq_err`	Drops detected
`req_rnr_retries_exceeded`	QP died — Receiver Not Ready retry exhausted
`req_transport_retries_exceeded`	QP died — transport retry exhausted
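`local_ack_timeout_err`	Sender ACK timer expired and a retransmit was triggered (recoverable; the QP dies only once `req_transport_retries_exceeded` increments)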
`roce_slow_restart`	DCQCN slow-restart events
`out_of_buffer`	App not posting receive WRs fast enough
`rx_read_requests`, `rx_write_requests`, `rx_atomic_requests`	Traffic stats

Path 2: Standard IB counters (`/sys/class/infiniband/<dev>/ports/1/counters/`)

Counter	What going up means
`port_xmit_data`	Data sent, counted in 4-byte words (multiply by 4 for bytes)
`port_rcv_data`	Data received, counted in 4-byte words (multiply by 4 for bytes)
`port_xmit_packets`, `port_rcv_packets`	Packet counts
`port_xmit_discards`	TX drops — check ring sizes
`port_xmit_wait`	TX wait cycles — proxy for PFC pause time
`link_downed`	Link flap count (>0 = bad optic/cable)
`symbol_error`	Physical-layer errors
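
Since `port_xmit_data` and `port_rcv_data` count in units of octets/4, turning two samples into a rate takes a multiply by 4 (bytes) and by 8 (bits). A minimal sketch, with a hypothetical helper name and an integer sampling interval in seconds:

```shell
#!/usr/bin/env bash
# Hypothetical helper: average Gbps from two port_xmit_data / port_rcv_data samples.
# Assumes the counter unit is 4-byte words, as listed in the table above.
words_to_gbps() {
  local before="$1" after="$2" seconds="$3"
  [ "$seconds" -gt 0 ] || return 1
  awk -v b="$before" -v a="$after" -v s="$seconds" \
    'BEGIN { printf "%.2f\n", (a - b) * 4 * 8 / s / 1e9 }'
}

words_to_gbps 0 12500000000 1   # 12.5e9 words in one second
```

On a live host you would read the counter, sleep for the interval, read it again, and pass the two values in.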

Pre/post benchmark counter diff

Snapshot hw_counters/ before and after a run, then print the deltas — the cleanest way to see what a single workload caused. The full annotated script is in §12.4 Host-Side Lossless § Pre/post-test counter diff.
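
For quick use without the annotated script, the snapshot-and-diff idea can be sketched as below. This is not the §12.4 script, only a minimal stand-in; the function names and the `/tmp` file names in the usage comment are hypothetical.

```shell
#!/usr/bin/env bash
# snapshot_counters DIR : print one "name value" line per counter file in DIR.
snapshot_counters() {
  local dir="$1" f
  for f in "$dir"/*; do
    [ -f "$f" ] || continue
    printf '%s %s\n' "${f##*/}" "$(cat "$f" 2>/dev/null)"
  done
}

# diff_counters BEFORE AFTER : print only the counters whose value changed.
diff_counters() {
  awk 'NR == FNR { before[$1] = $2; next }
       ($1 in before) && ($2 != before[$1]) {
         printf "%s %+d\n", $1, $2 - before[$1]
       }' "$1" "$2"
}

# Usage on a real host:
#   C=/sys/class/infiniband/mlx5_0/ports/1/hw_counters
#   snapshot_counters "$C" > /tmp/before ; <run benchmark> ; snapshot_counters "$C" > /tmp/after
#   diff_counters /tmp/before /tmp/after
```

Unchanged counters are suppressed on purpose, so an empty result means the workload moved nothing in that directory.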

10. Multi-rail Linux routing

Why any of this is needed — ARP flux, rp_filter, source-based routing tables — is taught in §12.5 Multi-Rail Source Routing. Commands and the checklist below.

Task	Command
List ip rules	`ip rule show`
Show table	`ip route show table 101`
Test routing decision	`ip route get 10.0.1.14 from 10.0.0.10`
Add per-NIC table route	`sudo ip route add 10.0.0.0/16 via 10.0.0.1 dev enp1s0f0 table 101`
Add source-based rule	`sudo ip rule add from 10.0.0.10 lookup 101 priority 10010`
Check ARP sysctls	`for nic in all enp1s0f0 enp1s0f1; do echo "$nic: $(sysctl -n net.ipv4.conf.$nic.arp_ignore net.ipv4.conf.$nic.arp_announce net.ipv4.conf.$nic.rp_filter)"; done`
Set multi-rail sysctls (use `all`)	`sudo sysctl -w net.ipv4.conf.all.arp_ignore=2 net.ipv4.conf.all.arp_announce=1 net.ipv4.conf.all.rp_filter=2 net.ipv4.conf.all.accept_local=1`
Source-bound ping (per rail)	`ping -c 2 -I 10.0.0.10 10.0.1.14`
Source-bound TCP test	`nc -s 10.0.0.10 -zv 10.0.1.14 22`
Capture ARP per NIC	`sudo tcpdump -nn -e -i enp1s0f0 arp -c 10`

The four pieces of multi-rail correctness — all four must be in place:

Per-NIC routing tables (101 / 102 / 103 / 104)
Source-based rules (from <ip> lookup <table>)
ARP tuning (arp_ignore=2, arp_announce=1)
Loose RPF (rp_filter=2)

Sysctl all override: for these settings the effective value is max(all, per-iface), so setting them on all is enough.

Source binding: without source rules the kernel chooses the egress NIC from the main table by longest-prefix match, without regard to which rail owns the source IP, and that choice is usually wrong on a multi-rail box. Apps must therefore pin their source IP (--bind_source_ip, -I, SO_BINDTODEVICE). Full explanation: §12.5 Multi-Rail Source Routing.
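
To repeat the route and rule pair for every rail, a small generator helps avoid numbering mistakes. The sketch below only prints the commands (it does not run them) and assumes the numbering used on this page: table 101 and priority 10010 for the first rail, counting up from there. The helper name `rail_cmds` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print (not run) the route + rule commands for one rail.
# Args: rail index (0-based), netdev, local IP, gateway, destination prefix.
rail_cmds() {
  local idx="$1" nic="$2" ip="$3" gw="$4" prefix="$5"
  local table=$((101 + idx)) prio=$((10010 + idx))
  echo "ip route add $prefix via $gw dev $nic table $table"
  echo "ip rule add from $ip lookup $table priority $prio"
}

rail_cmds 0 enp1s0f0 10.0.0.10 10.0.0.1 10.0.0.0/16
```

Review the printed lines, then pipe them to `sudo sh` or paste them with `sudo` once they look right for the rail in question.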

11. perftest benchmarking

Task	Command
Install perftest	`sudo apt install perftest` / `sudo dnf install perftest`
Server (single rail)	`nohup ib_send_bw -d mlx5_0 -R -F -D 60 > /tmp/s.log 2>&1 &`
Client (peer host)	`ib_send_bw -d mlx5_0 -R -F -D 60 <server_ip>`
Throughput (WRITE, faster)	`ib_write_bw -d mlx5_0 -R -b -F -x 3 -q 16 -s 131072 -t 512 -D 30 --report_gbits --bind_source_ip=10.0.0.10 <server_ip>`
Latency (small-msg ping-pong)	`ib_send_lat -d mlx5_0 -R -F <server_ip>`
GPUDirect (CUDA-built perftest)	`ib_write_bw --use_cuda=0 -R -b -F -x 3 ... <server_ip>`

Critical flag reference:

Flag	What it does	When you need it
`-R`	Use RDMA-CM for connection setup (bypasses TCP routing)	Always on multi-rail hosts
`-b`	Bidirectional	Production-realistic
`-x 3`	RoCEv2 IPv4 GID index	Always for RoCE v2
`--bind_source_ip`	Pin source IP on this rail	Required on multi-rail w/o main-table fallback
`-q 16`	16 QPs per pair	Optimal at 400G+
`-s 131072`	128 KB messages	NCCL-shaped traffic
`-t 512`	TX depth	Deeper PCIe pipelining
`-F`	Don't fail on CPU freq scaling warning	Convenience
`-D 60`	60-second test duration	Stable averages

Expected results (rough — single-pair, healthy fabric):

Test	Per-NIC	Aggregate (per host)
H100, single-pair, single-rail	~385 Gbps (96% wire)	—
H100, NCCL allreduce, 4-node	—	~195 GB/s busbw
B200/B300, single-PF alone	~392 Gbps (98% wire)	—
B200/B300, all PFs concurrent (bidir)	350–375 Gbps each	~5000 Gbps aggregate
B200/B300, NCCL NVLSTree allreduce	—	~880 GB/s busbw
Cross-rack ICMP latency	—	0.13–0.5 ms
Cross-rack RDMA latency	3–5 µs	—

Bandwidth way below expected? Check in this order:

PCIe link state (LnkSta) — Gen5/6 x16?
NUMA pinning — taskset to the NIC's local CPUs?
GID index — -x 3 for RoCEv2 IPv4?
Ring sizes — ethtool -g, set 8192/8192
Congestion signals? Check port_xmit_wait (PFC pause) or np_cnp_sent (ECN/CNP)
Source binding — --bind_source_ip on multi-rail?
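
For the NUMA pinning step in the checklist above, the NIC's own `local_cpulist` file (section 3) already holds a CPU list in the form `taskset -c` accepts. A minimal sketch that builds the pinned command line; the helper name `pinned_cmd` is hypothetical, and it prints the command so you can inspect it before running.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a taskset-prefixed command pinned to a NIC's local CPUs.
# $1 = NIC device dir (e.g. /sys/class/net/enp1s0f0/device); the rest = command to run.
pinned_cmd() {
  local dev_dir="$1" cpus
  shift
  cpus=$(cat "$dev_dir/local_cpulist" 2>/dev/null) || return 1
  [ -n "$cpus" ] || return 1
  echo "taskset -c $cpus $*"
}

# Usage on a real host:
#   pinned_cmd /sys/class/net/enp1s0f0/device ib_write_bw -d mlx5_0 -R -F <server_ip>
```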

12. RoCE collective-library environment variables (NCCL / RCCL / oneCCL)

The table below uses NVIDIA's NCCL_* names. AMD's RCCL reuses every NCCL_* name unchanged, so each row applies verbatim. On Intel's oneCCL the concepts map across, but the variables live in the CCL_* namespace (for example CCL_LOG_LEVEL, CCL_WORKER_COUNT); see the oneCCL docs.

Variable	Production value	Why
`NCCL_IB_HCA`	`^ib,^mlx5_1:1` (SR-IOV) or `mlx5_0,mlx5_1,mlx5_2,mlx5_3` (hostnet)	Pin to RDMA devices
`NCCL_IB_GID_INDEX`	`3`	RoCEv2 IPv4 GID
`NCCL_IB_TC`	`106`	DSCP 26 + ECN ECT bits = lossless queue
`NCCL_IB_SL`	`5`	Service Level mapped to TC 106
`NCCL_CROSS_NIC`	`0`	Each rail handles its own traffic
`NCCL_IB_QPS_PER_CONNECTION`	`16`	Spread QPs for full bandwidth
`NCCL_MIN_NCHANNELS` / `NCCL_MAX_NCHANNELS`	`16`	Match QP count
`NCCL_IB_PCI_RELAXED_ORDERING`	`1`	H100 perf (~10% gain)
`NCCL_IB_TIMEOUT`	`22`	Longer than default for PFC tolerance
`NCCL_IB_RETRY_CNT`	`7`	Transport retry count (a 3-bit QP field, so 7 is the maximum; not the RNR retry count)
`NCCL_NET_GDR_LEVEL`	`PHB`	Max GPU↔NIC distance at which GPUDirect RDMA is still used
`NCCL_SOCKET_IFNAME`	`bond0`	Bootstrap on control plane (NOT the RDMA NICs)
`NCCL_DEBUG`	`INFO` (first run) / unset (prod)	Verbose logging
`NCCL_IB_ADAPTIVE_ROUTING`	`1`	Adaptive-routing-capable fabrics
`NCCL_IB_SPLIT_DATA_ON_QPS`	`1`	Multi-QP load distribution

Pitfall — bootstrap interface: NCCL_SOCKET_IFNAME must point at a non-RDMA interface (typically a TCP bond). If you accidentally point it at an RDMA NIC, bootstrap will work but you'll lose throughput because some bookkeeping traffic competes with the data plane.

13. Pre-flight checklist for a new RoCE host

Run all these on a freshly-imaged host before declaring it ready for production workloads:

Hardware

□ nvidia-smi -L                    → all 8 GPUs visible  (AMD: rocm-smi -L)
□ ibv_devices                      → all backend NICs visible
□ ibstat | grep 'State:'           → all backend NICs "State: Active" (PORT_ACTIVE in ibv_devinfo)
□ lspci -vv ... LnkSta             → Gen5 x16 (H100) or Gen6 x16 (B200/B300)

Drivers

□ lsmod | grep mlx5_core           → loaded
□ lsmod | grep mlx5_ib             → loaded
□ lsmod | grep nvidia_peermem      → loaded (GPUDirect RDMA ready)
□ ethtool -i <nic>                 → driver + firmware same on all NICs
□ ofed_info -s                     → OFED version printed (if vendor OFED used)

Lossless config

□ mlnx_qos -i <nic> | grep 'trust state'  → dscp
□ mlnx_qos -i <nic> | grep -A2 'PFC config' → enabled 0 0 0 1 0 0 0 0
□ cat /sys/class/net/<nic>/ecn/roce_np/enable/3 → 1
□ cat /sys/class/net/<nic>/ecn/roce_rp/enable/3 → 1
□ ethtool -g <nic>                          → 8192/8192

Multi-rail routing

□ ip rule show | grep '10010\|10011\|10012\|10013'    → 4 rules present
□ ip route show table 101 | grep <subnet>             → table populated
□ sysctl net.ipv4.conf.all.arp_ignore                 → 2
□ sysctl net.ipv4.conf.all.arp_announce               → 1
□ sysctl net.ipv4.conf.all.rp_filter                  → 2
□ ip route get <peer_ip> from <local_ip>              → via correct gateway, correct dev

Counter baseline

□ All hw_counters and counters at 0 (no traffic yet)
□ link_downed = 0   (no flaps since boot)
□ symbol_error = 0  (clean physical layer)

Cross-host (with peer)

□ ping -c 3 -I <local_rail_ip> <peer_rail_ip>  → 0% loss
□ Repeat for all rails
□ ib_send_bw -d <nic> -R -F -D 10 <peer>       → line rate

14. Troubleshooting recipes

"RDMA bandwidth way below expected"

lspci -vv ... LnkSta → confirm Gen5/Gen6 x16
numactl --hardware and match process to the NIC's NUMA node
Counter pre/post diff → look for np_cnp_sent (congestion) or port_xmit_wait (PFC)
Try -q 16 -s 131072 -t 512 instead of defaults
Verify NCCL_IB_GID_INDEX=3

"`ib_send_bw` hangs or connection refused on multi-rail host"

Source binding is required. Use --bind_source_ip=<local_rail_ip> and -R. This is by design — control plane and data plane are routed separately on multi-rail hosts.

"Pod stuck Pending — `nvidia.com/roce_hca_*: 0`"

Check SR-IOV operator status: kubectl get sriovnetworknodestates -n sriov-network-operator → syncStatus: Succeeded?
Confirm node feature labels for RDMA / transport type
Confirm VF count: cat /sys/class/net/<nic>/device/sriov_numvfs should print a value greater than 0

"`IBV_WC_RETRY_EXC_ERR` after a few minutes of running"

Two common causes:

TC=0 lossy queue — set NCCL_IB_TC=106 and NCCL_IB_SL=5
ARP ambiguity between co-located VFs on the same subnet — fix with arp_announce=1, arp_ignore=2

"Cross-rack latency above 10 µs"

Confirm cross-rack traffic actually traverses the spine (TTL decode: 64 - TTL = hops)
Check NUMA pinning of the process
CPU governor → performance (sudo cpupower frequency-set -g performance)
Check for PFC pause events on intermediate switches
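
The TTL decode in the first step can be done straight from a ping reply line. A minimal sketch, assuming the peer sends with an initial TTL of 64 as in the formula above; the helper name `ping_line_hops` is hypothetical, and the sample line is illustrative, not captured output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: hops = initial TTL - received TTL, parsed from a ping reply line.
# The initial TTL defaults to 64 (assumption; pass a second argument to override).
ping_line_hops() {
  local line="$1" initial="${2:-64}" ttl
  ttl=$(printf '%s\n' "$line" | sed -n 's/.*ttl=\([0-9]*\).*/\1/p')
  [ -n "$ttl" ] || return 1
  echo $((initial - ttl))
}

ping_line_hops "64 bytes from 10.0.1.14: icmp_seq=1 ttl=61 time=0.210 ms"
```

If the hop count is lower than the leaf-spine-leaf path should give, the traffic is probably not crossing the spine at all.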

"All NICs show identical config but one rail is slow"

Check switch-side per-port counters. Most likely:

Bad optic / cable on that rail
PFC firing on that switch port specifically
ECMP hash polarization sending too much traffic over that path

"Counters frozen at 0 during a benchmark"

nvidia-smi dmon -s t → if rxpci/txpci show 0 MB/s during a GPUDirect run, that's correct (data bypasses host PCIe)
Check port_xmit_data on the actual sending NIC, not the GPU
Confirm the workload is sending — tcpdump on the NIC

"`rp_cnp_ignored` > 0"

DCQCN tuning bug. NIC is receiving CNPs but not reacting. Check:

cat /sys/class/net/<nic>/ecn/roce_rp/enable/3 == 1
Firmware version supports DCQCN on this priority

"Job hangs partway through with no clear error"

Likely PFC deadlock or a buffer-credit issue at the switch. From the host side:

Pull port_xmit_wait deltas across all rails — large skew = one rail is being paused much more
Check switch-side PFC counters at the leaf
If one rail is stuck paused, drain the misbehaving downstream consumer
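
The skew check in the first step can be sketched as a small filter. It reads `rail delta` pairs on stdin (the port_xmit_wait deltas you collected per rail) and prints the most-paused rail with its max/min ratio. The helper name `xmit_wait_skew` and the sample numbers are hypothetical; what counts as "large" skew is left to your own baseline.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "rail delta" lines, print the busiest rail and max/min ratio.
xmit_wait_skew() {
  awk 'NR == 1 { min = max = $2; maxrail = $1 }
       $2 > max { max = $2; maxrail = $1 }
       $2 < min { min = $2 }
       END { if (NR == 0) exit 1
             ratio = (min > 0) ? sprintf("%.1f", max / min) : "inf"
             print maxrail, ratio }'
}

printf '%s\n' "mlx5_0 1000" "mlx5_1 1200" "mlx5_2 9000" "mlx5_3 900" | xmit_wait_skew
```

A ratio near 1 means the rails are paused about equally; a single rail far above the rest is the one to chase on the switch side.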

15. The two-minute health check

When you SSH to a host and want the fastest possible "is this box healthy?" — paste this:

echo "=== HOSTNAME ===" ; hostname
echo "=== NICs ===" ; ibv_devices
echo "=== Link states ===" ; ibstat | grep -E "CA |State:"
echo "=== PFC + ECN ===" ; sudo mlnx_qos -i enp1s0f0 | grep -A2 "PFC config"   # -i takes the netdev name, not mlx5_0
echo "=== Source rules ===" ; ip rule show | grep -E "100[0-9][0-9]"
echo "=== nvidia_peermem ===" ; lsmod | grep nvidia_peermem
echo "=== Link errors (should be 0) ===" ; \
  for n in mlx5_0 mlx5_1 mlx5_2 mlx5_3; do \
    echo "$n link_downed: $(cat /sys/class/infiniband/$n/ports/1/counters/link_downed 2>/dev/null)"; \
  done

Green = all NICs report State: Active, PFC enabled on priority 3, source rules present, nvidia_peermem loaded, link_downed=0 across the board.

If anything is red, walk back through the corresponding section above before running workloads.

What you should bookmark

If you remember nothing else, these are the snippets worth pinning:

Confirm RoCE is on the wire

ibv_devinfo -d mlx5_0 | grep -E 'link_layer|active_mtu|state'

The two-minute health check (section 15 above) — paste it on every fresh host.

NCCL production env block

export NCCL_IB_GID_INDEX=3
export NCCL_IB_TC=106
export NCCL_IB_SL=5
export NCCL_IB_QPS_PER_CONNECTION=16
export NCCL_CROSS_NIC=0
export NCCL_SOCKET_IFNAME=bond0

Counter pre/post diff (section 9) — the cleanest way to attribute symptoms to a workload.

Lossless config one-liner

# These tools take the netdev name (enp1s0f0), not the RDMA device name (mlx5_0)
sudo mlnx_qos -i enp1s0f0 --trust dscp --pfc 0,0,0,1,0,0,0,0 --prio_tc 0,0,0,3,0,0,0,0
echo 1 | sudo tee /sys/class/net/enp1s0f0/ecn/roce_np/enable/3 \
                  /sys/class/net/enp1s0f0/ecn/roce_rp/enable/3
sudo ethtool -G enp1s0f0 rx 8192 tx 8192

Source-based routing for one rail

sudo ip route add 10.0.0.0/16 via 10.0.0.1 dev enp1s0f0 table 101
sudo ip rule add from 10.0.0.10 lookup 101 priority 10010
sudo sysctl -w net.ipv4.conf.all.arp_ignore=2 \
               net.ipv4.conf.all.arp_announce=1 \
               net.ipv4.conf.all.rp_filter=2

Throughput sanity test

# server
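# (must be the same test and matching flags as the client)
ib_write_bw -d mlx5_0 -R -b -F -x 3 -q 16 -s 131072 -t 512 -D 30 --report_gbits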
# client
ib_write_bw -d mlx5_0 -R -b -F -x 3 -q 16 -s 131072 -t 512 -D 30 \
            --report_gbits --bind_source_ip=10.0.0.10 <server_ip>

The canonical lossless chain

DSCP 26  →  Priority 3  →  TC 3  →  Lossless Queue (PFC + ECN)

Counters that mean "stop, investigate now"

rp_cnp_ignored > 0                    → DCQCN tuning bug
req_rnr_retries_exceeded > 0          → QP died
req_transport_retries_exceeded > 0    → QP died
link_downed > 0                       → bad optic/cable

What confirms RoCE v2 specifically (vs IB or RoCE v1):
- link_layer: Ethernet in ibv_devinfo
- GID slot 3 = ::ffff:<your_ipv4> with type RoCE v2
- tcpdump 'udp port 4791' shows RoCE v2 packets (UDP destination port 4791) during a benchmark

That's the whole single-page reference. Walk the sections top-to-bottom on an unfamiliar host and you'll have a full operational picture in 5 minutes. Walk the troubleshooting recipes when something's wrong and you'll usually find the answer before you have to escalate.
