RoCE v2 Operator Cheatsheet
Single-page reference card. Open it in a second tab while you work on a real RoCE host. No narrative — every section is a table of commands or a code block you can paste.
Your job here is to stop guessing and start typing. Each section answers one operational question: "what's installed", "how is the NIC configured", "what does the QoS chain look like", "is anything dropping". On an unfamiliar host, walk the sections top to bottom and you'll have a full picture in 5 minutes.
The 2-minute health check (section 15 below) runs in this order: Hostname → RDMA devices → port states → PFC config → source rules → nvidia_peermem → link-down counters. If any line goes red, you know exactly where to start.
1. Quick box identity
| You want to know... | Command |
|---|---|
| Hostname | hostname |
| OS + kernel | uname -r ; cat /etc/os-release | grep -E '^(NAME|VERSION)=' |
| CPU model | lscpu | grep 'Model name' |
| Sockets + NUMA | lscpu | grep -E 'Socket|NUMA' |
| RAM | free -h |
| Uptime | uptime |
2. NIC inventory
| Task | Command |
|---|---|
| List Mellanox/NVIDIA PCI devices | lspci | grep -i mellanox |
| List RDMA devices | ibv_devices |
| RDMA link state | rdma link show |
| IB-style port state + rate | ibstat | grep -E 'CA |State:|Rate:|Port ' |
| Full RDMA device dump | ibv_devinfo -d mlx5_0 (or any device) |
| Netdev list | ip -br link show |
| Per-NIC link + MTU | ip link show dev enp1s0f0 |
vendor_part_id decode (ConnectX family):
4119 = CX-5 4125 = CX-6 Dx 4129 = CX-7 4131 = CX-8
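The decode table above can be wrapped in a small helper so the ID from ibv_devinfo does not have to be looked up by hand. This is a minimal sketch: the function name `decode_part_id` is made up for illustration, and it covers only the four IDs listed on this page.

```shell
# Map a vendor_part_id (as printed by ibv_devinfo) to a ConnectX generation.
# Only the four IDs from the table above are known to this sketch.
decode_part_id() {
  case "$1" in
    4119) echo "CX-5" ;;
    4125) echo "CX-6 Dx" ;;
    4129) echo "CX-7" ;;
    4131) echo "CX-8" ;;
    *)    echo "unknown ($1)" ;;
  esac
}

decode_part_id 4129   # prints CX-7

# On a real host (needs ibv_devinfo installed):
# ibv_devinfo | awk '/vendor_part_id/ {print $2}' | while read -r id; do decode_part_id "$id"; done
```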
3. PCIe + NUMA + GPU topology
| Task | Command |
|---|---|
| NUMA topology | numactl -H |
| NUMA node per NIC | cat /sys/class/net/enp1s0f0/device/numa_node |
| Local CPUs per NIC | cat /sys/class/net/enp1s0f0/device/local_cpulist |
| PCIe link state per NIC | sudo lspci -vv -s 03:00.0 | grep -E 'LnkCap|LnkSta' |
| Full PCIe topology | lstopo --no-io --no-icaches |
| GPU inventory | nvidia-smi -L |
| GPU↔NIC matrix | nvidia-smi topo -m |
PCIe speed table:
Gen3 = 8 GT/s, x16 = 128 Gbps
Gen4 = 16 GT/s, x16 = 256 Gbps
Gen5 = 32 GT/s, x16 = 512 Gbps ← H100 hosts
Gen6 = 64 GT/s, x16 = 1024 Gbps ← B200 / B300 hosts
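The table is just per-lane rate times lane count, which is handy when LnkSta reports a degraded width such as x8. A rough sketch, ignoring encoding and protocol overhead exactly as the table does; `pcie_raw_gbps` is a hypothetical helper name.

```shell
# Raw PCIe link rate in Gbps: per-lane GT/s multiplied by lane count.
# Encoding and protocol overhead are ignored, matching the table above.
pcie_raw_gbps() (
  gen=$1
  lanes=$2
  case "$gen" in
    3) gts=8 ;;
    4) gts=16 ;;
    5) gts=32 ;;
    6) gts=64 ;;
    *) echo "unsupported generation: $gen" >&2; exit 1 ;;
  esac
  echo $(( gts * lanes ))
)

pcie_raw_gbps 5 16   # prints 512
pcie_raw_gbps 5 8    # prints 256 (a Gen5 link that trained at x8)
```

A Gen5 x8 result of 256 Gbps is below a 400G NIC's line rate, which is why the LnkSta check comes first in the bandwidth triage list.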
GPU↔NIC topology legend (nvidia-smi topo -m):
PIX = at most one PCIe bridge, i.e. same PCIe switch (ideal for GPUDirect RDMA)
PXB = multiple PCIe bridges, but no host bridge crossed
PHB = through a PCIe Host Bridge (CPU)
NODE = same NUMA node, across PCIe Host Bridges
SYS = cross-NUMA via UPI/Infinity Fabric: NEVER for RDMA
NVx = NVLink between GPUs only
4. ibv_devinfo — every field decoded
| Field | Meaning |
|---|---|
| hca_id | RDMA device name (e.g. mlx5_0) |
| transport: InfiniBand | Verbs API model; always this on Mellanox/NVIDIA, even for RoCE |
| link_layer: Ethernet | The line that confirms RoCE (vs InfiniBand for true IB) |
| fw_ver | NIC firmware version |
| node_guid | Globally unique NIC ID (64-bit) |
| vendor_part_id | NIC chip ID (see decode table above) |
| phys_port_cnt | Physical ports on this NIC |
| state: PORT_ACTIVE (4) | Link is up; (1) = down |
| active_mtu: 4096 (5) | RDMA-level MTU (NOT Ethernet MTU) |
| sm_lid: 0 | IB-only; always 0 on RoCE |
MTU gotcha: there are always two.
- RDMA MTU 4096 = max RDMA packet size (NIC silicon)
- Ethernet MTU 9000 = jumbo frame size (ip link)
- Both must align: the Ethernet MTU has to be large enough to carry the RDMA MTU plus the RoCE headers. Mismatch = drops at line rate.
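To sanity-check the pairing, the sketch below picks the largest RDMA MTU that fits inside a given Ethernet MTU. The 96-byte header allowance (`ROCE_OVERHEAD`) is an assumption of this sketch rather than a number from this page, and the function name is hypothetical; treat active_mtu from ibv_devinfo as the ground truth.

```shell
# Largest RDMA MTU (256..4096) that fits inside an Ethernet MTU.
# ROCE_OVERHEAD is an assumed allowance for RoCE v2 headers, not a value
# taken from this page; adjust it if your driver accounts differently.
ROCE_OVERHEAD=96

rdma_mtu_for() (
  room=$(( $1 - ROCE_OVERHEAD ))
  best=0
  for m in 256 512 1024 2048 4096; do
    if [ "$room" -ge "$m" ]; then best=$m; fi
  done
  echo "$best"
)

rdma_mtu_for 9000   # prints 4096
rdma_mtu_for 1500   # prints 1024
```

Under this assumption, a host left at the default Ethernet MTU of 1500 cannot reach active_mtu 4096, which is the mismatch the gotcha warns about.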
5. Driver + firmware stack
| Task | Command |
|---|---|
| Loaded mlx5 modules | lsmod | grep -E 'mlx5|ib_|rdma' |
| mlx5_core info | modinfo mlx5_core | head -10 |
| Driver + firmware per NIC | ethtool -i enp1s0f0 |
| OFED flavor (vendor OFED installed?) | ofed_info -s 2>&1 |
| OFED packages | rpm -qa | grep -iE 'mlnx|mellanox|rdma|ibverbs' (or dpkg -l) |
| Module file path | modinfo -F filename mlx5_core |
| /dev/infiniband entry points | ls -la /dev/infiniband/ |
| /sys/class/infiniband devices | ls /sys/class/infiniband/ |
Path tells flavor:
/lib/modules/.../updates/... → vendor OFED override (DOCA / MLNX_OFED)
/lib/modules/.../kernel/... → inbox OFED (distro default)
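The path rule above is easy to script. A small sketch, with `ofed_flavor` as a hypothetical helper name; it only inspects the path string, so it works on any output of modinfo -F filename.

```shell
# Classify a kernel module path as vendor OFED or inbox, per the rule above.
ofed_flavor() {
  case "$1" in
    */updates/*) echo "vendor" ;;
    */kernel/*)  echo "inbox" ;;
    *)           echo "unknown" ;;
  esac
}

# On a real host:
# ofed_flavor "$(modinfo -F filename mlx5_core)"
```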
The three mlx5 modules:
- mlx5_core: PCI device owner (foundation)
- mlx5_ib: RDMA personality (Verbs API)
- mlx5_en: Ethernet personality (netdev)
GPUDirect bridge module:
lsmod | grep nvidia_peermem # must show loaded for GPUDirect RDMA
6. SR-IOV (PFs and VFs)
| Task | Command |
|---|---|
| Firmware max VFs | cat /sys/class/net/enp1s0f0/device/sriov_totalvfs |
| Active VFs now | cat /sys/class/net/enp1s0f0/device/sriov_numvfs |
| Spawn N VFs | echo N | sudo tee /sys/class/net/enp1s0f0/device/sriov_numvfs |
| Destroy all VFs | echo 0 | sudo tee /sys/class/net/enp1s0f0/device/sriov_numvfs |
| Firmware query | sudo mlxconfig -d /sys/bus/pci/devices/0000:03:00.0 query | grep -iE 'SRIOV_EN|NUM_OF_VFS|NUM_OF_PF' |
| List VFs in lspci | lspci -D | grep '0000:03:00\.' |
| VF state from PF | ip link show dev enp1s0f0 |
| Assign VF MAC | sudo ip link set dev enp1s0f0 vf 0 mac 02:00:00:01:00:00 |
| Assign VF IP | sudo ip addr add 10.0.0.10/27 dev enp1s0f0v0 |
| IOMMU groups | for bdf in 03:00.0 03:00.1; do echo "$bdf: $(readlink /sys/bus/pci/devices/0000:$bdf/iommu_group | xargs basename)"; done |
The lifecycle:
firmware enable → reboot → runtime spawn (echo N) → assign MAC/IP → workload binds
Inside a Kubernetes pod: VFs show as mlx5_0..N (or whatever the CNI / Multus names them). Standard pattern is one pod = one VF per PF.
7. GID table
| Task | Command |
|---|---|
| List all GIDs (vendor utility) | show_gids mlx5_0 |
| Raw GID dump (any kernel) | for i in $(seq 0 7); do echo "[$i] $(cat /sys/class/infiniband/mlx5_0/ports/1/gids/$i)"; done |
| GID type per slot | cat /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/types/3 |
Typical GID slot layout (PF only):
[0] fe80::.... IB / RoCE v1 (legacy, don't use)
[1] fe80::.... RoCE v2 (IPv6 link-local)
[2] ::ffff:X.X.X.X IB / RoCE v1 (legacy)
[3] ::ffff:X.X.X.X RoCE v2 ← NCCL_IB_GID_INDEX=3
With VFs, GIDs continue in 4-slot blocks (v1+v2 × IPv6+IPv4).
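Because the RoCE v2 IPv4 slot is not always index 3 (VFs and hosts with extra addresses shift it), it can help to look the index up rather than assume it. A sketch that scans the sysfs paths used in the table above; the function name is hypothetical and the default path assumes device mlx5_0, port 1.

```shell
# Print the first GID index on a port that is RoCE v2 with an IPv4-mapped
# address. $1 = port directory (default assumes mlx5_0, port 1).
find_rocev2_ipv4_gid() (
  port_dir=${1:-/sys/class/infiniband/mlx5_0/ports/1}
  i=0
  while [ -e "$port_dir/gids/$i" ]; do
    # Empty slots can fail to read; treat them as blank instead of aborting.
    gid=$(cat "$port_dir/gids/$i" 2>/dev/null) || gid=""
    type=$(cat "$port_dir/gid_attrs/types/$i" 2>/dev/null) || type=""
    case "$gid" in
      0000:0000:0000:0000:0000:ffff:*)
        if [ "$type" = "RoCE v2" ]; then echo "$i"; exit 0; fi ;;
    esac
    i=$(( i + 1 ))
  done
  exit 1
)

# On a real host:
# export NCCL_IB_GID_INDEX=$(find_rocev2_ipv4_gid)
```

The function returns non-zero when no matching slot exists, so a wrapper script can stop instead of exporting an empty index.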
8. Lossless host config (PFC + ECN + DSCP)
| Task | Command |
|---|---|
| Full QoS summary | sudo mlnx_qos -i enp1s0f0 |
| Trust mode | cat /sys/class/net/enp1s0f0/qos/trust |
| Set trust DSCP | sudo mlnx_qos -i enp1s0f0 --trust dscp |
| DSCP → priority map | sudo mlnx_qos -i enp1s0f0 --dscp2prio set,26,3 |
| Priority → TC map | sudo mlnx_qos -i enp1s0f0 --prio_tc 0,0,0,3,0,0,0,0 |
| Enable PFC on priority 3 | sudo mlnx_qos -i enp1s0f0 --pfc 0,0,0,1,0,0,0,0 |
| Enable ECN NP+RP on prio 3 | echo 1 | sudo tee /sys/class/net/enp1s0f0/ecn/roce_np/enable/3 /sys/class/net/enp1s0f0/ecn/roce_rp/enable/3 |
| Set ring sizes (max) | sudo ethtool -G enp1s0f0 rx 8192 tx 8192 |
| Show ring sizes | ethtool -g enp1s0f0 |
| Set egress ToS for RDMA (DSCP 26 → ToS byte 106) | echo 106 | sudo tee /sys/class/infiniband/mlx5_0/tc/1/traffic_class |
| DSCP capture (tos is only printed with -v) | sudo tcpdump -i enp1s0f0 -nn -v -e -c 5 | grep tos |
The canonical chain — memorize this:
DSCP 26 → Priority 3 → Traffic Class 3 → Lossless Queue (PFC + ECN)
NCCL_IB_TC=106 math: DSCP 26 × 4 = 104 (TOS byte = DSCP shifted left 2 bits), plus ECN field value 2 (binary 10, ECT(0)) = 106. TC=0 in NCCL = lossy queue = IBV_WC_RETRY_EXC_ERR under load.
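The same arithmetic as a one-line helper, useful when a site uses a DSCP other than 26. The function name `tos_from_dscp` is made up for this sketch; the second argument is the 2-bit ECN field and defaults to 2.

```shell
# ToS byte = (DSCP << 2) | ECN field.
# ECN value 2 is binary 10; ECN value 0 means not ECN-capable.
tos_from_dscp() {
  echo $(( ($1 << 2) | ${2:-2} ))
}

tos_from_dscp 26     # prints 106
tos_from_dscp 26 0   # prints 104
```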
Typical switch-side buffer thresholds (priority 3, lossless queue):
Kmin = 1 MB (ECN marking starts)
Kmax = 5 MB (ECN marking at max probability ~10%)
XOFF = ~8 MB (PFC PAUSE fires — last-resort safety net)
Exact numbers depend on switch silicon, port speed, and cable RTT.
9. RDMA counters — where they live, what they mean
Path 1: RoCE-specific (/sys/class/infiniband/<dev>/ports/1/hw_counters/)
| Counter | What going up means |
|---|---|
| np_ecn_marked_roce_packets | RX packets with CE bit set |
| np_cnp_sent | CNPs this NIC generated (acting as Notification Point) |
| rp_cnp_handled | CNPs reacted to (DCQCN engaged, slowing down) |
| rp_cnp_ignored | CNPs ignored; any non-zero = tuning bug |
| out_of_sequence | Reordering observed (adaptive routing / multipath) |
| packet_seq_err | Drops detected |
| req_rnr_retries_exceeded | QP died: Receiver Not Ready retry exhausted |
| req_transport_retries_exceeded | QP died: transport retry exhausted |
| local_ack_timeout_err | QP died: ACK timeout |
| roce_slow_restart | DCQCN slow-restart events |
| out_of_buffer | App not posting receive WRs fast enough |
| rx_read_requests, rx_write_requests, rx_atomic_requests | Traffic stats |
Path 2: Standard IB counters (/sys/class/infiniband/<dev>/ports/1/counters/)
| Counter | What going up means |
|---|---|
| port_xmit_data | Data sent, counted in 4-byte words (multiply by 4 for bytes) |
| port_rcv_data | Data received, counted in 4-byte words (multiply by 4 for bytes) |
| port_xmit_packets, port_rcv_packets | Packet counts |
| port_xmit_discards | TX drops; check ring sizes |
| port_xmit_wait | TX wait cycles; proxy for PFC pause time |
| link_downed | Link flap count (>0 = bad optic/cable) |
| symbol_error | Physical-layer errors |
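Since port_xmit_data and port_rcv_data tick once per 4 bytes, turning two samples into a throughput figure takes one conversion. A minimal sketch using awk for the floating-point math; `data_gbps` is a hypothetical helper name.

```shell
# Convert two port_xmit_data (or port_rcv_data) samples to Gbps.
# Each counter unit is 4 bytes, so bits = delta * 4 * 8.
# Usage: data_gbps <before> <after> <seconds>
data_gbps() {
  awk -v pre="$1" -v post="$2" -v secs="$3" \
    'BEGIN { printf "%.2f\n", (post - pre) * 4 * 8 / secs / 1e9 }'
}

data_gbps 0 1250000000 1   # prints 40.00

# On a real host:
# c=/sys/class/infiniband/mlx5_0/ports/1/counters/port_xmit_data
# a=$(cat "$c"); sleep 10; b=$(cat "$c"); data_gbps "$a" "$b" 10
```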
Pre/post benchmark counter diff
The cleanest way to see what a single workload caused:
mkdir -p /tmp/pre /tmp/post
dir=/sys/class/infiniband/mlx5_0/ports/1/hw_counters
# Snapshot before
for f in "$dir"/*; do
  cp "$f" "/tmp/pre/$(basename "$f")"
done
# ... run benchmark ...
# Snapshot after
for f in "$dir"/*; do
  cp "$f" "/tmp/post/$(basename "$f")"
done
# Print only the counters that increased during the run
for f in /tmp/post/*; do
  name=$(basename "$f")
  pre=$(cat "/tmp/pre/$name")
  post=$(cat "$f")
  delta=$((post - pre))
  if [ "$delta" -gt 0 ]; then echo "$name: $delta"; fi
done
10. Multi-rail Linux routing
| Task | Command |
|---|---|
| List ip rules | ip rule show |
| Show table | ip route show table 101 |
| Test routing decision | ip route get 10.0.1.14 from 10.0.0.10 |
| Add per-NIC table route | sudo ip route add 10.0.0.0/16 via 10.0.0.1 dev enp1s0f0 table 101 |
| Add source-based rule | sudo ip rule add from 10.0.0.10 lookup 101 priority 10010 |
| Check ARP sysctls | for nic in all enp1s0f0 enp1s0f1; do echo "$nic: $(sysctl -n net.ipv4.conf.$nic.arp_ignore net.ipv4.conf.$nic.arp_announce net.ipv4.conf.$nic.rp_filter)"; done |
| Set multi-rail sysctls (use all) | sudo sysctl -w net.ipv4.conf.all.arp_ignore=2 net.ipv4.conf.all.arp_announce=1 net.ipv4.conf.all.rp_filter=2 net.ipv4.conf.all.accept_local=1 |
| Source-bound ping (per rail) | ping -c 2 -I 10.0.0.10 10.0.1.14 |
| Source-bound TCP test | nc -s 10.0.0.10 -zv 10.0.1.14 22 |
| Capture ARP per NIC | sudo tcpdump -nn -e -i enp1s0f0 arp -c 10 |
The four pieces of multi-rail correctness — all four must be in place:
- Per-NIC routing tables (101 / 102 / 103 / 104)
- Source-based rules (from <ip> lookup <table>)
- ARP tuning (arp_ignore=2, arp_announce=1)
- Loose RPF (rp_filter=2)
Sysctl all override: Effective value = max(all, per-iface). Setting on all is enough.
Why source binding matters: without source rules, the kernel picks the egress NIC from the main table by longest-prefix match on the destination alone, ignoring the source IP, which is usually the wrong rail on a multi-rail box. Apps must pin their source IP explicitly (--bind_source_ip, -I, SO_BINDTODEVICE, etc.).
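The per-rail table and rule commands follow a fixed pattern, so they can be generated instead of typed. The sketch below only prints the commands for review (it does not run them); the function name, the second netdev, and the second rail's addresses are illustrative placeholders, while the table and priority numbering follow the 101 / 10010 scheme used on this page.

```shell
# Print (do not run) routing commands for each rail.
# Input lines: <netdev> <local_ip> <gateway> <subnet>
# Rail N gets table 100+N and rule priority 10009+N.
gen_rail_rules() {
  awk 'NF == 4 {
    n++
    printf "ip route add %s via %s dev %s table %d\n", $4, $3, $1, 100 + n
    printf "ip rule add from %s lookup %d priority %d\n", $2, 100 + n, 10009 + n
  }'
}

# Second rail values below are placeholders for illustration.
gen_rail_rules <<'EOF'
enp1s0f0 10.0.0.10 10.0.0.1 10.0.0.0/16
enp1s0f1 10.0.0.42 10.0.0.33 10.0.0.0/16
EOF
```

Review the printed lines, then pipe them through sudo sh (or prefix each with sudo) once they match the host's addressing plan.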
11. perftest benchmarking
| Task | Command |
|---|---|
| Install perftest | sudo apt install perftest / sudo dnf install perftest |
| Server (single rail) | nohup ib_send_bw -d mlx5_0 -R -F -D 60 > /tmp/s.log 2>&1 & |
| Client (peer host) | ib_send_bw -d mlx5_0 -R -F -D 60 <server_ip> |
| Throughput (WRITE, faster) | ib_write_bw -d mlx5_0 -R -b -F -x 3 -q 16 -s 131072 -t 512 -D 30 --report_gbits --bind_source_ip=10.0.0.10 <server_ip> |
| Latency (small-msg ping-pong) | ib_send_lat -d mlx5_0 -R -F <server_ip> |
| GPUDirect (CUDA-built perftest) | ib_write_bw --use_cuda=0 -R -b -F -x 3 ... <server_ip> |
Critical flag reference:
| Flag | What it does | When you need it |
|---|---|---|
| -R | Use RDMA-CM for connection setup (bypasses TCP routing) | Always on multi-rail hosts |
| -b | Bidirectional | Production-realistic |
| -x 3 | RoCEv2 IPv4 GID index | Always for RoCE v2 |
| --bind_source_ip | Pin source IP on this rail | Required on multi-rail w/o main-table fallback |
| -q 16 | 16 QPs per pair | Optimal at 400G+ |
| -s 131072 | 128 KB messages | NCCL-shaped traffic |
| -t 512 | TX depth | Deeper PCIe pipelining |
| -F | Don't fail on CPU freq scaling warning | Convenience |
| -D 60 | 60-second test duration | Stable averages |
Expected results (rough — single-pair, healthy fabric):
| Test | Per-NIC | Aggregate (per host) |
|---|---|---|
| H100, single-pair, single-rail | ~385 Gbps (96% wire) | — |
| H100, NCCL allreduce, 4-node | — | ~195 GB/s busbw |
| B200/B300, single-PF alone | ~392 Gbps (98% wire) | — |
| B200/B300, all PFs concurrent (bidir) | 350–375 Gbps each | ~5000 Gbps aggregate |
| B200/B300, NCCL NVLSTree allreduce | — | ~880 GB/s busbw |
| Cross-rack ICMP latency | — | 0.13–0.5 ms |
| Cross-rack RDMA latency | 3–5 µs | — |
Bandwidth way below expected? Check in this order:
- PCIe link state (LnkSta): Gen5/6 x16?
- NUMA pinning: taskset to the NIC's local CPUs?
- GID index: -x 3 for RoCEv2 IPv4?
- Ring sizes: ethtool -g, set 8192/8192
- PFC firing? Check port_xmit_wait or np_cnp_sent
- Source binding: --bind_source_ip on multi-rail?
12. NCCL environment variables
| Variable | Production value | Why |
|---|---|---|
| NCCL_IB_HCA | ^ib,^mlx5_1:1 (SR-IOV) or mlx5_0,mlx5_1,mlx5_2,mlx5_3 (hostnet) | Pin to RDMA devices |
| NCCL_IB_GID_INDEX | 3 | RoCEv2 IPv4 GID |
| NCCL_IB_TC | 106 | DSCP 26 + ECN ECT bits = lossless queue |
| NCCL_IB_SL | 5 | Service Level mapped to TC 106 |
| NCCL_CROSS_NIC | 0 | Each rail handles its own traffic |
| NCCL_IB_QPS_PER_CONNECTION | 16 | Spread QPs for full bandwidth |
| NCCL_MIN_NCHANNELS / NCCL_MAX_NCHANNELS | 16 | Match QP count |
| NCCL_IB_PCI_RELAXED_ORDERING | 1 | H100 perf (~10% gain) |
| NCCL_IB_TIMEOUT | 22 | Longer than default for PFC tolerance |
| NCCL_IB_RETRY_CNT | 7 | Transport retry count (3-bit field, so 7 is the maximum) |
| NCCL_NET_GDR_LEVEL | PHB | Max GPU↔NIC distance for GPUDirect RDMA |
| NCCL_SOCKET_IFNAME | bond0 | Bootstrap on control plane (NOT the RDMA NICs) |
| NCCL_DEBUG | INFO (first run) / unset (prod) | Verbose logging |
| NCCL_IB_ADAPTIVE_ROUTING | 1 | Adaptive-routing-capable fabrics |
| NCCL_IB_SPLIT_DATA_ON_QPS | 1 | Multi-QP load distribution |
Pitfall — bootstrap interface: NCCL_SOCKET_IFNAME must point at a non-RDMA interface (typically a TCP bond). If you accidentally point it at an RDMA NIC, bootstrap will work but you'll lose throughput because some bookkeeping traffic competes with the data plane.
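One way to catch this pitfall before a job starts is to check whether the bootstrap interface is backed by an RDMA device. The sketch assumes that RDMA-capable netdevs expose a device/infiniband directory under /sys/class/net, and that NCCL_SOCKET_IFNAME holds a single interface name; both function names are hypothetical.

```shell
# Return 0 if a netdev is backed by an RDMA device, 1 otherwise.
# Assumption: RDMA-capable netdevs have a device/infiniband dir in sysfs.
# $2 optionally overrides the sysfs root (useful for dry runs).
is_rdma_netdev() {
  [ -d "${2:-/sys/class/net}/$1/device/infiniband" ]
}

# Warn when the chosen bootstrap interface is one of the RDMA NICs.
check_bootstrap_if() {
  if is_rdma_netdev "$1" "$2"; then
    echo "WARN: $1 is an RDMA NIC; pick a control-plane interface"
    return 1
  fi
  echo "OK: $1 is not RDMA-backed"
}

# On a real host:
# check_bootstrap_if "${NCCL_SOCKET_IFNAME:-bond0}"
```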
13. Pre-flight checklist for a new RoCE host
Run all these on a freshly-imaged host before declaring it ready for production workloads:
Hardware
□ nvidia-smi -L → all 8 GPUs visible
□ ibv_devices → all backend NICs visible
□ ibstat | grep 'State:' → all backend NICs Active
□ lspci -vv ... LnkSta → Gen5 x16 (H100) or Gen6 x16 (B200/B300)
Drivers
□ lsmod | grep mlx5_core → loaded
□ lsmod | grep mlx5_ib → loaded
□ lsmod | grep nvidia_peermem → loaded (GPUDirect RDMA ready)
□ ethtool -i <nic> → driver + firmware same on all NICs
□ ofed_info -s → OFED version printed (if vendor OFED used)
Lossless config
□ mlnx_qos -i <nic> | grep 'trust state' → dscp
□ mlnx_qos -i <nic> | grep -A2 'PFC config' → enabled 0 0 0 1 0 0 0 0
□ cat /sys/class/net/<nic>/ecn/roce_np/enable/3 → 1
□ cat /sys/class/net/<nic>/ecn/roce_rp/enable/3 → 1
□ ethtool -g <nic> → 8192/8192
Multi-rail routing
□ ip rule show | grep '10010\|10011\|10012\|10013' → 4 rules present
□ ip route show table 101 | grep <subnet> → table populated
□ sysctl net.ipv4.conf.all.arp_ignore → 2
□ sysctl net.ipv4.conf.all.arp_announce → 1
□ sysctl net.ipv4.conf.all.rp_filter → 2
□ ip route get <peer_ip> from <local_ip> → via correct gateway, correct dev
Counter baseline
□ All hw_counters and counters at 0 (no traffic yet)
□ link_downed = 0 (no flaps since boot)
□ symbol_error = 0 (clean physical layer)
Cross-host (with peer)
□ ping -c 3 -I <local_rail_ip> <peer_rail_ip> → 0% loss
□ Repeat for all rails
□ ib_send_bw -d <nic> -R -F -D 10 <peer> → line rate
14. Troubleshooting recipes
"RDMA bandwidth way below expected"
- lspci -vv ... LnkSta → confirm Gen5/Gen6 x16
- numactl --hardware and match process to the NIC's NUMA node
- Counter pre/post diff → look for np_cnp_sent (congestion) or port_xmit_wait (PFC)
- Try -q 16 -s 131072 -t 512 instead of defaults
- Verify NCCL_IB_GID_INDEX=3
"ib_send_bw hangs or connection refused on multi-rail host"
Source binding is required. Use --bind_source_ip=<local_rail_ip> and -R. This is by design — control plane and data plane are routed separately on multi-rail hosts.
"Pod stuck Pending — nvidia.com/roce_hca_*: 0"
- Check SR-IOV operator status: kubectl get sriovnetworknodestates -n sriov-network-operator → syncStatus: Succeeded?
- Confirm node feature labels for RDMA / transport type
- Confirm VF count: cat /sys/class/net/<nic>/device/sriov_numvfs should print a value greater than 0
"IBV_WC_RETRY_EXC_ERR after a few minutes of running"
Two common causes:
- TC=0 lossy queue: set NCCL_IB_TC=106 and NCCL_IB_SL=5
- ARP ambiguity between co-located VFs on the same subnet: fix with arp_announce=1, arp_ignore=2
"Cross-rack latency above 10 µs"
- Confirm cross-rack traffic actually traverses the spine (TTL decode: 64 - TTL = hops)
- Check NUMA pinning of the process
- CPU governor → performance (sudo cpupower frequency-set -g performance)
- Check for PFC pause events on intermediate switches
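The TTL decode in the first step is a single subtraction. A tiny sketch, assuming the sender started at TTL 64 as the recipe does; `hops_from_ttl` is a made-up name and the optional second argument overrides the initial TTL.

```shell
# Routed hops between two hosts, from the TTL seen at the receiver.
# Assumes an initial TTL of 64 unless a second argument is given.
hops_from_ttl() {
  echo $(( ${2:-64} - $1 ))
}

hops_from_ttl 61   # prints 3

# On a real host, pull the TTL out of a ping reply:
# ttl=$(ping -c 1 -I 10.0.0.10 10.0.1.14 | sed -n 's/.*ttl=\([0-9]*\).*/\1/p')
# hops_from_ttl "$ttl"
```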
"All NICs show identical config but one rail is slow"
Check switch-side per-port counters. Most likely:
- Bad optic / cable on that rail
- PFC firing on that switch port specifically
- ECMP hash polarization sending too much traffic over that path
"Counters frozen at 0 during a benchmark"
- nvidia-smi dmon -s t → if rxpci/txpci show 0 MB/s during a GPUDirect run, that's correct (data bypasses host PCIe)
- Check port_xmit_data on the actual sending NIC, not the GPU
- Confirm the workload is sending: tcpdump on the NIC
"rp_cnp_ignored > 0"
DCQCN tuning bug. NIC is receiving CNPs but not reacting. Check:
- cat /sys/class/net/<nic>/ecn/roce_rp/enable/3 == 1
- Firmware version supports DCQCN on this priority
"Job hangs partway through with no clear error"
Likely PFC deadlock or a buffer-credit issue at the switch. From the host side:
- Pull port_xmit_wait deltas across all rails: large skew = one rail is being paused much more
- Check switch-side PFC counters at the leaf
- If one rail is stuck paused, drain the misbehaving downstream consumer
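To make the skew check in the first step concrete, the sketch below reads one "rail delta" pair per line and reports the most and least paused rail with their ratio. The function name and output format are assumptions of this sketch; what ratio counts as "large" is a site-specific judgment.

```shell
# Read "<rail> <port_xmit_wait delta>" lines; report the most and least
# paused rail and the ratio between them.
xmit_wait_skew() {
  awk 'NF == 2 {
    if (n == 0 || $2 + 0 > max + 0) { max = $2; maxn = $1 }
    if (n == 0 || $2 + 0 < min + 0) { min = $2; minn = $1 }
    n++
  }
  END {
    if (n == 0) exit 1
    ratio = (min + 0 > 0) ? sprintf("%.1f", max / min) : "n/a"
    printf "max=%s(%s) min=%s(%s) ratio=%s\n", maxn, max, minn, min, ratio
  }'
}

printf 'mlx5_0 100\nmlx5_1 5000\nmlx5_2 120\n' | xmit_wait_skew
# prints max=mlx5_1(5000) min=mlx5_0(100) ratio=50.0
```

Feed it deltas taken from /sys/class/infiniband/<dev>/ports/1/counters/port_xmit_wait before and after the hang window, one line per rail.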
15. The two-minute health check
When you SSH to a host and want the fastest possible "is this box healthy?" — paste this:
echo "=== HOSTNAME ===" ; hostname
echo "=== NICs ===" ; ibv_devices
echo "=== Link states ===" ; ibstat | grep -E "CA |State:"
echo "=== PFC + ECN ===" ; sudo mlnx_qos -i enp1s0f0 | grep -A2 "PFC config"
echo "=== Source rules ===" ; ip rule show | grep -E "100[0-9][0-9]"
echo "=== nvidia_peermem ===" ; lsmod | grep nvidia_peermem
echo "=== Link errors (should be 0) ===" ; \
for n in mlx5_0 mlx5_1 mlx5_2 mlx5_3; do \
echo "$n link_downed: $(cat /sys/class/infiniband/$n/ports/1/counters/link_downed 2>/dev/null)"; \
done
Green = all NICs PORT_ACTIVE, PFC enabled on priority 3, source rules present, nvidia_peermem loaded, link_downed=0 across the board.
If anything is red, walk back through the corresponding section above before running workloads.
What you should bookmark
If you remember nothing else, these are the snippets worth pinning:
- Confirm RoCE is on the wire
    ibv_devinfo -d mlx5_0 | grep -E 'link_layer|active_mtu|state'
- The two-minute health check (section 15 above): paste it on every fresh host.
- NCCL production env block
    export NCCL_IB_GID_INDEX=3
    export NCCL_IB_TC=106
    export NCCL_IB_SL=5
    export NCCL_IB_QPS_PER_CONNECTION=16
    export NCCL_CROSS_NIC=0
    export NCCL_SOCKET_IFNAME=bond0
- Counter pre/post diff (section 9): the cleanest way to attribute symptoms to a workload.
- Lossless config one-liner (mlnx_qos, the ecn sysfs tree, and ethtool take the netdev name, not the RDMA device name)
    sudo mlnx_qos -i enp1s0f0 --trust dscp --pfc 0,0,0,1,0,0,0,0 --prio_tc 0,0,0,3,0,0,0,0
    echo 1 | sudo tee /sys/class/net/enp1s0f0/ecn/roce_np/enable/3 \
      /sys/class/net/enp1s0f0/ecn/roce_rp/enable/3
    sudo ethtool -G enp1s0f0 rx 8192 tx 8192
- Source-based routing for one rail
    sudo ip route add 10.0.0.0/16 via 10.0.0.1 dev enp1s0f0 table 101
    sudo ip rule add from 10.0.0.10 lookup 101 priority 10010
    sudo sysctl -w net.ipv4.conf.all.arp_ignore=2 \
      net.ipv4.conf.all.arp_announce=1 \
      net.ipv4.conf.all.rp_filter=2
- Throughput sanity test (server and client must run the same perftest binary)
    # server
    ib_write_bw -d mlx5_0 -R -b -F -x 3 -q 16 -s 131072 -t 512 -D 30 --report_gbits
    # client
    ib_write_bw -d mlx5_0 -R -b -F -x 3 -q 16 -s 131072 -t 512 -D 30 \
      --report_gbits --bind_source_ip=10.0.0.10 <server_ip>
- The canonical lossless chain
    DSCP 26 → Priority 3 → TC 3 → Lossless Queue (PFC + ECN)
- Counters that mean "stop, investigate now"
    rp_cnp_ignored > 0 → DCQCN tuning bug
    req_rnr_retries_exceeded > 0 → QP died
    req_transport_retries_exceeded > 0 → QP died
    link_downed > 0 → bad optic/cable
- What confirms RoCE v2 specifically (vs IB or RoCE v1):
    - link_layer: Ethernet in ibv_devinfo
    - GID slot 3 = ::ffff:<your_ipv4> with type RoCE v2
    - tcpdump 'udp port 4791' captures traffic at line rate during a benchmark
That's the whole single-page reference. Walk the sections top-to-bottom on an unfamiliar host and you'll have a full operational picture in 5 minutes. Walk the troubleshooting recipes when something's wrong and you'll usually find the answer before you have to escalate.