Host-Side Lossless
The switch-side of lossless RoCE — PFC PAUSE mechanics, ECN marking with WRED, DCQCN tuning, and buffer profiles — lives in the Switch QoS section. Go read that first if you haven't.
This page is the other half. Your job as the network engineer is to make sure the host side matches the switch side. Wrong DSCP marking, wrong trust mode, ECN disabled on the NIC, PFC priority mismatch — every one is silent. The traffic still flows. It just flows badly.
If you master this page, you can diagnose most host-side RoCE performance problems.
Why this page exists
The switch operator configures PFC on priority 3, WRED thresholds for ECN, and a buffer profile that reserves headroom for lossless traffic. None of that helps if:
- Your host NIC is in PCP trust mode instead of DSCP — the switch's L3 markings get ignored.
- Your DSCP→TC mapping puts RoCE in the lossy queue — PFC never fires for your traffic.
- ECN is disabled on the NIC — the switch marks CE, the receiver shrugs, no CNP is generated, DCQCN never reacts.
- Your ring buffers are too small — microbursts get dropped before PFC has a chance to back-pressure.
- Your NCCL job sets the wrong traffic class — the NIC marks DSCP 0, the switch puts it in queue 0.
The host config has to agree with the switch. Two ends, same priority, same DSCP, same ECN behavior. It comes down to a handful of commands and sysfs writes, but every one of them is a footgun.
The host-side config trio
Three things must be true on the NIC for lossless RoCE to work:
- mlnx_qos: sets the DSCP→Priority→TC mapping and enables PFC on the lossless priority.
- /sys/class/net/<nic>/qos/trust: must be dscp, not pcp. Tells the NIC to classify ingress traffic by the L3 DSCP field instead of the L2 PCP bits.
- /sys/class/net/<nic>/ecn/roce_{np,rp}/enable/<prio>: enables the ECN Notification Point (the NIC generates CNPs when it sees CE-marked packets) and the Reaction Point (the NIC's DCQCN engine reacts to incoming CNPs by cutting rate).
The mechanics of why each of these matters — PFC PAUSE frames, ECN's CE bit, CNP generation, DCQCN's multiplicative decrease — are explained in the Switch QoS section. Here we focus on what to type on the host.
Trust mode — DSCP vs PCP
# Tell the NIC to classify ingress by L3 DSCP, not L2 PCP
sudo mlnx_qos -i ib0 --trust dscp
# Verify
cat /sys/class/net/ib0/qos/trust
# Expected: dscp
Why this matters: modern data center fabrics rely on L3 (DSCP) because PCP requires VLAN tags and doesn't survive routing boundaries. If trust mode is wrong, your carefully configured DSCP marks get ignored and the NIC falls back to whatever PCP it sees (often nothing useful).
DSCP→Priority→TC mapping
The standard convention is DSCP 26 → priority 3 → TC 3, which gives RoCE a dedicated lossless queue.
# Set DSCP→Priority mapping (DSCP 26 → priority 3)
sudo mlnx_qos -i ib0 --dscp2prio set,26,3
# Set Priority→TC mapping (priority 3 → TC 3)
sudo mlnx_qos -i ib0 --prio_tc 0,0,0,3,0,0,0,0
# Verify
sudo mlnx_qos -i ib0 | head -30
The Mellanox default is "groups of 8" — DSCP 24–31 all map to priority 3 by default (i.e., DSCP/8 = priority). This is robust against single-DSCP-value typos: even if some component marks DSCP 28 instead of 26, it still lands in priority 3. Don't override it unless you have a reason.
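As a quick sanity check on the groups-of-8 default described above, the mapping is an integer division by 8 (a right shift by 3). The sketch below only models that default table; it does not read what is currently programmed on a NIC, and the helper name default_prio_for_dscp is made up for illustration.

```bash
#!/usr/bin/env bash
# Model of the default "groups of 8" DSCP->priority table: priority = DSCP / 8.
# default_prio_for_dscp is a hypothetical helper name, not an mlnx_qos feature.
default_prio_for_dscp() {
    local dscp=$1
    if (( dscp < 0 || dscp > 63 )); then
        echo "DSCP must be 0-63" >&2
        return 1
    fi
    echo $(( dscp >> 3 ))
}

# DSCP 24 through 31 all land in priority 3 under the default table.
for dscp in 24 26 28 31; do
    echo "DSCP $dscp -> priority $(default_prio_for_dscp "$dscp")"
done
```

If a component marks DSCP 28 instead of 26, the model shows it still resolving to priority 3, which is the robustness argument made above.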
Enable PFC on priority 3
# Enable PFC TX + RX on priority 3 only
sudo mlnx_qos -i ib0 --pfc 0,0,0,1,0,0,0,0
# Verify
sudo mlnx_qos -i ib0 | grep -A2 "PFC configuration"
# Expected:
# PFC configuration:
# priority 0 1 2 3 4 5 6 7
# enabled 0 0 0 1 0 0 0 0
Tight scope is intentional. Only enable PFC on the priority that carries RoCE. Enabling PFC on multiple priorities multiplies the risk of deadlock and PFC storms without giving you anything in return.
Enable ECN on priority 3 (NP and RP)
# Notification Point: NIC generates CNPs when it receives CE-marked packets
echo 1 | sudo tee /sys/class/net/ib0/ecn/roce_np/enable/3
# Reaction Point: NIC's DCQCN engine reacts to incoming CNPs (rate cut)
echo 1 | sudo tee /sys/class/net/ib0/ecn/roce_rp/enable/3
# Verify
cat /sys/class/net/ib0/ecn/roce_np/enable/3 # should be 1
cat /sys/class/net/ib0/ecn/roce_rp/enable/3 # should be 1
Both ends must have NP+RP enabled. RoCE is bidirectional — the same NIC is sometimes a sender (RP) and sometimes a receiver (NP). If you only enable NP, your NIC generates CNPs for others but ignores the ones it receives. Half a feedback loop = no feedback.
NVIDIA's default enables ECN on all 8 priorities, which is harmless when only priority 3 carries RoCE. A cleaner deployment restricts ECN to priority 3 only — purely a housekeeping preference.
Ring buffer tuning
NIC TX and RX rings absorb bursts that arrive faster than the NIC's processing pipeline. At 400 Gbps, microbursts are constant, and a small ring will overflow before PFC has a chance to back-pressure the upstream switch.
# Check current ring sizes
ethtool -g ib0
# Bump to maximum (typically 8192 on ConnectX-7)
sudo ethtool -G ib0 rx 8192 tx 8192
# Verify
ethtool -g ib0
For 400G RoCE, 8192/8192 is the working number. Default rings are often 1024–4096, which is fine for slower links but leaves no headroom at 400G. Bigger rings cost a small amount of memory and add a tiny amount of latency at the tail of the ring; for AI workloads where tail latency on RoCE is dominated by DCQCN backoff and not ring depth, this is the right trade.
If your NIC's max ring size is smaller (older ConnectX, or a constrained firmware setting), that's a firmware-level conversation with mlxconfig — not something you can fix with ethtool alone.
The NCCL_IB_TC=106 math
This trips up everyone the first time.
When you configure RoCE via sysfs or NCCL environment variables, you set the full 8-bit TOS byte, not the 6-bit DSCP value. The TOS byte layout:
IPv4 TOS byte:
  7   6   5   4   3   2   1   0
+---+---+---+---+---+---+---+---+
|     DSCP (6 bits)     |ECN (2)|
+---+---+---+---+---+---+---+---+
DSCP lives in the top 6 bits. The bottom 2 bits are reserved for ECN flags. So to encode "DSCP 26 with ECN-capable transport," you compute:
TOS = (DSCP << 2) | ECN_bits
    = (26 << 2) | 0b10    ← ECT(0) is binary 10 = decimal 2
    = 104 | 2
    = 106
That's where NCCL_IB_TC=106 comes from. NCCL_IB_TC is the full TOS byte, not just DSCP.
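The arithmetic above fits in a few lines of shell. This is a minimal sketch; tos_from_dscp is a hypothetical helper name, and the ECN argument is the 2-bit codepoint (0 = Not-ECT, 2 = ECT(0)).

```bash
#!/usr/bin/env bash
# Build the 8-bit TOS byte from a 6-bit DSCP and a 2-bit ECN codepoint.
# tos_from_dscp is a hypothetical helper name used only for this sketch.
tos_from_dscp() {
    local dscp=$1 ecn=${2:-0}
    (( dscp >= 0 && dscp <= 63 )) || { echo "DSCP must be 0-63" >&2; return 1; }
    (( ecn >= 0 && ecn <= 3 ))    || { echo "ECN must be 0-3" >&2; return 1; }
    echo $(( (dscp << 2) | ecn ))
}

tos_from_dscp 26 0   # prints 104: DSCP 26, ECN bits cleared
tos_from_dscp 26 2   # prints 106: DSCP 26 + ECT(0)
```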
The same gotcha hits sysfs:
# Set the default DSCP for outgoing RoCE traffic on this NIC
# WRONG: this writes TOS=26, which decodes as DSCP 6 (26 >> 2 = 6)
echo 26 | sudo tee /sys/class/infiniband/mlx5_0/tc/1/traffic_class
# Correct: TOS=104 is DSCP 26 with the ECN bits cleared
echo 104 | sudo tee /sys/class/infiniband/mlx5_0/tc/1/traffic_class
# Or with ECT(0): TOS=106 is DSCP 26 + ECT(0). This is what NCCL uses.
echo 106 | sudo tee /sys/class/infiniband/mlx5_0/tc/1/traffic_class
Cheat sheet:
| What you want | Compute | Value |
|---|---|---|
| DSCP 0 (best effort) | 0 << 2 | 0 |
| DSCP 26 (RoCE, no ECN) | 26 << 2 | 104 |
| DSCP 26 + ECT(0) (RoCE, ECN-capable) | (26 << 2) + 2 | 106 |
| DSCP 46 (EF/voice) | 46 << 2 | 184 |
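The value that NCCL wants for RoCE is almost always 106. If you're typing 26 into a TOS or traffic-class field, you're typing wrong; a bare 26 is only correct where a raw DSCP is expected, such as --dscp2prio.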
Setting it for an NCCL job
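export NCCL_IB_HCA=ib0,ib1,ib2,ib3 # which NICs to use (must match the names ibv_devices reports)
export NCCL_IB_GID_INDEX=3 # RoCE v2 GID
export NCCL_IB_TC=106 # TOS = DSCP 26 + ECT(0)
# Multi-node: make sure your launcher forwards these variables to remote ranks
mpirun -np 8 ./your-training-job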
The same NCCL_IB_HCA / GID_INDEX setup is covered in NCCL and GPUDirect. The TC value is the lossless-specific knob that pairs the NCCL job to the switch's lossless queue.
Verifying DSCP on the wire
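# Capture egress, look at TOS byte (options first, filter expression last)
sudo tcpdump -i ib0 -nn -e -v -c 10 'ip and host <peer>' | grep tos
# Look for "tos 0x68": that's 104 decimal = DSCP 26
# Or "tos 0x6a": that's 106 decimal = DSCP 26 + ECT(0)
# RoCE is offloaded by the NIC, so a plain netdev capture can come up empty;
# if it does, capture via your driver's RDMA sniffer support or a switch mirror.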
If you see tos 0x00 on your RDMA packets, your DSCP marking didn't make it. Walk back: trust mode → DSCP-to-TC mapping → NCCL_IB_TC value → NIC's per-port traffic_class sysfs value.
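When reading captures, it helps to go the other way and split a TOS byte back into its DSCP and ECN parts. A minimal sketch, assuming the value is copied by hand from the tcpdump output; decode_tos is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Split a TOS byte (as printed by tcpdump -v, e.g. 0x6a) into DSCP and ECN.
# decode_tos is a hypothetical helper name for this sketch.
decode_tos() {
    local tos=$(( $1 ))          # bash arithmetic accepts 0x.. hex and decimal
    (( tos >= 0 && tos <= 255 )) || { echo "TOS must be 0-255" >&2; return 1; }
    echo "dscp=$(( tos >> 2 )) ecn=$(( tos & 3 ))"
}

decode_tos 0x6a   # prints dscp=26 ecn=2
decode_tos 26     # prints dscp=6 ecn=2 (the unshifted-26 mistake)
```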
The counter reference card
Every RoCE problem ends with a counter check. The counters you care about live in two separate sysfs trees, and people waste hours looking in the wrong one. Here's the canonical map.
Path 1: /sys/class/infiniband/<dev>/ports/1/hw_counters/
This is the RDMA-specific tree — DCQCN counters, RoCE retransmits, RNR errors. Everything that's specific to "this is RoCE, not just Ethernet" lives here.
| Counter | Meaning | Normal value | Watch when |
|---|---|---|---|
| np_ecn_marked_roce_packets | RX packets arriving with the CE bit set | Some during traffic | Climbing = network is congested upstream |
| np_cnp_sent | CNPs this NIC generated (acting as receiver/NP) | Tracks np_ecn_marked_roce_packets | Zero with marks > 0 means NP is disabled; check /sys/.../ecn/roce_np/enable/3 |
| rp_cnp_handled | CNPs this NIC reacted to (acting as sender/RP, DCQCN cut rate) | Tracks the peer's np_cnp_sent | Zero = RP is disabled, DCQCN is broken; check /sys/.../ecn/roce_rp/enable/3 |
| rp_cnp_ignored | CNPs received but ignored by the Reaction Point | Always zero | Any non-zero value = first confirm /sys/.../ecn/roce_rp/enable/3 is 1; if it is, suspect a tuning or firmware bug and file a vendor case |
| roce_slow_restart | DCQCN slow-restart entries (rate cut hard, restart from minimum) | Occasional during congestion bursts | Frequent = severe sustained congestion or DCQCN too aggressive |
| out_of_sequence | RX packets arriving out of order | Near zero | Climbing = reorder or drop somewhere in the fabric (ECMP polarization, bad link) |
| packet_seq_err | RDMA sequence errors (lost packet detected) | Near zero | Climbing = drops are happening; lossless isn't lossless |
| roce_adp_retrans | Adaptive retransmits (RC transport recovering) | Near zero | Climbing = RDMA is retransmitting, which it shouldn't have to |
| req_rnr_retries_exceeded | RNR (Receiver Not Ready) retries exhausted | Zero | Any value = QP failed because peer ran out of receive WRs |
| req_transport_retries_exceeded | Transport retries exhausted | Zero | Any value = QP failed, ACK never arrived |
| local_ack_timeout_err | Local ACK timeout (ACK never came back) | Zero | Any value = QP failed, probably a peer or fabric issue |
| out_of_buffer | RX work request not posted in time by the app | Near zero | Climbing = app isn't posting receive buffers fast enough |
| rx_read_requests | RDMA READs received | Tracks workload | Use to verify traffic actually flowed |
| rx_write_requests | RDMA WRITEs received | Tracks workload | Use to verify traffic actually flowed |
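The np_ecn_marked_roce_packets and np_cnp_sent rows can be turned into a small check. This is a sketch of the rule of thumb in the table, not a vendor diagnostic; np_status is a hypothetical helper and both inputs are per-test deltas that you supply.

```bash
#!/usr/bin/env bash
# Interpret the notification-point counters from the table above.
# np_status is a hypothetical helper; both inputs are per-test deltas.
np_status() {
    local ecn_marked=$1 cnp_sent=$2
    if (( ecn_marked == 0 )); then
        echo "no CE marks seen: no congestion signalled to this NIC"
    elif (( cnp_sent == 0 )); then
        echo "CE marks but no CNPs: check ecn/roce_np/enable/3"
    else
        echo "NP active: $cnp_sent CNPs for $ecn_marked CE-marked packets"
    fi
}

np_status 0 0
np_status 1200 0
np_status 1200 300
```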
Path 2: /sys/class/infiniband/<dev>/ports/1/counters/
This is the standard IB-style counter tree — bytes, packets, link state, physical-layer errors. These are the IB-Spec-defined counters that exist for any IB or RoCE device.
| Counter | Meaning | Watch when |
|---|---|---|
| port_xmit_data | Total data sent (in units of 4 octets, IB convention) | Confirm traffic on egress; multiply by 4 for actual bytes |
| port_rcv_data | Total data received (in units of 4 octets) | Confirm traffic on ingress |
| port_xmit_packets | Packets sent | High-level traffic stat |
| port_rcv_packets | Packets received | High-level traffic stat |
| port_xmit_discards | TX-side drops | Any value = check switch ingress and NIC ring sizes |
| port_xmit_wait | TX wait cycles (proxy for time spent paused by PFC) | Climbing = backpressure / PFC fired on this NIC |
| port_rcv_errors | RX errors | Any value = check cabling / FCS / link quality |
| link_downed | Number of link-down events since boot | Any value = bad optic, cable, fiber, or peer port |
| link_error_recovery | Recovery events (link bounce) | Any value = marginal physical layer |
| symbol_error | Symbol errors | Climbing = dirty optic or bad fiber |
| excessive_buffer_overrun_errors | RX buffer overrun | Any value = host CPU or PCIe bottleneck, not a fabric problem |
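Because port_xmit_data and port_rcv_data count in units of 4 octets, it is easy to misread a delta by a factor of four. The sketch below converts a delta into bytes and an average rate; data_delta_to_rate is a hypothetical helper, and the example numbers are invented to land on a round 400 Gbit/s.

```bash
#!/usr/bin/env bash
# Convert a port_xmit_data / port_rcv_data delta (units of 4 octets) into
# bytes and average Gbit/s over the measurement window.
# data_delta_to_rate is a hypothetical helper name for this sketch.
data_delta_to_rate() {
    local delta_words=$1 seconds=$2
    awk -v w="$delta_words" -v s="$seconds" 'BEGIN {
        if (s <= 0) { print "seconds must be > 0" > "/dev/stderr"; exit 1 }
        bytes = w * 4
        printf "bytes=%.0f gbps=%.2f\n", bytes, bytes * 8 / s / 1e9
    }'
}

# Example: a delta of 750,000,000,000 counter units over a 60-second run.
data_delta_to_rate 750000000000 60   # prints bytes=3000000000000 gbps=400.00
```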
Per-priority Ethernet counters via ethtool
There's also a third place to look — ethtool -S <nic> exposes per-priority PFC and byte counters that aren't in either sysfs tree:
ethtool -S ib0 | grep -iE "prio3|pfc|ecn|cnp|pause"
Key ones:
| Counter | Meaning |
|---|---|
| tx_prio3_packets / tx_prio3_bytes | Egress traffic on priority 3 (your RoCE) |
| rx_prio3_packets / rx_prio3_bytes | Ingress traffic on priority 3 |
| tx_prio3_pause | PAUSE frames this NIC sent upstream (i.e., this NIC was congested) |
| rx_prio3_pause | PAUSE frames this NIC received (i.e., the upstream switch was congested and asked us to stop) |
| rx_prio3_pause_duration | Total microseconds this NIC was paused |
Reading PFC direction: tx_prio3_pause climbing means we asked the switch to stop sending to us — our NIC is the bottleneck. rx_prio3_pause climbing means the switch asked us to stop sending — the switch or downstream is the bottleneck.
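The direction rule above can be written down as a small classifier. This is a sketch of the reasoning, not a tool that ships with any driver; pfc_direction is a hypothetical helper and its inputs are the tx/rx pause-frame deltas from one test run.

```bash
#!/usr/bin/env bash
# Classify PFC direction from the tx/rx pause-frame deltas of one test run.
# pfc_direction is a hypothetical helper; inputs are deltas, not raw counters.
pfc_direction() {
    local tx_delta=$1 rx_delta=$2
    if (( tx_delta == 0 && rx_delta == 0 )); then
        echo "no PFC activity"
    elif (( tx_delta > rx_delta )); then
        echo "mostly tx pause: this NIC asked the switch to stop (host is the bottleneck)"
    elif (( rx_delta > tx_delta )); then
        echo "mostly rx pause: the switch asked this NIC to stop (fabric is the bottleneck)"
    else
        echo "pause in both directions: check host and fabric"
    fi
}

pfc_direction 0 0
pfc_direction 500 3
pfc_direction 2 900
```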
Which tree do I look at?
- "Is DCQCN working?" → hw_counters/np_cnp_sent and rp_cnp_handled.
- "Are we dropping?" → hw_counters/packet_seq_err, counters/port_xmit_discards, counters/port_rcv_errors.
- "Is PFC firing?" → ethtool -S <nic> | grep pause.
- "Did traffic actually flow?" → counters/port_xmit_data (multiply by 4) and hw_counters/rx_write_requests.
- "Is the physical link healthy?" → counters/link_downed, symbol_error, link_error_recovery.
Pre/post-test counter diff pattern
A single counter snapshot is useless. Counters are cumulative since boot. What you need is the delta across your test — what changed, not what's there.
Use this pattern every time you benchmark RoCE:
#!/bin/bash
# Pre/post counter diff for RoCE testing
# Usage: pass the peer IP as the first argument
PEER=${1:?"usage: $0 <peer_ip>"}
DEV=mlx5_0
PORT=1
NIC=ib0
PRE_DIR=/tmp/counters_pre
POST_DIR=/tmp/counters_post
SYSFS=/sys/class/infiniband/$DEV/ports/$PORT
ETHTOOL_FILTER="prio3|pfc|pause|ecn|cnp"

# === STEP 1: Capture baseline ===
# Keep the two trees in separate subdirectories so names cannot collide
mkdir -p "$PRE_DIR/hw_counters" "$PRE_DIR/counters" "$POST_DIR"
for tree in hw_counters counters; do
    for f in "$SYSFS/$tree"/*; do
        cat "$f" > "$PRE_DIR/$tree/$(basename "$f")"
    done
done
ethtool -S "$NIC" | grep -iE "$ETHTOOL_FILTER" > "$PRE_DIR/ethtool.pre"

# === STEP 2: Run your workload ===
ib_send_bw -d "$DEV" -x 3 -D 60 "$PEER"
# Or: mpirun ... your NCCL training job ...

# === STEP 3: Capture post-test, compute deltas ===
# Print only the counters in one tree that increased during the test
print_deltas() {
    local tree=$1 f name pre post delta
    for f in "$SYSFS/$tree"/*; do
        name=$(basename "$f")
        pre=$(cat "$PRE_DIR/$tree/$name")
        post=$(cat "$f")
        delta=$((post - pre))
        [ "$delta" -gt 0 ] && echo " $name: pre=$pre post=$post DELTA=$delta"
    done
}
echo "=== hw_counters deltas (RDMA-specific) ==="
print_deltas hw_counters
echo ""
echo "=== counters deltas (IB-spec) ==="
print_deltas counters
echo ""
echo "=== ethtool (PFC + per-priority) deltas ==="
ethtool -S "$NIC" | grep -iE "$ETHTOOL_FILTER" > "$POST_DIR/ethtool.post"
diff "$PRE_DIR/ethtool.pre" "$POST_DIR/ethtool.post" | grep -E "^[<>]"
Interpreting the deltas
After a healthy 60-second ib_send_bw test, you should see:
- port_xmit_data and port_rcv_data deltas matching your expected bandwidth × duration (remember: multiply by 4 for actual bytes).
- tx_prio3_bytes and/or rx_prio3_bytes ≫ tx_prio0_bytes: your traffic landed in the right priority.
- np_ecn_marked_roce_packets and np_cnp_sent showing modest activity: some marking is normal under sustained load.
- rp_cnp_handled on the sender side roughly matching the peer's np_cnp_sent: the feedback loop closed.
- rx_prio3_pause low or zero, tx_prio3_pause low or zero: PFC was the safety net, not the primary mechanism.
After an unhealthy test:
- packet_seq_err non-zero → drops happened. Check headroom and physical layer.
- rp_cnp_ignored non-zero → CNPs are arriving but being discarded. Confirm RP is enabled on priority 3, then suspect DCQCN tuning or a firmware bug.
- rx_prio3_pause_duration high (more than ~5% of test time) → PFC was firing constantly; ECN isn't keeping queues short enough.
- tx_prio0_bytes significant while tx_prio3_bytes is small → your traffic landed in the wrong queue. DSCP marking or trust mode is broken.
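The ~5% rule of thumb for pause duration is easier to apply once the microsecond counter is expressed as a share of the test window. A minimal sketch, assuming you pass in the rx_prio3_pause_duration delta and the test length yourself; pause_pct and pause_verdict are hypothetical helper names.

```bash
#!/usr/bin/env bash
# Express a pause-duration delta (microseconds) as a percentage of test time.
# pause_pct and pause_verdict are hypothetical helpers for this sketch.
pause_pct() {
    local pause_us=$1 test_seconds=$2
    awk -v p="$pause_us" -v t="$test_seconds" 'BEGIN {
        if (t <= 0) { exit 1 }
        printf "%.2f\n", p / (t * 1e6) * 100
    }'
}

# Apply the rough 5% budget from the text above.
pause_verdict() {
    local pct
    pct=$(pause_pct "$1" "$2") || return 1
    if awk -v x="$pct" 'BEGIN { exit !(x > 5) }'; then
        echo "$pct% paused: PFC is doing the work ECN should be doing"
    else
        echo "$pct% paused: within the rough 5% budget"
    fi
}

pause_verdict 600000 60    # 0.6 s paused in a 60 s test
pause_verdict 6000000 60   # 6 s paused in a 60 s test
```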
CREATE / VALIDATE / VERIFY / USE — the host-side playbook
This is the operational sequence. CREATE configures. VALIDATE confirms config matches design. VERIFY proves it behaves losslessly under load. USE is how applications consume it.
On the rockynet lab simulator, the create-to-verify loop runs like this: trust mode is flipped to DSCP, PFC and ECN are enabled on priority 3, the counters are snapshotted at zero, ib_write_bw drives traffic, and np_cnp_sent and rp_cnp_handled climb. That's DCQCN doing its job.
A. CREATE — configure the NIC
Loop this over ib0, ib1, ib2, ib3 (or whatever your backend NICs are named). In production, this is automated by Ansible / Salt / Puppet — never typed by hand per host.
NIC=ib0 # repeat for each backend NIC
# 1. Trust mode = DSCP (L3 classification)
sudo mlnx_qos -i $NIC --trust dscp
# 2. DSCP 26 → priority 3 → TC 3
sudo mlnx_qos -i $NIC --dscp2prio set,26,3
sudo mlnx_qos -i $NIC --prio_tc 0,0,0,3,0,0,0,0
# 3. PFC enabled on priority 3 only
sudo mlnx_qos -i $NIC --pfc 0,0,0,1,0,0,0,0
# 4. ECN on priority 3, both NP and RP
echo 1 | sudo tee /sys/class/net/$NIC/ecn/roce_np/enable/3
echo 1 | sudo tee /sys/class/net/$NIC/ecn/roce_rp/enable/3
# 5. Max out the rings
sudo ethtool -G $NIC rx 8192 tx 8192
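The "loop this over every backend NIC" instruction can be sketched as a wrapper around the same commands. This is an illustration, not a replacement for your config management: NICS, DRY_RUN, run, and configure_nic are hypothetical names, and dry-run is the default so the script only prints the plan until you set DRY_RUN=0.

```bash
#!/usr/bin/env bash
# Apply the CREATE steps to every backend NIC. With DRY_RUN=1 (the default
# here) commands are only printed, so the plan can be reviewed first.
# NICS, DRY_RUN, run and configure_nic are hypothetical names for this sketch.
NICS=${NICS:-"ib0 ib1 ib2 ib3"}
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

configure_nic() {
    local nic=$1
    run sudo mlnx_qos -i "$nic" --trust dscp
    run sudo mlnx_qos -i "$nic" --dscp2prio set,26,3
    run sudo mlnx_qos -i "$nic" --prio_tc 0,0,0,3,0,0,0,0
    run sudo mlnx_qos -i "$nic" --pfc 0,0,0,1,0,0,0,0
    run sudo sh -c "echo 1 > /sys/class/net/$nic/ecn/roce_np/enable/3"
    run sudo sh -c "echo 1 > /sys/class/net/$nic/ecn/roce_rp/enable/3"
    run sudo ethtool -G "$nic" rx 8192 tx 8192
}

for nic in $NICS; do
    configure_nic "$nic"
done
```

Running it as-is prints seven commands per NIC; running it with DRY_RUN=0 executes them in the same order as the manual steps above.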
B. VALIDATE — confirm config matches design
# Full mlnx_qos sanity check
sudo mlnx_qos -i ib0 | head -30
# Look for:
# PFC configuration: enabled 0 0 0 1 0 0 0 0
# tc-trust: dscp
# DSCP-priority: dscp 26 -> priority 3
# priority 3 -> TC 3
# ECN enable check (all priorities)
for p in 0 1 2 3 4 5 6 7; do
np=$(cat /sys/class/net/ib0/ecn/roce_np/enable/$p 2>/dev/null)
rp=$(cat /sys/class/net/ib0/ecn/roce_rp/enable/$p 2>/dev/null)
echo "Priority $p: NP=$np RP=$rp"
done
# Expected: Priority 3 NP=1 RP=1
# Trust mode
cat /sys/class/net/ib0/qos/trust
# Expected: dscp
# Cross-NIC consistency — all 4 NICs identical
for nic in ib0 ib1 ib2 ib3; do
echo "=== $nic ==="
sudo mlnx_qos -i $nic | grep -E "PFC|enabled|trust|26.*3"
done
# Drift between NICs = bug
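Eyeballing four grep outputs works, but drift is easier to catch mechanically. The sketch below compares saved per-NIC dumps byte for byte; find_drift is a hypothetical helper, and it assumes the dumps do not embed the interface name (if yours do, strip that line before comparing).

```bash
#!/usr/bin/env bash
# Report which NICs' saved QoS dumps differ from the first one.
# find_drift is a hypothetical helper; each argument is a file holding the
# output of "mlnx_qos -i <nic>" for one rail.
find_drift() {
    local reference=$1 rc=0 f
    shift
    for f in "$@"; do
        if ! cmp -s "$reference" "$f"; then
            echo "DRIFT: $f differs from $reference"
            rc=1
        fi
    done
    return $rc
}

# Usage sketch (paths are illustrative):
#   for nic in ib0 ib1 ib2 ib3; do sudo mlnx_qos -i "$nic" > "/tmp/qos.$nic"; done
#   find_drift /tmp/qos.ib0 /tmp/qos.ib1 /tmp/qos.ib2 /tmp/qos.ib3
```

The function returns non-zero when any rail differs, so it can gate a deployment step or a health check.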
C. VERIFY — prove lossless under load
Validation confirms config matches the design. Verification confirms the design actually delivers losslessness when traffic flows. Use the pre/post counter diff pattern above with a real workload:
# Peer: ib_send_bw -d mlx5_0 -x 3
# This host:
ib_send_bw -d mlx5_0 -x 3 -D 60 <peer_ip>
Then check the deltas. The healthy pattern is lots of bytes, modest ECN/CNP activity, near-zero PFC. The unhealthy pattern is lots of PFC, low CNP — ECN didn't fire in time, PFC caught the overflow, throughput collapsed during pauses.
Interpretation table
| Counter delta | Expected | If different |
|---|---|---|
| tx_prio3_bytes ≫ tx_prio0_bytes | Yes | DSCP marking broken: check trust mode + traffic_class sysfs |
| np_cnp_sent | Modest, tracks ECN marks | Zero with marks > 0 = NP disabled |
| rp_cnp_handled | Tracks peer's np_cnp_sent | Low = RP disabled, DCQCN broken |
| rx_prio3_pause_duration | < 5% of test time | High = ECN tuning too lax; switch is hitting XOFF instead of ECN-marking |
| packet_seq_err | Zero | Any value = actual drops happened; "lossless" isn't |
| port_xmit_wait | Stable | Climbing = PFC backpressure is real and frequent |
D. USE — how applications consume lossless RDMA
D.1 RDMA apps mark DSCP from the NIC's per-port default
When an app creates a QP, the NIC stamps egress packets with the TOS value configured in the per-port traffic_class, unless the app sets its own traffic_class in the QP's address attributes. Most apps (MPI, NCCL) don't set it unless told to; they inherit the NIC default.
# Set the per-port default for outgoing RoCE traffic
echo 106 | sudo tee /sys/class/infiniband/mlx5_0/tc/1/traffic_class
# 106 = TOS byte for DSCP 26 + ECT(0)
# Verify
cat /sys/class/infiniband/mlx5_0/tc/1/traffic_class
D.2 NCCL — explicit override
For NCCL training jobs, override per-job:
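export NCCL_IB_HCA=ib0,ib1,ib2,ib3 # 4-rail backend (must match the names ibv_devices reports)
export NCCL_IB_GID_INDEX=3 # RoCE v2 GID
export NCCL_IB_TC=106 # DSCP 26 + ECT(0)
# Multi-node: make sure your launcher forwards these variables to remote ranks
mpirun -np 8 ./train.py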
NCCL_IB_TC=106 is the canonical setting and the one to memorize. If you see NCCL_IB_TC=26 somewhere in someone's job script, it is wrong — they're getting DSCP 6, not DSCP 26.
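A job script can catch the unshifted-26 mistake before launch. This is a minimal pre-flight sketch; check_nccl_tc is a hypothetical helper, and the expected DSCP defaults to 26 only because that is the convention used on this page.

```bash
#!/usr/bin/env bash
# Pre-flight check: does NCCL_IB_TC decode to the DSCP we expect?
# check_nccl_tc is a hypothetical helper; expected DSCP defaults to 26.
check_nccl_tc() {
    local want_dscp=${1:-26} tc=${NCCL_IB_TC:-}
    if [ -z "$tc" ]; then
        echo "NCCL_IB_TC is unset" >&2
        return 1
    fi
    local dscp=$(( tc >> 2 ))
    if (( dscp != want_dscp )); then
        echo "NCCL_IB_TC=$tc encodes DSCP $dscp, expected $want_dscp" >&2
        return 1
    fi
    echo "NCCL_IB_TC=$tc OK (DSCP $dscp, ECN bits $(( tc & 3 )))"
}

# Usage sketch inside a job script:
#   export NCCL_IB_TC=106
#   check_nccl_tc || exit 1
```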
D.3 Troubleshoot when DSCP marking is wrong
# Capture egress, look at TOS byte (options first, filter expression last)
sudo tcpdump -i ib0 -nn -e -v -c 10 'ip and host <peer>' | grep tos
# If tos != 0x68 (DSCP 26 no ECN) or 0x6a (DSCP 26 + ECT(0)):
# 1. Check trust mode: cat /sys/class/net/ib0/qos/trust (must be 'dscp')
# 2. Check DSCP-to-TC mapping: sudo mlnx_qos -i ib0 | grep dscp
# 3. Check NCCL_IB_TC value (or the per-port traffic_class sysfs)
# 4. Check the app isn't overriding traffic_class with the wrong value
# 5. Check the app opened the QP on the expected NIC (multi-rail issue)
A sample healthy configuration snapshot
For a sample host running a 4-NIC backend at 400G, the configuration looks like this when correctly deployed:
| Setting | Value | Why it's right |
|---|---|---|
| DCBX mode | OS-controlled | Modern default — the OS owns QoS, not switch-pushed DCBX |
| Trust mode | DSCP | L3 classification is the modern path |
| DSCP→Priority mapping | groups-of-8 (DSCP 24–31 → priority 3) | Robust against single-DSCP typos; DSCP 26 lands in priority 3 |
| Priority→TC mapping | priority 3 → TC 3 | Lossless queue isolated |
| PFC enabled priorities | 3 only | Tight scope, no deadlock surface area |
| Lossless buffer (priority 3) | 270 KB on a sample host (7m cable) | Adequate for short DAC runs; verify under load for longer cables |
| ECN on priority 3 | NP=1, RP=1 | Feedback loop closed both ways |
| Ring sizes (TX/RX) | 8192 / 8192 | At the NIC max for 400G |
| TSA (Transmission Selection) | vendor | NIC-default scheduling, no custom carving |
| 4-NIC consistency | Identical config | No drift between rails |
After a fresh boot with no workload, every hw_counter should read zero. The only non-zero value at quiescence is lifespan = 12, which is the kernel refresh interval for the counter file, not a traffic statistic. Once workloads run, you watch the deltas — not the absolute values.
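The quiescence check can be scripted by listing every counter that is not zero and skipping lifespan. A minimal sketch; nonzero_counters is a hypothetical helper, and the default path assumes the mlx5_0 / port 1 layout used elsewhere on this page.

```bash
#!/usr/bin/env bash
# List counters in a directory that are non-zero, ignoring "lifespan"
# (the refresh interval, not a traffic statistic).
# nonzero_counters is a hypothetical helper name for this sketch.
nonzero_counters() {
    local dir=$1 f name value
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        name=$(basename "$f")
        [ "$name" = "lifespan" ] && continue
        value=$(cat "$f")
        [ "$value" != "0" ] && echo "$name=$value"
    done
    return 0
}

# Empty output on an idle, freshly booted host means the tree is quiet.
nonzero_counters "${1:-/sys/class/infiniband/mlx5_0/ports/1/hw_counters}"
```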
What you should remember
- Switch-side config + host-side config = lossless. Either side wrong, the whole thing is silently broken. The switch side does PFC PAUSE, ECN marking, and DCQCN-tuned buffer profiles. The host side does mlnx_qos + sysfs + ring tuning.
- Trust mode must be dscp. PCP trust on a modern fabric means your DSCP marks get ignored. Check /sys/class/net/<nic>/qos/trust.
- PFC enable only on priority 3. Tight scope = no deadlock. --pfc 0,0,0,1,0,0,0,0 and nothing else.
- ECN needs NP and RP enabled on priority 3. NP-only means you generate CNPs but ignore the ones you receive. Half a loop is no loop.
- NCCL_IB_TC=106, not 26. The TOS byte is (DSCP << 2) | ECN_bits. 26 left-shifted by 2 = 104, plus ECT(0) bit = 106. The 26-without-the-shift mistake is the most common RoCE config bug in the wild.
- Counters live in two trees. hw_counters/ for RDMA-specific stuff (CNP, DCQCN, sequence errors). counters/ for IB-spec basics (bytes, link state, physical). PFC counters are in ethtool -S only.
- Always use pre/post-test deltas. Cumulative counters since boot tell you nothing about your test. Snapshot before, snapshot after, diff.
- Healthy load: lots of bytes, modest ECN/CNP, near-zero PFC. Unhealthy load: PFC pause duration ≫ a few % of test time means ECN isn't firing in time — that's a switch-side tuning issue, but you can see it from the host counters.
- Drift across rails is a bug. All 4 NICs on a multi-rail host must show identical mlnx_qos output. If one drifted, find out why and re-apply the config.
Next: NCCL and GPUDirect → — how training collectives actually use these lossless rails: which NICs NCCL picks, GID indices, GPUDirect RDMA, and the env vars that turn a multi-rail host into one fat pipe.