
Host-Side Lossless

The switch-side of lossless RoCE — PFC PAUSE mechanics, ECN marking with WRED, DCQCN tuning, and buffer profiles — lives in the Switch QoS section. Go read that first if you haven't.

This page is the other half. Your job as the network engineer is to make sure the host side matches the switch side. Wrong DSCP marking, wrong trust mode, ECN disabled on the NIC, PFC priority mismatch — every one is silent. The traffic still flows. It just flows badly.

If you master this page, you can diagnose most host-side RoCE performance problems.


Why this page exists

The switch operator configures PFC on priority 3, WRED thresholds for ECN, and a buffer profile that reserves headroom for lossless traffic. None of that helps if:

  • Your host NIC is in PCP trust mode instead of DSCP — the switch's L3 markings get ignored.
  • Your DSCP→TC mapping puts RoCE in the lossy queue — PFC never fires for your traffic.
  • ECN is disabled on the NIC — the switch marks CE, the receiver shrugs, no CNP is generated, DCQCN never reacts.
  • Your ring buffers are too small — microbursts get dropped before PFC has a chance to back-pressure.
  • Your NCCL job sets the wrong traffic class — the NIC marks DSCP 0, the switch puts it in queue 0.

The host config has to agree with the switch. Two ends, same priority, same DSCP, same ECN behavior. It is only a handful of commands and sysfs writes, but every one of them is a footgun.


The host-side config trio

Three things must be true on the NIC for lossless RoCE to work:

  1. mlnx_qos — sets the DSCP→Priority→TC mapping and enables PFC on the lossless priority.
  2. /sys/class/net/<nic>/qos/trust — must be dscp, not pcp. Tells the NIC to classify ingress traffic by the L3 DSCP field instead of the L2 PCP bits.
  3. /sys/class/net/<nic>/ecn/roce_{np,rp}/enable/<prio> — enables the ECN Notification Point (the NIC generates CNPs when it sees CE-marked packets) and the Reaction Point (the NIC's DCQCN engine reacts to incoming CNPs by cutting rate).

The mechanics of why each of these matters — PFC PAUSE frames, ECN's CE bit, CNP generation, DCQCN's multiplicative decrease — are explained in the Switch QoS section. Here we focus on what to type on the host.

Trust mode — DSCP vs PCP

# Tell the NIC to classify ingress by L3 DSCP, not L2 PCP
sudo mlnx_qos -i ib0 --trust dscp

# Verify
cat /sys/class/net/ib0/qos/trust
# Expected: dscp

Why this matters: modern data center fabrics rely on L3 (DSCP) because PCP requires VLAN tags and doesn't survive routing boundaries. If trust mode is wrong, your carefully configured DSCP marks get ignored and the NIC falls back to whatever PCP it sees (often nothing useful).

DSCP→Priority→TC mapping

The standard convention is DSCP 26 → priority 3 → TC 3, which gives RoCE a dedicated lossless queue.

# Set DSCP→Priority mapping (DSCP 26 → priority 3)
sudo mlnx_qos -i ib0 --dscp2prio set,26,3

# Set Priority→TC mapping (priority 3 → TC 3)
sudo mlnx_qos -i ib0 --prio_tc 0,0,0,3,0,0,0,0

# Verify
sudo mlnx_qos -i ib0 | head -30

The Mellanox default is "groups of 8" — DSCP 24–31 all map to priority 3 by default (i.e., DSCP/8 = priority). This is robust against single-DSCP-value typos: even if some component marks DSCP 28 instead of 26, it still lands in priority 3. Don't override it unless you have a reason.
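
The groups-of-8 rule is plain integer division, and a few lines of shell make it easy to check where a given DSCP lands. This is a minimal sketch: default_prio is a hypothetical helper name (not part of mlnx_qos), and it models only the default mapping described above, not whatever is currently programmed on a NIC.

```shell
# default_prio DSCP
# Hypothetical helper: models the default "groups of 8" rule only
# (priority = DSCP / 8). It does not query the NIC.
default_prio() {
    local dscp=$1
    if [ "$dscp" -lt 0 ] || [ "$dscp" -gt 63 ]; then
        echo "DSCP must be 0..63, got: $dscp" >&2
        return 1
    fi
    echo $(( dscp / 8 ))   # shell arithmetic is integer division
}

# Every value in 24..31 lands in priority 3; 32 starts the next group.
for d in 24 26 28 31 32; do
    echo "DSCP $d -> priority $(default_prio "$d")"
done
```

If the value printed here disagrees with what sudo mlnx_qos -i ib0 shows, someone has overridden the default mapping on that NIC.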

Enable PFC on priority 3

# Enable PFC TX + RX on priority 3 only
sudo mlnx_qos -i ib0 --pfc 0,0,0,1,0,0,0,0

# Verify
sudo mlnx_qos -i ib0 | grep -A2 "PFC configuration"
# Expected:
# PFC configuration:
#     priority   0   1   2   3   4   5   6   7
#     enabled    0   0   0   1   0   0   0   0

Tight scope is intentional. Only enable PFC on the priority that carries RoCE. Enabling PFC on multiple priorities multiplies the risk of deadlock and PFC storms without giving you anything in return.
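
The eight-field --pfc argument is easy to mistype by one position. One way to avoid counting commas is to generate the list from the priorities you want. This is a sketch only: pfc_mask is a hypothetical helper, and its output is simply the string you would pass to mlnx_qos --pfc.

```shell
# pfc_mask PRIO...
# Hypothetical helper: prints the 8-field comma list for `mlnx_qos --pfc`,
# with a 1 in each listed priority position and 0 elsewhere.
pfc_mask() {
    local out="" p want bit
    for want in "$@"; do
        case $want in
            [0-7]) ;;
            *) echo "priority must be 0..7, got: $want" >&2; return 1 ;;
        esac
    done
    for p in 0 1 2 3 4 5 6 7; do
        bit=0
        for want in "$@"; do
            if [ "$want" = "$p" ]; then bit=1; fi
        done
        out="${out}${out:+,}${bit}"   # add a comma before every field but the first
    done
    echo "$out"
}
```

Usage would look like sudo mlnx_qos -i ib0 --pfc "$(pfc_mask 3)", which expands to the same 0,0,0,1,0,0,0,0 shown above.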

Enable ECN on priority 3 (NP and RP)

# Notification Point: NIC generates CNPs when it receives CE-marked packets
echo 1 | sudo tee /sys/class/net/ib0/ecn/roce_np/enable/3

# Reaction Point: NIC's DCQCN engine reacts to incoming CNPs (rate cut)
echo 1 | sudo tee /sys/class/net/ib0/ecn/roce_rp/enable/3

# Verify
cat /sys/class/net/ib0/ecn/roce_np/enable/3 # should be 1
cat /sys/class/net/ib0/ecn/roce_rp/enable/3 # should be 1

Both ends must have NP+RP enabled. RoCE is bidirectional — the same NIC is sometimes a sender (RP) and sometimes a receiver (NP). If you only enable NP, your NIC generates CNPs for others but ignores the ones it receives. Half a feedback loop = no feedback.

NVIDIA's default enables ECN on all 8 priorities, which is harmless when only priority 3 carries RoCE. A cleaner deployment restricts ECN to priority 3 only — purely a housekeeping preference.
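
Because a half-enabled loop fails silently, it can help to check both switches in one call. The sketch below assumes the sysfs layout used on this page; check_ecn is a hypothetical helper, and its optional third argument (a sysfs root) exists only so the function can be exercised against a scratch directory.

```shell
# check_ecn NIC PRIO [SYSFS_ROOT]
# Hypothetical helper: reports the NP and RP enable flags for one priority
# and returns non-zero unless both read 1.
check_ecn() {
    local nic=$1 prio=$2 root=${3:-/sys/class/net}
    local rc=0 role f val
    for role in roce_np roce_rp; do
        f="$root/$nic/ecn/$role/enable/$prio"
        val=$(cat "$f" 2>/dev/null || echo "missing")
        echo "$nic prio $prio $role=$val"
        if [ "$val" != "1" ]; then rc=1; fi
    done
    return $rc
}
```

For example, check_ecn ib0 3 || echo "ECN loop is not fully enabled on ib0" turns the two cat commands above into a single pass/fail check.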


Ring buffer tuning

NIC TX and RX rings absorb bursts that arrive faster than the NIC's processing pipeline. At 400 Gbps, microbursts are constant, and a small ring will overflow before PFC has a chance to back-pressure the upstream switch.

# Check current ring sizes
ethtool -g ib0

# Bump to maximum (typically 8192 on ConnectX-7)
sudo ethtool -G ib0 rx 8192 tx 8192

# Verify
ethtool -g ib0

For 400G RoCE, 8192/8192 is the working number. Default rings are often 1024–4096, which is fine for slower links but leaves no headroom at 400G. Bigger rings cost a small amount of memory and add a tiny amount of latency at the tail of the ring; for AI workloads where tail latency on RoCE is dominated by DCQCN backoff and not ring depth, this is the right trade.

If your NIC's max ring size is smaller (older ConnectX, or a constrained firmware setting), that's a firmware-level conversation with mlxconfig — not something you can fix with ethtool alone.
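
To see at a glance whether a NIC is already at its ceiling, the two sections of ethtool -g output can be compared. This is a rough sketch that assumes the usual "Pre-set maximums" / "Current hardware settings" layout; ring_headroom is a hypothetical helper that reads that output on stdin.

```shell
# ring_headroom  (reads `ethtool -g <nic>` output on stdin)
# Hypothetical helper: prints current vs maximum RX/TX ring sizes and
# returns non-zero if either ring is below its pre-set maximum.
ring_headroom() {
    awk '
        /^Pre-set maximums/          { section = "max" }
        /^Current hardware settings/ { section = "cur" }
        /^(RX|TX):/ {
            name = $1
            sub(":", "", name)
            val[section, name] = $2
        }
        END {
            rc = 0
            for (i = 1; i <= 2; i++) {
                n = (i == 1) ? "RX" : "TX"
                printf "%s current=%s max=%s\n", n, val["cur", n], val["max", n]
                if (val["cur", n] + 0 < val["max", n] + 0) rc = 1
            }
            exit rc
        }'
}
```

Typical use would be ethtool -g ib0 | ring_headroom || echo "ib0 rings are below max". The /^(RX|TX):/ pattern deliberately skips the "RX Mini" and "RX Jumbo" lines.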


The NCCL_IB_TC=106 math

This trips up everyone the first time.

When you configure RoCE via sysfs or NCCL environment variables, you set the full 8-bit TOS byte, not the 6-bit DSCP value. The TOS byte layout:

IPv4 TOS byte:
+---+---+---+---+---+---+---+---+
|     DSCP (6 bits)     |  ECN  |
|                       |  (2)  |
+---+---+---+---+---+---+---+---+
  7   6   5   4   3   2   1   0

DSCP lives in the top 6 bits. The bottom 2 bits are reserved for ECN flags. So to encode "DSCP 26 with ECN-capable transport," you compute:

TOS = (DSCP << 2) | ECN_bits
    = (26 << 2) | 0b10      ← ECT(0) bit pattern is 10 binary = 2 decimal
    = 104 | 2
    = 106

That's where NCCL_IB_TC=106 comes from. NCCL_IB_TC is the full TOS byte, not just DSCP.
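
The same arithmetic works directly in shell, which is handy when a job script or a sysfs write needs a TOS value and you only know the DSCP. A minimal sketch; the function names are hypothetical and the ECN argument is the raw two-bit value (0 = Not-ECT, 2 = ECT(0)).

```shell
# dscp_to_tos DSCP [ECN_BITS]
# Hypothetical helper: TOS = (DSCP << 2) | ECN_bits. ECN_BITS defaults to 0.
dscp_to_tos() {
    local dscp=$1 ecn=${2:-0}
    echo $(( (dscp << 2) | ecn ))
}

# Inverse direction: split a TOS byte back into its two fields.
tos_to_dscp() { echo $(( $1 >> 2 )); }
tos_to_ecn()  { echo $(( $1 & 3 )); }

echo "DSCP 26 + ECT(0) -> TOS $(dscp_to_tos 26 2)"
echo "TOS 26 -> DSCP $(tos_to_dscp 26), ECN bits $(tos_to_ecn 26)"
```

The second line shows why a bare 26 is a trap: read as a TOS byte, it decodes to DSCP 6.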

The same gotcha hits sysfs:

# Set the default DSCP for outgoing RoCE traffic on this NIC
echo 26 | sudo tee /sys/class/infiniband/mlx5_0/tc/1/traffic_class
# WRONG — this writes TOS=26, which is DSCP=6 (26 >> 2 = 6)

# Correct:
echo 104 | sudo tee /sys/class/infiniband/mlx5_0/tc/1/traffic_class
# This writes TOS=104, which is DSCP=26 with ECN bits cleared

# Or with ECT(0):
echo 106 | sudo tee /sys/class/infiniband/mlx5_0/tc/1/traffic_class
# TOS=106 = DSCP 26 + ECN ECT(0). This is what NCCL uses.

Cheat sheet:

| What you want | Compute | Value |
|---|---|---|
| DSCP 0 (best effort) | 0 << 2 | 0 |
| DSCP 26 (RoCE, no ECN) | 26 << 2 | 104 |
| DSCP 26 + ECT(0) (RoCE, ECN-capable) | (26 << 2) \| 2 | 106 |
| DSCP 46 (EF/voice) | 46 << 2 | 184 |

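The value that NCCL wants for RoCE is almost always 106. If you're typing 26 into a TOS or traffic_class field (NCCL_IB_TC, the traffic_class sysfs file), you're typing it wrong; a bare 26 belongs only where a raw DSCP is expected, such as --dscp2prio.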

Setting it for an NCCL job

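export NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3 # which NICs to use: RDMA device names, not netdev names (assumes the four rails enumerate as mlx5_0..mlx5_3; confirm with ibv_devices)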
export NCCL_IB_GID_INDEX=3 # RoCE v2 GID
export NCCL_IB_TC=106 # TOS = DSCP 26 + ECT(0)
mpirun -np 8 ./your-training-job

The same NCCL_IB_HCA / GID_INDEX setup is covered in NCCL and GPUDirect. The TC value is the lossless-specific knob that pairs the NCCL job to the switch's lossless queue.

Verifying DSCP on the wire

# Capture egress, look at TOS byte
sudo tcpdump -i ib0 -nn -e -v -c 10 'ip and host <peer>' | grep tos

# Look for "tos 0x68" — that's 104 decimal = DSCP 26
# Or "tos 0x6a" — that's 106 decimal = DSCP 26 + ECT(0)

If you see tos 0x00 on your RDMA packets, your DSCP marking didn't make it. Walk back: trust mode → DSCP-to-TC mapping → NCCL_IB_TC value → NIC's per-port traffic_class sysfs value.
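
Reading hex TOS values by eye gets tedious across many captures. The filter below is a sketch that pulls each "tos 0xNN" token out of tcpdump -v output and prints the decoded DSCP and ECN codepoint. decode_tos is a hypothetical helper; it relies on grep -o, which GNU and BSD grep both provide.

```shell
# decode_tos  (reads tcpdump -v output on stdin)
# Hypothetical helper: decodes every "tos 0xNN" token into DSCP and ECN.
decode_tos() {
    grep -o 'tos 0x[0-9a-fA-F][0-9a-fA-F]*' | while read -r _ hex; do
        tos=$(( $hex ))               # hex literal -> integer
        case $(( tos & 3 )) in        # low two bits are the ECN field
            0) ecn="Not-ECT" ;;
            1) ecn="ECT(1)" ;;
            2) ecn="ECT(0)" ;;
            3) ecn="CE" ;;
        esac
        echo "tos=$hex dscp=$(( tos >> 2 )) ecn=$ecn"
    done
}
```

Appending | decode_tos to the capture command in place of | grep tos gives one decoded line per packet; a CE codepoint in the output means a switch on the path marked that packet.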


The counter reference card

Every RoCE problem ends with a counter check. The counters you care about live in two separate sysfs trees, and people waste hours looking in the wrong one. Here's the canonical map.

Path 1: /sys/class/infiniband/<dev>/ports/1/hw_counters/

This is the RDMA-specific tree — DCQCN counters, RoCE retransmits, RNR errors. Everything that's specific to "this is RoCE, not just Ethernet" lives here.

| Counter | Meaning | Normal value | Watch when |
|---|---|---|---|
| np_ecn_marked_roce_packets | RX packets arriving with the CE bit set | Some during traffic | Climbing = network is congested upstream |
| np_cnp_sent | CNPs this NIC generated (acting as receiver/NP) | Tracks np_ecn_marked_roce_packets | Zero with marks > 0 means NP is disabled; check /sys/.../ecn/roce_np/enable/3 |
| rp_cnp_handled | CNPs this NIC reacted to (acting as sender/RP, DCQCN cut rate) | Tracks the peer's np_cnp_sent | Zero = RP is disabled, DCQCN is broken; check /sys/.../ecn/roce_rp/enable/3 |
| rp_cnp_ignored | CNPs received but ignored by hardware (driver/firmware bug) | Always zero | Any non-zero value = tuning or firmware bug, file a vendor case |
| roce_slow_restart | DCQCN slow-restart entries (rate cut hard, restart from minimum) | Occasional during congestion bursts | Frequent = severe sustained congestion or DCQCN too aggressive |
| out_of_sequence | RX packets arriving out of order | Near zero | Climbing = reorder or drop somewhere in the fabric (ECMP polarization, bad link) |
| packet_seq_err | RDMA sequence errors (lost packet detected) | Near zero | Climbing = drops are happening; lossless isn't lossless |
| roce_adp_retrans | Adaptive retransmits (RC transport recovering) | Near zero | Climbing = RDMA is retransmitting, which it shouldn't have to |
| req_rnr_retries_exceeded | RNR (Receiver Not Ready) retries exhausted | Zero | Any value = QP failed because peer ran out of receive WRs |
| req_transport_retries_exceeded | Transport retries exhausted | Zero | Any value = QP failed, ACK never arrived |
| local_ack_timeout_err | Local ACK timeout (ACK never came back) | Zero | Any value = QP failed, probably a peer or fabric issue |
| out_of_buffer | RX work request not posted in time by the app | Near zero | Climbing = app isn't posting receive buffers fast enough |
| rx_read_requests | RDMA READs received | Tracks workload | Use to verify traffic actually flowed |
| rx_write_requests | RDMA WRITEs received | Tracks workload | Use to verify traffic actually flowed |
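
The first three rows of the table combine into a simple decision rule for whether the DCQCN loop closed. The sketch below encodes that rule over counter deltas you have already collected. dcqcn_check is a hypothetical helper, and the third argument comes from the sending peer rather than from the local host.

```shell
# dcqcn_check MARKED CNP_SENT PEER_CNP_HANDLED
#   MARKED           = np_ecn_marked_roce_packets delta on the receiver
#   CNP_SENT         = np_cnp_sent delta on the receiver
#   PEER_CNP_HANDLED = rp_cnp_handled delta on the sender
# Hypothetical helper: returns non-zero when the deltas suggest a broken loop.
dcqcn_check() {
    local marked=$1 sent=$2 handled=$3
    if [ "$marked" -gt 0 ] && [ "$sent" -eq 0 ]; then
        echo "NP looks disabled: CE marks arrived but no CNPs were sent"
        return 1
    fi
    if [ "$sent" -gt 0 ] && [ "$handled" -eq 0 ]; then
        echo "RP looks disabled on the sender: CNPs sent but none handled"
        return 1
    fi
    echo "No evidence of a broken DCQCN loop in these deltas"
}
```

All-zero deltas pass this check, which only means there was no congestion to react to during the test.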

Path 2: /sys/class/infiniband/<dev>/ports/1/counters/

This is the standard IB-style counter tree — bytes, packets, link state, physical-layer errors. These are the IB-Spec-defined counters that exist for any IB or RoCE device.

| Counter | Meaning | Watch when |
|---|---|---|
| port_xmit_data | Total data sent, counted in 4-octet units (octets / 4, IB convention) | Confirm traffic on egress; multiply by 4 for actual bytes |
| port_rcv_data | Total data received (octets / 4) | Confirm traffic on ingress |
| port_xmit_packets | Packets sent | High-level traffic stat |
| port_rcv_packets | Packets received | High-level traffic stat |
| port_xmit_discards | TX-side drops | Any value = check switch ingress and NIC ring sizes |
| port_xmit_wait | TX wait cycles (proxy for time spent paused by PFC) | Climbing = backpressure / PFC fired on this NIC |
| port_rcv_errors | RX errors | Any value = check cabling / FCS / link quality |
| link_downed | Number of link-down events since boot | Any value = bad optic, cable, fiber, or peer port |
| link_error_recovery | Recovery events (link bounce) | Any value = marginal physical layer |
| symbol_error | Symbol errors | Climbing = dirty optic or bad fiber |
| excessive_buffer_overrun_errors | RX buffer overrun | Any value = host CPU or PCIe bottleneck, not a fabric problem |
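
The octets / 4 convention is the usual source of "my throughput is four times too low" reports. The sketch below converts a port_xmit_data (or port_rcv_data) delta into Gbps. words_to_gbps is a hypothetical helper, and awk is used because shell integer arithmetic would truncate the result.

```shell
# words_to_gbps DELTA_WORDS SECONDS
# Hypothetical helper: DELTA_WORDS is a port_xmit_data or port_rcv_data
# delta (4-octet units); prints average throughput in Gbps.
words_to_gbps() {
    awk -v words="$1" -v secs="$2" 'BEGIN {
        bits = words * 4 * 8              # 4 octets per unit, 8 bits per octet
        printf "%.2f\n", bits / secs / 1000000000
    }'
}

# A delta of 750,000,000,000 units over 60 s is a full 400 Gbps line rate.
words_to_gbps 750000000000 60
```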

Per-priority Ethernet counters via ethtool

There's also a third place to look — ethtool -S <nic> exposes per-priority PFC and byte counters that aren't in either sysfs tree:

ethtool -S ib0 | grep -iE "prio_?3|pfc|ecn|cnp|tx_pause|rx_pause"

Key ones:

| Counter | Meaning |
|---|---|
| tx_prio3_packets / tx_prio3_bytes | Egress traffic on priority 3 (your RoCE) |
| rx_prio3_packets / rx_prio3_bytes | Ingress traffic on priority 3 |
| tx_prio3_pause | PAUSE frames this NIC sent upstream (i.e., this NIC was congested) |
| rx_prio3_pause | PAUSE frames this NIC received (i.e., the upstream switch was congested and asked us to stop) |
| rx_prio3_pause_duration | Total microseconds this NIC was paused |

Reading PFC direction: tx_prio3_pause climbing means we asked the switch to stop sending to us — our NIC is the bottleneck. rx_prio3_pause climbing means the switch asked us to stop sending — the switch or downstream is the bottleneck.

Which tree do I look at?

  • "Is DCQCN working?" → hw_counters/np_cnp_sent and rp_cnp_handled.
  • "Are we dropping?" → hw_counters/packet_seq_err, counters/port_xmit_discards, counters/port_rcv_errors.
  • "Is PFC firing?" → ethtool -S | grep pause.
  • "Did traffic actually flow?" → counters/port_xmit_data (multiply by 4) and hw_counters/rx_write_requests.
  • "Is the physical link healthy?" → counters/link_downed, symbol_error, link_error_recovery.

Pre/post-test counter diff pattern

A single counter snapshot is useless. Counters are cumulative since boot. What you need is the delta across your test — what changed, not what's there.

Use this pattern every time you benchmark RoCE:

#!/bin/bash
# Pre/post counter diff for RoCE testing
# Usage: pass the peer IP as the first argument

DEV=mlx5_0
PORT=1
NIC=ib0                                   # netdev that backs $DEV
PEER=${1:?"usage: $0 <peer_ip>"}
PRE_DIR=/tmp/counters_pre
POST_DIR=/tmp/counters_post
SYSFS=/sys/class/infiniband/$DEV/ports/$PORT
ETHTOOL_FILTER="prio_?3|pfc|pause|ecn|cnp"

# === STEP 1: Capture baseline ===
mkdir -p "$PRE_DIR" "$POST_DIR"           # POST_DIR is written to in step 3
for f in "$SYSFS"/hw_counters/* "$SYSFS"/counters/*; do
    cat "$f" > "$PRE_DIR/$(basename "$f").pre"
done
ethtool -S "$NIC" | grep -iE "$ETHTOOL_FILTER" > "$PRE_DIR/ethtool.pre"

# === STEP 2: Run your workload ===
ib_send_bw -d "$DEV" -x 3 -D 60 "$PEER"
# Or: mpirun ... your NCCL training job ...

# === STEP 3: Capture post-test, compute deltas ===
# print_deltas SUBDIR: show every counter in $SYSFS/SUBDIR that increased
print_deltas() {
    local f name pre post delta
    for f in "$SYSFS/$1"/*; do
        name=$(basename "$f")
        pre=$(cat "$PRE_DIR/$name.pre")
        post=$(cat "$f")
        delta=$((post - pre))
        if [ "$delta" -gt 0 ]; then
            echo "  $name: pre=$pre post=$post DELTA=$delta"
        fi
    done
}

echo "=== hw_counters deltas (RDMA-specific) ==="
print_deltas hw_counters

echo ""
echo "=== counters deltas (IB-spec) ==="
print_deltas counters

echo ""
echo "=== ethtool (PFC + per-priority) deltas ==="
ethtool -S "$NIC" | grep -iE "$ETHTOOL_FILTER" > "$POST_DIR/ethtool.post"
diff "$PRE_DIR/ethtool.pre" "$POST_DIR/ethtool.post" | grep -E "^[<>]"

Interpreting the deltas

After a healthy 60-second ib_send_bw test, you should see:

  • port_xmit_data and port_rcv_data deltas matching your expected bandwidth × duration (remember: multiply by 4 for actual bytes).
  • tx_prio3_bytes and/or rx_prio3_bytes ≫ tx_prio0_bytes: your traffic landed in the right priority.
  • np_ecn_marked_roce_packets and np_cnp_sent showing modest activity — some marking is normal under sustained load.
  • rp_cnp_handled on the sender side roughly matching the peer's np_cnp_sent — the feedback loop closed.
  • rx_prio3_pause low or zero, tx_prio3_pause low or zero — PFC was the safety net, not the primary mechanism.

After an unhealthy test:

  • packet_seq_err non-zero → drops happened. Check headroom and physical layer.
  • rp_cnp_ignored non-zero → DCQCN tuning or firmware bug.
  • rx_prio3_pause_duration high (more than ~5% of test time) → PFC was firing constantly; ECN isn't keeping queues short enough.
  • tx_prio0_bytes significant while tx_prio3_bytes is small → your traffic landed in the wrong queue. DSCP marking or trust mode is broken.
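
The "~5% of test time" threshold for pause duration is easier to apply as a number than by eye. This is a sketch under the assumption, stated in the table above, that rx_prio3_pause_duration is reported in microseconds. pause_pct is a hypothetical helper, and the 5% cutoff is this page's rule of thumb rather than a vendor limit.

```shell
# pause_pct DURATION_US TEST_SECONDS
# Hypothetical helper: DURATION_US is the rx_prio3_pause_duration delta in
# microseconds. Prints the share of the test spent paused and flags > 5%.
pause_pct() {
    awk -v us="$1" -v secs="$2" 'BEGIN {
        pct = us / (secs * 1000000) * 100
        printf "%.1f%% %s\n", pct, (pct > 5 ? "HIGH" : "ok")
    }'
}

pause_pct 6000000 60   # 6 s paused out of 60 s
pause_pct 600000 60    # 0.6 s paused out of 60 s
```

The first call reports 10.0% and is flagged HIGH; the second reports 1.0% and passes.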

CREATE / VALIDATE / VERIFY / USE — the host-side playbook

This is the operational sequence. CREATE configures. VALIDATE confirms config matches design. VERIFY proves it behaves losslessly under load. USE is how applications consume it.

Watch the create-to-verify loop run on the rockynet lab simulator — trust mode flipped to DSCP, PFC and ECN enabled on priority 3, snapshot the counters at zero, drive ib_write_bw, then watch np_cnp_sent and rp_cnp_handled climb. That's DCQCN doing its job:

[Recording: MODULE host-networking · LAB 1. Every command, every counter, every output.]

A. CREATE — configure the NIC

Loop this over ib0, ib1, ib2, ib3 (or whatever your backend NICs are named). In production, this is automated by Ansible / Salt / Puppet — never typed by hand per host.

NIC=ib0 # repeat for each backend NIC

# 1. Trust mode = DSCP (L3 classification)
sudo mlnx_qos -i $NIC --trust dscp

# 2. DSCP 26 → priority 3 → TC 3
sudo mlnx_qos -i $NIC --dscp2prio set,26,3
sudo mlnx_qos -i $NIC --prio_tc 0,0,0,3,0,0,0,0

# 3. PFC enabled on priority 3 only
sudo mlnx_qos -i $NIC --pfc 0,0,0,1,0,0,0,0

# 4. ECN on priority 3, both NP and RP
echo 1 | sudo tee /sys/class/net/$NIC/ecn/roce_np/enable/3
echo 1 | sudo tee /sys/class/net/$NIC/ecn/roce_rp/enable/3

# 5. Max out the rings
sudo ethtool -G $NIC rx 8192 tx 8192
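
Since the same steps repeat for every rail, wrapping them in a function with a dry-run switch lets you review the exact commands before touching a NIC. This is a sketch, not a replacement for real config management. run and apply_lossless are hypothetical helper names, and the values are the ones used on this page.

```shell
# run CMD...: execute the command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"          # display only; quoting is flattened in this view
    else
        "$@"
    fi
}

# apply_lossless NIC: the CREATE steps from this page for one NIC.
apply_lossless() {
    local nic=$1
    run sudo mlnx_qos -i "$nic" --trust dscp
    run sudo mlnx_qos -i "$nic" --dscp2prio set,26,3
    run sudo mlnx_qos -i "$nic" --prio_tc 0,0,0,3,0,0,0,0
    run sudo mlnx_qos -i "$nic" --pfc 0,0,0,1,0,0,0,0
    run sudo sh -c "echo 1 > /sys/class/net/$nic/ecn/roce_np/enable/3"
    run sudo sh -c "echo 1 > /sys/class/net/$nic/ecn/roce_rp/enable/3"
    run sudo ethtool -G "$nic" rx 8192 tx 8192
}
```

A first pass would be DRY_RUN=1 followed by for nic in ib0 ib1 ib2 ib3; do apply_lossless "$nic"; done, then the same loop with DRY_RUN unset once the printed commands look right.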

B. VALIDATE — confirm config matches design

# Full mlnx_qos sanity check
sudo mlnx_qos -i ib0 | head -30
# Look for:
# PFC configuration: enabled 0 0 0 1 0 0 0 0
# tc-trust: dscp
# DSCP-priority: dscp 26 -> priority 3
# priority 3 -> TC 3

# ECN enable check (all priorities)
for p in 0 1 2 3 4 5 6 7; do
    np=$(cat /sys/class/net/ib0/ecn/roce_np/enable/$p 2>/dev/null)
    rp=$(cat /sys/class/net/ib0/ecn/roce_rp/enable/$p 2>/dev/null)
    echo "Priority $p: NP=$np RP=$rp"
done
# Expected: Priority 3 NP=1 RP=1

# Trust mode
cat /sys/class/net/ib0/qos/trust
# Expected: dscp

# Cross-NIC consistency — all 4 NICs identical
for nic in ib0 ib1 ib2 ib3; do
    echo "=== $nic ==="
    sudo mlnx_qos -i $nic | grep -E "PFC|enabled|trust|26.*3"
done
# Drift between NICs = bug
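
Eyeballing four blocks of output for differences does not scale. One way to mechanize the drift check is to save each NIC's output to a file and compare every file against the first. The sketch below assumes the saved output contains no per-NIC identifiers (filter them out first if yours does); find_drift is a hypothetical helper.

```shell
# find_drift REF_FILE FILE...
# Hypothetical helper: byte-compares each saved config dump against the
# first one and returns non-zero if any of them differs.
find_drift() {
    local ref=$1 f rc=0
    shift
    for f in "$@"; do
        if ! cmp -s "$ref" "$f"; then
            echo "DRIFT: $f differs from $ref"
            rc=1
        fi
    done
    return $rc
}
```

For example, after saving each rail with sudo mlnx_qos -i $nic > /tmp/qos.$nic, running find_drift /tmp/qos.ib0 /tmp/qos.ib1 /tmp/qos.ib2 /tmp/qos.ib3 names any rail that no longer matches ib0.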

C. VERIFY — prove lossless under load

Validation confirms config matches the design. Verification confirms the design actually delivers losslessness when traffic flows. Use the pre/post counter diff pattern above with a real workload:

# Peer: ib_send_bw -d mlx5_0 -x 3
# This host:
ib_send_bw -d mlx5_0 -x 3 -D 60 <peer_ip>

Then check the deltas. The healthy pattern is lots of bytes, modest ECN/CNP activity, near-zero PFC. The unhealthy pattern is lots of PFC, low CNP — ECN didn't fire in time, PFC caught the overflow, throughput collapsed during pauses.

Interpretation table

| Counter delta | Expected | If different |
|---|---|---|
| tx_prio3_bytes ≫ tx_prio0_bytes | Yes | DSCP marking broken; check trust mode + traffic_class sysfs |
| np_cnp_sent | Modest, tracks ECN marks | Zero with marks > 0 = NP disabled |
| rp_cnp_handled | Tracks peer's np_cnp_sent | Low = RP disabled, DCQCN broken |
| rx_prio3_pause_duration | < 5% of test time | High = ECN tuning too lax; switch is hitting XOFF instead of ECN-marking |
| packet_seq_err | Zero | Any value = actual drops happened; "lossless" isn't |
| port_xmit_wait | Stable | Climbing = PFC backpressure is real and frequent |

D. USE — how applications consume lossless RDMA

D.1 RDMA apps mark DSCP from the NIC's per-port default

When an app creates a QP, the NIC stamps egress packets with whichever TOS value is configured in the per-port traffic_class, unless the app overrides it by setting traffic_class in the QP's address attributes. Most apps (MPI, NCCL) don't set it explicitly; they inherit the NIC default.

# Set the per-port default for outgoing RoCE traffic
echo 106 | sudo tee /sys/class/infiniband/mlx5_0/tc/1/traffic_class
# 106 = TOS byte for DSCP 26 + ECT(0)

# Verify
cat /sys/class/infiniband/mlx5_0/tc/1/traffic_class

D.2 NCCL — explicit override

For NCCL training jobs, override per-job:

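export NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3 # 4-rail backend: RDMA device names, not netdev names (assumes the rails enumerate as mlx5_0..mlx5_3; confirm with ibv_devices)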
export NCCL_IB_GID_INDEX=3 # RoCE v2 GID
export NCCL_IB_TC=106 # DSCP 26 + ECT(0)
mpirun -np 8 ./train.py

NCCL_IB_TC=106 is the canonical setting and the one to memorize. If you see NCCL_IB_TC=26 somewhere in someone's job script, it is wrong — they're getting DSCP 6, not DSCP 26.

D.3 Troubleshoot when DSCP marking is wrong

# Capture egress, look at TOS byte
sudo tcpdump -i ib0 -nn -e -v -c 10 'ip and host <peer>' | grep tos

# If tos != 0x68 (DSCP 26 no ECN) or 0x6a (DSCP 26 + ECT(0)):
# 1. Check trust mode: cat /sys/class/net/ib0/qos/trust (must be 'dscp')
# 2. Check DSCP-to-TC mapping: sudo mlnx_qos -i ib0 | grep dscp
# 3. Check NCCL_IB_TC value (or the per-port traffic_class sysfs)
# 4. Check the app isn't overriding traffic_class with the wrong value
# 5. Check the app opened the QP on the expected NIC (multi-rail issue)

A sample healthy configuration snapshot

For a sample host running a 4-NIC backend at 400G, the configuration looks like this when correctly deployed:

| Setting | Value | Why it's right |
|---|---|---|
| DCBX mode | OS-controlled | Modern default: the OS owns QoS, not switch-pushed DCBX |
| Trust mode | DSCP | L3 classification is the modern path |
| DSCP→Priority mapping | groups-of-8 (DSCP 24–31 → priority 3) | Robust against single-DSCP typos; DSCP 26 lands in priority 3 |
| Priority→TC mapping | priority 3 → TC 3 | Lossless queue isolated |
| PFC enabled priorities | 3 only | Tight scope, no deadlock surface area |
| Lossless buffer (priority 3) | 270 KB on a sample host (7 m cable) | Adequate for short DAC runs; verify under load for longer cables |
| ECN on priority 3 | NP=1, RP=1 | Feedback loop closed both ways |
| Ring sizes (TX/RX) | 8192 / 8192 | At the NIC max for 400G |
| TSA (Transmission Selection) | vendor | NIC-default scheduling, no custom carving |
| 4-NIC consistency | Identical config | No drift between rails |

After a fresh boot with no workload, every hw_counter should read zero. The only non-zero value at quiescence is lifespan = 12, which is the kernel refresh interval for the counter file, not a traffic statistic. Once workloads run, you watch the deltas — not the absolute values.


What you should remember

  • Switch-side config + host-side config = lossless. Either side wrong, the whole thing is silently broken. The switch side does PFC PAUSE, ECN marking, and DCQCN-tuned buffer profiles. The host side does mlnx_qos + sysfs + ring tuning.
  • Trust mode must be dscp. PCP trust on a modern fabric means your DSCP marks get ignored. Check /sys/class/net/<nic>/qos/trust.
  • PFC enable only on priority 3. Tight scope = no deadlock. --pfc 0,0,0,1,0,0,0,0 and nothing else.
  • ECN needs NP and RP enabled on priority 3. NP-only means you generate CNPs but ignore the ones you receive. Half a loop is no loop.
  • NCCL_IB_TC=106, not 26. The TOS byte is (DSCP << 2) | ECN_bits. 26 left-shifted by 2 = 104, plus ECT(0) bit = 106. The 26-without-the-shift mistake is the most common RoCE config bug in the wild.
  • Counters live in two trees. hw_counters/ for RDMA-specific stuff (CNP, DCQCN, sequence errors). counters/ for IB-spec basics (bytes, link state, physical). PFC counters are in ethtool -S only.
  • Always use pre/post-test deltas. Cumulative counters since boot tell you nothing about your test. Snapshot before, snapshot after, diff.
  • Healthy load: lots of bytes, modest ECN/CNP, near-zero PFC. Unhealthy load: PFC pause duration ≫ a few % of test time means ECN isn't firing in time — that's a switch-side tuning issue, but you can see it from the host counters.
  • Drift across rails is a bug. All 4 NICs on a multi-rail host must show identical mlnx_qos output. If one drifted, find out why and re-apply the config.

Next: NCCL and GPUDirect. How training collectives actually use these lossless rails: which NICs NCCL picks, GID indices, GPUDirect RDMA, and the env vars that turn a multi-rail host into one fat pipe.