Host-Side Lossless

The switch side of lossless RoCE (PFC PAUSE mechanics, ECN marking with WRED, DCQCN tuning, and buffer profiles) lives in the Switch QoS section. Read that first if you haven't.

This page is the other half. Your job as the network engineer is to make sure the host side matches the switch side. Wrong DSCP marking, wrong trust mode, ECN disabled on the NIC, PFC priority mismatch — every one is silent. The traffic still flows. It just flows badly.

If you master this page, you can diagnose 80% of RoCE performance problems.

After this page, you'll be able to

Configure the host-side trio — mlnx_qos --trust dscp, the DSCP 26 → priority 3 → TC 3 mapping, --pfc 0,0,0,1,0,0,0,0, and ECN NP+RP enabled in /sys/.../ecn/roce_{np,rp}/enable/3.
Do the NCCL_IB_TC=106 math — the TOS byte is (DSCP << 2) | ECN_bits, so DSCP 26 + ECT(0) = 106, and NCCL_IB_TC=26 is the classic bug that ships DSCP 6.
Read the counter trees — hw_counters/ for np_cnp_sent / rp_cnp_handled / packet_seq_err, counters/ for bytes and link health, and ethtool -S for per-priority PFC pause direction.
Run the pre/post-test delta pattern — snapshot before and after a workload, then read a healthy result (lots of bytes, modest ECN/CNP, near-zero PFC) versus an unhealthy one.

Why this page exists

The switch operator configures PFC on priority 3, WRED thresholds for ECN, and a buffer profile that reserves headroom for lossless traffic. None of that helps if:

Your host NIC is in PCP trust mode instead of DSCP — the switch's L3 markings get ignored.
Your DSCP→TC mapping puts RoCE in the lossy queue — PFC never fires for your traffic.
ECN is disabled on the NIC — the switch marks CE, the receiver shrugs, no CNP is generated, DCQCN never reacts.
Your ring buffers are too small — microbursts get dropped before PFC has a chance to back-pressure.
Your NCCL job sets the wrong traffic class — the NIC marks DSCP 0, the switch puts it in queue 0.

The host config has to agree with the switch. Two ends, same priority, same DSCP, same ECN behavior. It is only a handful of commands and sysfs writes, but every one of them is a footgun.

The host-side config trio

Three things must be true on the NIC for lossless RoCE to work:

mlnx_qos — sets the DSCP→Priority→TC mapping and enables PFC on the lossless priority.
/sys/class/net/<nic>/qos/trust — must be dscp, not pcp. Tells the NIC to classify ingress traffic by the L3 DSCP field instead of the L2 PCP bits.
/sys/class/net/<nic>/ecn/roce_{np,rp}/enable/<prio> — enables the ECN Notification Point (the NIC generates CNPs when it sees CE-marked packets) and the Reaction Point (the NIC's DCQCN engine reacts to incoming CNPs by cutting rate).

The mechanics of why each of these matters — PFC PAUSE frames, ECN's CE bit, CNP generation, DCQCN's multiplicative decrease — are explained in the Switch QoS section. Here we focus on what to type on the host.

Trust mode — DSCP vs PCP

# Tell the NIC to classify ingress by L3 DSCP, not L2 PCP
sudo mlnx_qos -i ib0 --trust dscp

# Verify
cat /sys/class/net/ib0/qos/trust
# Expected: dscp

Why this matters: modern data center fabrics rely on L3 (DSCP) because PCP requires VLAN tags and doesn't survive routing boundaries. If trust mode is wrong, your carefully configured DSCP marks get ignored and the NIC falls back to whatever PCP it sees (often nothing useful).

DSCP→Priority→TC mapping

The standard convention is DSCP 26 → priority 3 → TC 3, which gives RoCE a dedicated lossless queue.

# Set DSCP→Priority mapping (DSCP 26 → priority 3)
sudo mlnx_qos -i ib0 --dscp2prio set,26,3

# Set Priority→TC mapping (priority 3 → TC 3)
sudo mlnx_qos -i ib0 --prio_tc 0,0,0,3,0,0,0,0

# Verify
sudo mlnx_qos -i ib0 | head -30

The Mellanox default is "groups of 8": priority = DSCP / 8 using integer division, so DSCP 24–31 all map to priority 3. This is robust against single-DSCP-value typos: even if some component marks DSCP 28 instead of 26, it still lands in priority 3. The --dscp2prio command above only restates that default for DSCP 26; don't override the default mapping unless you have a reason.
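The groups-of-8 rule is plain integer division, so you can sanity-check any DSCP value before it reaches the NIC. A minimal sketch, assuming the default mapping described above; default_prio_for_dscp is a hypothetical helper name, not part of mlnx_qos:

```shell
# Hypothetical helper: priority the groups-of-8 default assigns to a DSCP value.
# Integer division by 8 drops the low three bits of the 6-bit DSCP.
default_prio_for_dscp() {
  echo $(( $1 / 8 ))
}

for dscp in 24 26 28 31 32; do
  echo "DSCP $dscp -> priority $(default_prio_for_dscp "$dscp")"
done
```

DSCP 24 through 31 all report priority 3; DSCP 32 is the first value that falls into priority 4.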

Enable PFC on priority 3

# Enable PFC TX + RX on priority 3 only
sudo mlnx_qos -i ib0 --pfc 0,0,0,1,0,0,0,0

# Verify
sudo mlnx_qos -i ib0 | grep -A2 "PFC configuration"
# Expected:
#   PFC configuration:
#       priority   0  1  2  3  4  5  6  7
#       enabled    0  0  0  1  0  0  0  0

Tight scope is intentional. Only enable PFC on the priority that carries RoCE. Enabling PFC on multiple priorities multiplies the risk of deadlock and PFC storms without giving you anything in return.
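If your automation builds the --pfc bitmap from variables, it helps to check the bitmap before applying it. A minimal sketch that only parses the eight-flag string; pfc_enabled_prios is a hypothetical helper name and it does not talk to the NIC:

```shell
# Hypothetical helper: list the priorities enabled in a --pfc style bitmap.
# Input: eight comma-separated 0/1 flags, priority 0 first.
pfc_enabled_prios() {
  echo "$1" | awk -F, '{
    out = ""; sep = ""
    for (i = 1; i <= NF; i++) {
      if ($i == 1) { n = i - 1; out = out sep n; sep = " " }
    }
    print out
  }'
}

bitmap="0,0,0,1,0,0,0,0"
enabled=$(pfc_enabled_prios "$bitmap")
if [ "$enabled" = "3" ]; then
  echo "OK: PFC is scoped to priority 3 only"
else
  echo "WARNING: PFC enabled on priorities: ${enabled:-none}"
fi
```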

Enable ECN on priority 3 (NP and RP)

# Notification Point: NIC generates CNPs when it receives CE-marked packets
echo 1 | sudo tee /sys/class/net/ib0/ecn/roce_np/enable/3

# Reaction Point: NIC's DCQCN engine reacts to incoming CNPs (rate cut)
echo 1 | sudo tee /sys/class/net/ib0/ecn/roce_rp/enable/3

# Verify
cat /sys/class/net/ib0/ecn/roce_np/enable/3   # should be 1
cat /sys/class/net/ib0/ecn/roce_rp/enable/3   # should be 1

Both ends must have NP+RP enabled. RoCE is bidirectional — the same NIC is sometimes a sender (RP) and sometimes a receiver (NP). If you only enable NP, your NIC generates CNPs for others but ignores the ones it receives. Half a feedback loop = no feedback.

NVIDIA's default enables ECN on all 8 priorities, which is harmless when only priority 3 carries RoCE. A cleaner deployment restricts ECN to priority 3 only — purely a housekeeping preference.
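The "both halves of the loop" rule is easy to turn into a single check. A minimal sketch, assuming the sysfs layout shown above; ecn_loop_closed is a hypothetical helper name, and the directory is passed as an argument so the function can also be exercised against a scratch directory:

```shell
# Hypothetical helper: report whether both NP and RP are enabled for a priority.
# $1 = ecn directory (e.g. /sys/class/net/ib0/ecn), $2 = priority number.
ecn_loop_closed() {
  np=$(cat "$1/roce_np/enable/$2" 2>/dev/null)
  rp=$(cat "$1/roce_rp/enable/$2" 2>/dev/null)
  if [ "$np" = "1" ] && [ "$rp" = "1" ]; then
    echo "closed"
  else
    # A "?" means the sysfs file could not be read at all.
    echo "open (NP=${np:-?} RP=${rp:-?})"
  fi
}

# Usage on a real host (path taken from the commands above):
# ecn_loop_closed /sys/class/net/ib0/ecn 3
```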

Ring buffer tuning

NIC TX and RX rings absorb bursts that arrive faster than the NIC's processing pipeline. At 400 Gbps, microbursts are constant, and a small ring will overflow before PFC has a chance to back-pressure the upstream switch.

# Check current ring sizes
ethtool -g ib0

# Bump to maximum (typically 8192 on ConnectX-7)
sudo ethtool -G ib0 rx 8192 tx 8192

# Verify
ethtool -g ib0

For 400G RoCE, 8192/8192 is the working number. Default rings are often 1024–4096, which is fine for slower links but leaves no headroom at 400G. Bigger rings cost a small amount of memory and add a tiny amount of latency at the tail of the ring; for AI workloads where tail latency on RoCE is dominated by DCQCN backoff and not ring depth, this is the right trade.

If your NIC's max ring size is smaller (older ConnectX, or a constrained firmware setting), that's a firmware-level conversation with mlxconfig — not something you can fix with ethtool alone.
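Rather than hardcoding 8192, you can read the maximum the NIC actually reports. A minimal sketch that parses ethtool -g output from stdin, assuming the usual two-section layout ("Pre-set maximums:" then "Current hardware settings:"); ring_limits is a hypothetical helper name:

```shell
# Hypothetical helper: read `ethtool -g` output on stdin and print
# "rx_max tx_max rx_cur tx_cur".
ring_limits() {
  awk '
    /^Pre-set maximums/          { sec = "max" }
    /^Current hardware settings/ { sec = "cur" }
    $1 == "RX:" { rx[sec] = $2 }   # skips "RX Mini:" and "RX Jumbo:" lines
    $1 == "TX:" { tx[sec] = $2 }
    END { print rx["max"], tx["max"], rx["cur"], tx["cur"] }
  '
}

# Usage on a real host:
# set -- $(ethtool -g ib0 | ring_limits)
# sudo ethtool -G ib0 rx "$1" tx "$2"
```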

The NCCL_IB_TC=106 math

This trips up everyone the first time.

When you configure RoCE via sysfs or NCCL environment variables, you set the full 8-bit TOS byte, not the 6-bit DSCP value. The TOS byte layout:

IPv4 TOS byte:
+---+---+---+---+---+---+---+---+
|     DSCP (6 bits)     |  ECN  |
|                       |  (2)  |
+---+---+---+---+---+---+---+---+
  7   6   5   4   3   2   1   0

DSCP lives in the top 6 bits. The bottom 2 bits are reserved for ECN flags. So to encode "DSCP 26 with ECN-capable transport," you compute:

TOS = (DSCP << 2) | ECN_bits
    = (26   << 2) | 0b10      ← ECT(0) is binary 10 = decimal 2
    = 104         | 2
    = 106

That's where NCCL_IB_TC=106 comes from. NCCL_IB_TC is the full TOS byte, not just DSCP.
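The arithmetic above fits in two one-line shell functions, which is handy for checking a value before it goes into a job script. A minimal sketch; tos_from_dscp and dscp_from_tos are hypothetical helper names:

```shell
# Hypothetical helpers for the TOS arithmetic above.
# tos_from_dscp DSCP ECN_BITS -> full 8-bit TOS byte
tos_from_dscp() {
  echo $(( ($1 << 2) | ($2 & 3) ))
}

# dscp_from_tos TOS -> the DSCP the network actually sees
dscp_from_tos() {
  echo $(( $1 >> 2 ))
}

echo "DSCP 26 + ECT(0) -> TOS $(tos_from_dscp 26 2)"
echo "TOS 26 (the classic bug) -> DSCP $(dscp_from_tos 26)"
```

The first line reports TOS 106; the second shows why a bare 26 ends up as DSCP 6 on the wire.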

The same gotcha hits sysfs:

# Set the default DSCP for outgoing RoCE traffic on this NIC
echo 26 | sudo tee /sys/class/infiniband/mlx5_0/tc/1/traffic_class
# WRONG — this writes TOS=26, which is DSCP=6 (26 >> 2 = 6)

# Correct:
echo 104 | sudo tee /sys/class/infiniband/mlx5_0/tc/1/traffic_class
# This writes TOS=104, which is DSCP=26 with ECN bits cleared

# Or with ECT(0):
echo 106 | sudo tee /sys/class/infiniband/mlx5_0/tc/1/traffic_class
# TOS=106 = DSCP 26 + ECN ECT(0). This is what NCCL uses.

Cheat sheet:

What you want	Compute	Value
DSCP 0 (best effort)	`0 << 2`	`0`
DSCP 26 (RoCE, no ECN)	`26 << 2`	`104`
DSCP 26 + ECT(0) (RoCE, ECN-capable)	`(26 << 2) | 2`	`106`
DSCP 46 (EF/voice)	`46 << 2`	`184`

The value that NCCL wants for RoCE is almost always 106. If you're typing 26 into a TOS or traffic-class field, you're typing the wrong number.

Setting it for an NCCL job

export NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3   # which NICs to use (RDMA device names, not netdev names; adjust to your host)
export NCCL_IB_GID_INDEX=3                # RoCE v2 GID
export NCCL_IB_TC=106                     # TOS = DSCP 26 + ECT(0)
mpirun -np 8 ./your-training-job

The same NCCL_IB_HCA / GID_INDEX setup is covered in NCCL and GPUDirect. The TC value is the lossless-specific knob that pairs the NCCL job to the switch's lossless queue.

Verifying DSCP on the wire

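# Capture egress, look at TOS byte
# (RoCE is offloaded by the NIC, so a plain netdev capture may show nothing;
#  depending on driver and libpcap version you may need to capture on the
#  RDMA device or enable the NIC's sniffer mode first)
sudo tcpdump -i ib0 -nn -e -v -c 10 ip and host <peer> | grep tos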

# Look for "tos 0x68" — that's 104 decimal = DSCP 26
# Or "tos 0x6a" — that's 106 decimal = DSCP 26 + ECT(0)

If you see tos 0x00 on your RDMA packets, your DSCP marking didn't make it. Walk back: trust mode → DSCP-to-TC mapping → NCCL_IB_TC value → NIC's per-port traffic_class sysfs value.
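Reading hex TOS values by eye gets error-prone, so here is a small decoder for captured lines. A minimal sketch, assuming tcpdump -v prints the field as "tos 0xNN"; decode_tos_lines is a hypothetical helper name:

```shell
# Hypothetical helper: read `tcpdump -v` lines on stdin and decode each
# "tos 0xNN" field into DSCP (top 6 bits) and ECN (bottom 2 bits).
decode_tos_lines() {
  sed -n 's/.*tos \(0x[0-9a-fA-F]*\).*/\1/p' | while read -r hex; do
    tos=$(printf '%d' "$hex")
    echo "tos=$hex dscp=$(( tos >> 2 )) ecn=$(( tos & 3 ))"
  done
}

# Usage on a real host:
# sudo tcpdump -i ib0 -nn -v -c 10 ip | decode_tos_lines
```

A line carrying tos 0x1a decodes to DSCP 6, which is the signature of the unshifted-26 mistake.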

The counter reference card

Every RoCE problem ends with a counter check. The counters you care about live in two separate sysfs trees, and people waste hours looking in the wrong one. Here's the canonical map.

Path 1: `/sys/class/infiniband/<dev>/ports/1/hw_counters/`

This is the RDMA-specific tree — DCQCN counters, RoCE retransmits, RNR errors. Everything that's specific to "this is RoCE, not just Ethernet" lives here.

Counter	Meaning	Normal value	Watch when
`np_ecn_marked_roce_packets`	RX packets arriving with the CE bit set	Some during traffic	Climbing = network is congested upstream
`np_cnp_sent`	CNPs this NIC generated (acting as receiver/NP)	Tracks `np_ecn_marked_roce_packets`	Zero with marks > 0 means NP is disabled — check `/sys/.../ecn/roce_np/enable/3`
`rp_cnp_handled`	CNPs this NIC reacted to (acting as sender/RP, DCQCN cut rate)	Tracks the peer's `np_cnp_sent`	Zero while the peer's `np_cnp_sent` climbs = RP is disabled, DCQCN is broken; check `/sys/.../ecn/roce_rp/enable/3`
`rp_cnp_ignored`	CNPs received but ignored by hardware (driver/firmware bug)	Always zero	Any non-zero value = tuning or firmware bug, file a vendor case
`roce_slow_restart`	DCQCN slow-restart entries (rate cut hard, restart from minimum)	Occasional during congestion bursts	Frequent = severe sustained congestion or DCQCN too aggressive
`out_of_sequence`	RX packets arriving out of order	Near zero	Climbing = reorder or drop somewhere in the fabric (ECMP polarization, bad link)
`packet_seq_err`	RDMA sequence errors (lost packet detected)	Near zero	Climbing = drops are happening; lossless isn't lossless
`roce_adp_retrans`	Adaptive retransmits (RC transport recovering)	Near zero	Climbing = RDMA is retransmitting, which it shouldn't have to
`req_rnr_retries_exceeded`	RNR (Receiver Not Ready) retries exhausted	Zero	Any value = QP failed because peer ran out of receive WRs
`req_transport_retries_exceeded`	Transport retries exhausted	Zero	Any value = QP failed, ACK never arrived
`local_ack_timeout_err`	Local ACK timeout (ACK never came back)	Zero	Any value = QP failed, probably a peer or fabric issue
`out_of_buffer`	RX work request not posted in time by the app	Near zero	Climbing = app isn't posting receive buffers fast enough
`rx_read_requests`	RDMA READs received	Tracks workload	Use to verify traffic actually flowed
`rx_write_requests`	RDMA WRITEs received	Tracks workload	Use to verify traffic actually flowed

Path 2: `/sys/class/infiniband/<dev>/ports/1/counters/`

This is the standard IB-style counter tree — bytes, packets, link state, physical-layer errors. These are the IB-Spec-defined counters that exist for any IB or RoCE device.

Counter	Meaning	Watch when
`port_xmit_data`	Total bytes sent (in octets / 4 — IB convention)	Confirm traffic on egress; multiply by 4 for actual bytes
`port_rcv_data`	Total bytes received (octets / 4)	Confirm traffic on ingress
`port_xmit_packets`	Packets sent	High-level traffic stat
`port_rcv_packets`	Packets received	High-level traffic stat
`port_xmit_discards`	TX-side drops	Any value = check switch ingress and NIC ring sizes
`port_xmit_wait`	TX wait cycles (proxy for time spent paused by PFC)	Climbing = backpressure / PFC fired on this NIC
`port_rcv_errors`	RX errors	Any value = check cabling / FCS / link quality
`link_downed`	Number of link-down events since boot	Any value = bad optic, cable, fiber, or peer port
`link_error_recovery`	Recovery events (link bounce)	Any value = marginal physical layer
`symbol_error`	Symbol errors	Climbing = dirty optic or bad fiber
`excessive_buffer_overrun_errors`	RX buffer overrun	Any value = host CPU or PCIe bottleneck, not a fabric problem

Per-priority Ethernet counters via ethtool

There's also a third place to look — ethtool -S <nic> exposes per-priority PFC and byte counters that aren't in either sysfs tree:

ethtool -S ib0 | grep -iE "prio_?3|pfc|ecn|cnp|tx_pause|rx_pause"

Key ones:

Counter	Meaning
`tx_prio3_packets` / `tx_prio3_bytes`	Egress traffic on priority 3 (your RoCE)
`rx_prio3_packets` / `rx_prio3_bytes`	Ingress traffic on priority 3
`tx_prio3_pause`	PAUSE frames this NIC sent upstream (i.e., this NIC was congested)
`rx_prio3_pause`	PAUSE frames this NIC received (i.e., the upstream switch was congested and asked us to stop)
`rx_prio3_pause_duration`	Total microseconds this NIC was paused

Reading PFC direction: tx_prio3_pause climbing means we asked the switch to stop sending to us — our NIC is the bottleneck. rx_prio3_pause climbing means the switch asked us to stop sending — the switch or downstream is the bottleneck.

Which tree do I look at?

"Is DCQCN working?" → hw_counters/np_cnp_sent and rp_cnp_handled.
"Are we dropping?" → hw_counters/packet_seq_err, counters/port_xmit_discards, counters/port_rcv_errors.
"Is PFC firing?" → ethtool -S <nic> | grep pause.
"Did traffic actually flow?" → counters/port_xmit_data (multiply by 4) and hw_counters/rx_write_requests.
"Is the physical link healthy?" → counters/link_downed, symbol_error, link_error_recovery.

Pre/post-test counter diff pattern

A single counter snapshot is useless. Counters are cumulative since boot. What you need is the delta across your test — what changed, not what's there.

Use this pattern every time you benchmark RoCE:

#!/bin/bash
# Pre/post counter diff for RoCE testing

DEV=mlx5_0
PORT=1
PRE_DIR=/tmp/counters_pre
POST_DIR=/tmp/counters_post

# === STEP 1: Capture baseline ===
mkdir -p "$PRE_DIR"
for f in /sys/class/infiniband/$DEV/ports/$PORT/hw_counters/*; do
  cp "$f" "$PRE_DIR/$(basename $f).pre"
done
for f in /sys/class/infiniband/$DEV/ports/$PORT/counters/*; do
  cp "$f" "$PRE_DIR/$(basename $f).pre"
done
ethtool -S ib0 | grep -iE "prio_?3|pfc|pause|ecn|cnp" > "$PRE_DIR/ethtool.pre"

# === STEP 2: Run your workload ===
ib_send_bw -d $DEV -x 3 -D 60 <peer_ip>
# Or: mpirun ... your NCCL training job ...

# === STEP 3: Capture post-test, compute deltas ===
echo "=== hw_counters deltas (RDMA-specific) ==="
for f in /sys/class/infiniband/$DEV/ports/$PORT/hw_counters/*; do
  name=$(basename $f)
  pre=$(cat "$PRE_DIR/$name.pre")
  post=$(cat "$f")
  delta=$((post - pre))
  [ "$delta" -gt 0 ] && echo "  $name: pre=$pre post=$post DELTA=$delta"
done

echo ""
echo "=== counters deltas (IB-spec) ==="
for f in /sys/class/infiniband/$DEV/ports/$PORT/counters/*; do
  name=$(basename $f)
  pre=$(cat "$PRE_DIR/$name.pre")
  post=$(cat "$f")
  delta=$((post - pre))
  [ "$delta" -gt 0 ] && echo "  $name: pre=$pre post=$post DELTA=$delta"
done

echo ""
echo "=== ethtool (PFC + per-priority) deltas ==="
mkdir -p "$POST_DIR"   # make sure the post-test directory exists before writing to it
ethtool -S ib0 | grep -iE "prio_?3|pfc|pause|ecn|cnp" > "$POST_DIR/ethtool.post"
diff "$PRE_DIR/ethtool.pre" "$POST_DIR/ethtool.post" | grep -E "^[<>]"
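The diff in the last step shows which ethtool lines changed but not by how much. If you want numeric deltas for the ethtool counters as well, something like the following can replace it. A minimal sketch, assuming each saved line has the "name: value" shape that ethtool -S prints; ethtool_deltas is a hypothetical helper name:

```shell
# Hypothetical helper: print numeric deltas between two saved `ethtool -S`
# snapshots. $1 = pre snapshot file, $2 = post snapshot file.
ethtool_deltas() {
  awk -F': *' '
    { gsub(/^[ \t]+/, "", $1) }            # strip indentation from the counter name
    NR == FNR { pre[$1] = $2; next }       # first file: remember baseline values
    ($1 in pre) && ($2 - pre[$1] != 0) {   # second file: report counters that moved
      printf "  %s: pre=%s post=%s DELTA=%.0f\n", $1, pre[$1], $2, $2 - pre[$1]
    }
  ' "$1" "$2"
}

# Usage with the files written by the script above:
# ethtool_deltas /tmp/counters_pre/ethtool.pre /tmp/counters_post/ethtool.post
```

Counters that did not move are suppressed, matching the behavior of the sysfs loops in the script.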

Interpreting the deltas

After a healthy 60-second ib_send_bw test, you should see:

port_xmit_data and port_rcv_data deltas matching your expected bandwidth × duration (remember: multiply by 4 for actual bytes).
tx_prio3_bytes and/or rx_prio3_bytes ≫ tx_prio0_bytes, meaning your traffic landed in the right priority.
np_ecn_marked_roce_packets and np_cnp_sent showing modest activity — some marking is normal under sustained load.
rp_cnp_handled on the sender side roughly matching the peer's np_cnp_sent — the feedback loop closed.
rx_prio3_pause low or zero, tx_prio3_pause low or zero — PFC was the safety net, not the primary mechanism.

After an unhealthy test:

packet_seq_err non-zero → drops happened. Check headroom and physical layer.
rp_cnp_ignored non-zero → DCQCN tuning or firmware bug.
rx_prio3_pause_duration high (more than ~5% of test time) → PFC was firing constantly; ECN isn't keeping queues short enough.
tx_prio0_bytes significant while tx_prio3_bytes is small → your traffic landed in the wrong queue. DSCP marking or trust mode is broken.
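The "~5% of test time" check needs a unit conversion, because the pause-duration counter is in microseconds and the test length is in seconds. A minimal sketch; pause_pct is a hypothetical helper name, the sample delta is made up, and the 5% threshold is the rule of thumb from the text rather than a hard limit:

```shell
# Hypothetical helper: share of the test spent paused, in percent.
# $1 = rx_prio3_pause_duration delta in microseconds, $2 = test duration in seconds.
pause_pct() {
  awk -v us="$1" -v secs="$2" 'BEGIN { printf "%.1f\n", us / (secs * 1e6) * 100 }'
}

# Illustrative delta: 4.5 seconds of pause during a 60-second test
pct=$(pause_pct 4500000 60)
echo "paused ${pct}% of the test"

# awk exits 0 (success) only when the percentage is above the threshold
if awk -v p="$pct" 'BEGIN { exit !(p > 5) }'; then
  echo "PFC is firing too often: ECN is not keeping queues short"
else
  echo "PFC looks like a safety net"
fi
```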

CREATE / VALIDATE / VERIFY / USE — the host-side playbook

This is the operational sequence. CREATE configures. VALIDATE confirms config matches design. VERIFY proves it behaves losslessly under load. USE is how applications consume it.

Watch the create-to-verify loop run on the rockynet lab simulator — trust mode flipped to DSCP, PFC and ECN enabled on priority 3, snapshot the counters at zero, drive ib_write_bw, then watch np_cnp_sent and rp_cnp_handled climb. That's DCQCN doing its job:

MODULE host-networking · LAB 1: watch the recording for every command, every counter, every output.

A. CREATE — configure the NIC

Loop this over ib0, ib1, ib2, ib3 (or whatever your backend NICs are named). In production, this is automated by Ansible / Salt / Puppet — never typed by hand per host.

NIC=ib0   # repeat for each backend NIC

# 1. Trust mode = DSCP (L3 classification)
sudo mlnx_qos -i $NIC --trust dscp

# 2. DSCP 26 → priority 3 → TC 3
sudo mlnx_qos -i $NIC --dscp2prio set,26,3
sudo mlnx_qos -i $NIC --prio_tc 0,0,0,3,0,0,0,0

# 3. PFC enabled on priority 3 only
sudo mlnx_qos -i $NIC --pfc 0,0,0,1,0,0,0,0

# 4. ECN on priority 3, both NP and RP
echo 1 | sudo tee /sys/class/net/$NIC/ecn/roce_np/enable/3
echo 1 | sudo tee /sys/class/net/$NIC/ecn/roce_rp/enable/3

# 5. Max out the rings
sudo ethtool -G $NIC rx 8192 tx 8192

B. VALIDATE — confirm config matches design

# Full mlnx_qos sanity check
sudo mlnx_qos -i ib0 | head -30
# Look for:
#   PFC configuration: enabled  0 0 0 1 0 0 0 0
#   tc-trust:          dscp
#   DSCP-priority:     dscp 26 -> priority 3
#   priority 3 -> TC 3

# ECN enable check (all priorities)
for p in 0 1 2 3 4 5 6 7; do
  np=$(cat /sys/class/net/ib0/ecn/roce_np/enable/$p 2>/dev/null)
  rp=$(cat /sys/class/net/ib0/ecn/roce_rp/enable/$p 2>/dev/null)
  echo "Priority $p: NP=$np RP=$rp"
done
# Expected: Priority 3 NP=1 RP=1

# Trust mode
cat /sys/class/net/ib0/qos/trust
# Expected: dscp

# Cross-NIC consistency — all 4 NICs identical
for nic in ib0 ib1 ib2 ib3; do
  echo "=== $nic ==="
  sudo mlnx_qos -i $nic | grep -E "PFC|enabled|trust|26.*3"
done
# Drift between NICs = bug

C. VERIFY — prove lossless under load

Validation confirms config matches the design. Verification confirms the design actually delivers losslessness when traffic flows. Use the pre/post counter diff pattern above with a real workload:

# Peer: ib_send_bw -d mlx5_0 -x 3
# This host:
ib_send_bw -d mlx5_0 -x 3 -D 60 <peer_ip>

Then check the deltas. The healthy pattern is lots of bytes, modest ECN/CNP activity, near-zero PFC. The unhealthy pattern is lots of PFC, low CNP — ECN didn't fire in time, PFC caught the overflow, throughput collapsed during pauses.

Interpretation table

Counter delta	Expected	If different
`tx_prio3_bytes` ≫ `tx_prio0_bytes`	Yes	DSCP marking broken; check trust mode + `traffic_class` sysfs
`np_cnp_sent`	Modest, tracks ECN marks	Zero with marks > 0 = NP disabled
`rp_cnp_handled`	Tracks peer's `np_cnp_sent`	Zero or far below the peer's `np_cnp_sent` = RP disabled, DCQCN broken
`rx_prio3_pause_duration`	< 5% of test time	High = ECN tuning too lax; switch is hitting XOFF instead of ECN-marking
`packet_seq_err`	Zero	Any value = actual drops happened; "lossless" isn't
`port_xmit_wait`	Stable	Climbing = PFC backpressure is real and frequent
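The NP and RP rows of the table can be folded into one verdict once you have the three deltas in hand. A minimal sketch that encodes only the rules stated in the table; cnp_loop_verdict is a hypothetical helper name, and it takes the numbers as arguments rather than reading sysfs:

```shell
# Hypothetical helper: sanity-check the CNP feedback loop from three deltas.
# $1 = np_ecn_marked_roce_packets (receiver), $2 = np_cnp_sent (receiver),
# $3 = rp_cnp_handled (sender).
cnp_loop_verdict() {
  if [ "$1" -eq 0 ]; then
    echo "no CE marks seen: nothing to judge"
  elif [ "$2" -eq 0 ]; then
    echo "NP looks disabled: CE marks arrived but no CNPs were sent"
  elif [ "$3" -eq 0 ]; then
    echo "RP looks disabled: CNPs were sent but the sender handled none"
  else
    echo "feedback loop looks closed"
  fi
}

# Illustrative deltas, not measured values
cnp_loop_verdict 1200 1150 1140
```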

D. USE — how applications consume lossless RDMA

D.1 RDMA apps mark DSCP from the NIC's per-port default

When an app creates a QP, the NIC stamps egress packets with whichever DSCP is configured in the per-port traffic_class (or the app overrides it via traffic_class in the QP creation). Most apps (MPI, NCCL) don't set it explicitly — they inherit the NIC default.

# Set the per-port default for outgoing RoCE traffic
echo 106 | sudo tee /sys/class/infiniband/mlx5_0/tc/1/traffic_class
# 106 = TOS byte for DSCP 26 + ECT(0)

# Verify
cat /sys/class/infiniband/mlx5_0/tc/1/traffic_class

D.2 NCCL — explicit override

For NCCL training jobs, override per-job:

export NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3   # 4-rail backend (RDMA device names, not netdev names; adjust to your host)
export NCCL_IB_GID_INDEX=3            # RoCE v2 GID
export NCCL_IB_TC=106                 # DSCP 26 + ECT(0)
mpirun -np 8 python ./train.py

NCCL_IB_TC=106 is the canonical setting and the one to memorize. If you see NCCL_IB_TC=26 somewhere in someone's job script, it is wrong — they're getting DSCP 6, not DSCP 26.

D.3 Troubleshoot when DSCP marking is wrong

# Capture egress, look at TOS byte
sudo tcpdump -i ib0 -nn -e -v -c 10 ip and host <peer> | grep tos

# If tos != 0x68 (DSCP 26 no ECN) or 0x6a (DSCP 26 + ECT(0)):
#   1. Check trust mode: cat /sys/class/net/ib0/qos/trust  (must be 'dscp')
#   2. Check DSCP-to-TC mapping: sudo mlnx_qos -i ib0 | grep dscp
#   3. Check NCCL_IB_TC value (or the per-port traffic_class sysfs)
#   4. Check the app isn't overriding traffic_class with the wrong value
#   5. Check the app opened the QP on the expected NIC (multi-rail issue)

A sample healthy configuration snapshot

For a sample host running a 4-NIC backend at 400G, the configuration looks like this when correctly deployed:

Setting	Value	Why it's right
DCBX mode	OS-controlled	Modern default — the OS owns QoS, not switch-pushed DCBX
Trust mode	DSCP	L3 classification is the modern path
DSCP→Priority mapping	groups-of-8 (DSCP 24–31 → priority 3)	Robust against single-DSCP typos; DSCP 26 lands in priority 3
Priority→TC mapping	priority 3 → TC 3	Lossless queue isolated
PFC enabled priorities	3 only	Tight scope, no deadlock surface area
Lossless buffer (priority 3)	270 KB on a sample host (7m cable)	Adequate for short DAC runs; verify under load for longer cables
ECN on priority 3	NP=1, RP=1	Feedback loop closed both ways
Ring sizes (TX/RX)	8192 / 8192	At the NIC max for 400G
TSA (Transmission Selection)	vendor	NIC-default scheduling, no custom carving
4-NIC consistency	Identical config	No drift between rails

After a fresh boot with no workload, every hw_counter should read zero. The only non-zero value at quiescence is lifespan = 12, which is the kernel refresh interval for the counter file, not a traffic statistic. Once workloads run, you watch the deltas — not the absolute values.

💡 What you should remember

#		Concept	Why it matters
1	🚫	Switch-side config + host-side config = lossless.	Either side wrong, the whole thing is silently broken. The switch side does PFC PAUSE, ECN marking, and DCQCN-tuned buffer profiles. The host side does mlnx_qos + sysfs + ring tuning.
2	🏷️	Trust mode must be `dscp`.	PCP trust on a modern fabric means your DSCP marks get ignored. Check `/sys/class/net/<nic>/qos/trust`.
3	📦	PFC enable only on priority 3.	Tight scope = no deadlock. `--pfc 0,0,0,1,0,0,0,0` and nothing else.
4	🔁	ECN needs NP and RP enabled on priority 3.	NP-only means you generate CNPs but ignore the ones you receive. Half a loop is no loop.
5	🧮	`NCCL_IB_TC=106`, not 26.	The TOS byte is `(DSCP << 2) | ECN_bits`. 26 left-shifted by 2 = 104, plus ECT(0) bit = 106. The 26-without-the-shift mistake is the most common RoCE config bug in the wild.
6	📊	Counters live in two trees.	`hw_counters/` for RDMA-specific stuff (CNP, DCQCN, sequence errors). `counters/` for IB-spec basics (bytes, link state, physical). PFC counters are in `ethtool -S` only.
7	🛠️	Always use pre/post-test deltas.	Cumulative counters since boot tell you nothing about your test. Snapshot before, snapshot after, diff.
8	⚠️	Healthy load: lots of bytes, modest ECN/CNP, near-zero PFC.	Unhealthy load: PFC pause duration ≫ a few % of test time means ECN isn't firing in time — that's a switch-side tuning issue, but you can see it from the host counters.
9	🔀	Drift across rails is a bug.	All 4 NICs on a multi-rail host must show identical mlnx_qos output. If one drifted, find out why and re-apply the config.

Next: Multi-Rail Source Routing → — why a multi-rail host needs source-based routing, and the ARP / rp_filter / routing-table setup that keeps each rail's traffic pinned to its own NIC.
