ECN — Explicit Congestion Notification
ECN is the early warning system. Where PFC stops the sender after a buffer fills, ECN tells the sender to slow down before the buffer fills.
Per the standard (RFC 3168), ECN uses two bits in the IP header. The switch marks them on packets traversing congested egress; the receiver echoes back to the sender; the sender adjusts its rate.
This page covers the mechanics and the tuning that matters for AI fabrics.
- Decode the ECN bits — Not-ECT / ECT(0) / ECT(1) / CE — and explain why the CE bit is non-destructive (the packet still arrives).
- Walk the WRED marking curve — `min_th`, `max_th`, `max_p`, and what each does to congestion signal strength.
- Trace the DCQCN loop end-to-end — switch marks → CNP back → multiplicative decrease → CNP gap → additive increase.
- Configure ECN on all three layers — switch WRED, sender NIC RP, receiver NIC NP. Know why missing any one breaks the loop.
The ECN bits
In the IP TOS / DSCP byte, the lowest two bits are ECN:
| Bits | Code Point | Meaning |
|---|---|---|
| 00 | Not-ECT | Not ECN-capable (legacy / unmarked) |
| 01 | ECT(1) | ECN-capable, "ECT one" |
| 10 | ECT(0) | ECN-capable, "ECT zero" |
| 11 | CE | Congestion Experienced (marked by switch) |
The sender sets ECT(0) or ECT(1) when emitting a packet (RoCE v2 NICs do ECT(0) by default). When a congested switch decides to mark, it rewrites those bits to CE (11). The receiver sees CE and signals the sender to slow down.
The marking is non-destructive — the packet still reaches the destination, just with a flag set. That's the key difference from drop-based congestion signaling.
Exactly where the bits live
The two ECN bits are the low 2 bits of the second byte of the IPv4 header, the byte the IETF calls the Differentiated Services (DS) field and older documents call the ToS byte. (IPv6 carries the same 6 + 2 bit layout in its Traffic Class field.)
One byte does double duty. The switch reads the top 6 bits (DSCP) on ingress to decide which queue the packet lands in, and — if that queue is congested past the WRED threshold — rewrites the bottom 2 bits (ECN) to CE on the way out. Same byte, two jobs. This is L3: it sits in the IP header, so it is identical whether the payload above is TCP, UDP, or a RoCE v2 packet. The switch never looks at L4 to mark.
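The split of that one byte can be sketched in a few lines of Python. This is an illustration only: the function names `split_ds_byte` and `mark_ce` are made up for this example, and DSCP 26 is just a sample value, not something this page prescribes.

```python
# Hypothetical helpers illustrating the DS byte layout: DSCP (6 bits) + ECN (2 bits).
ECN_NAMES = {0b00: "Not-ECT", 0b01: "ECT(1)", 0b10: "ECT(0)", 0b11: "CE"}


def split_ds_byte(ds_byte: int) -> tuple[int, str]:
    """Return (DSCP value, ECN code point name) for one DS/ToS byte."""
    dscp = ds_byte >> 2        # top 6 bits: used to pick the queue
    ecn = ds_byte & 0b11       # bottom 2 bits: the ECN field
    return dscp, ECN_NAMES[ecn]


def mark_ce(ds_byte: int) -> int:
    """Rewrite ECN to CE if the packet is ECN-capable; DSCP is left untouched."""
    if ds_byte & 0b11 == 0b00:  # Not-ECT: not eligible for marking
        return ds_byte
    return ds_byte | 0b11       # ECT(0) or ECT(1) becomes CE (11)


# Sample byte: DSCP 26 (assumed for illustration) with ECT(0) set.
byte_in = (26 << 2) | 0b10
print(split_ds_byte(byte_in))           # (26, 'ECT(0)')
print(split_ds_byte(mark_ce(byte_in)))  # (26, 'CE')
```

Note that marking changes only the low two bits, so the queue selection made from the DSCP is unaffected.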
Is this TCP's ECN? No — and the difference is the whole point
ECN marking is the same everywhere (RFC 3168, the IP bits above). What differs — completely — is how the bad news gets back to the sender, and what the sender does with it. This is the question architects actually care about:
| | Classic TCP ECN | DCTCP | RoCE v2 — DCQCN |
|---|---|---|---|
| L4 transport | TCP | TCP | UDP (port 4791) |
| How CE is echoed | receiver sets the ECE flag in the TCP header of its ACKs | receiver echoes every CE; sender estimates the marked fraction α | receiver NIC mints a separate CNP packet (no ACK to ride on — UDP has none) |
| Sender reaction | halve the congestion window (cwnd), set the CWR flag | cut window proportionally: cwnd × (1 − α/2) | cut the QP's send rate (hardware rate limiter), per DCQCN |
| Control variable | window | window | rate |
| Reaction lives in | the TCP stack (kernel) | the TCP stack (kernel) | NIC silicon (no CPU) |
| Granularity | ~1 reaction / RTT | proportional, per RTT | continuous rate, coalesced CNPs |
The key realization: RoCE v2 runs over UDP, and UDP has no ACKs and no header flags. TCP can piggyback congestion feedback (ECE/CWR are two bits in a header that's already flowing back as part of the reliable byte-stream). RoCE can't — so the receiving NIC has to manufacture a brand-new packet, the CNP, purely to carry the word "slow down" upstream. And because RDMA has no congestion window, the sender doesn't shrink a window — it turns a rate knob. Window-based vs rate-based is the deepest structural difference between TCP congestion control and RoCE congestion control.
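To make the "Sender reaction" row concrete, here is a small sketch of the three reactions side by side. The function names are hypothetical, `alpha` stands for the sender's estimate of how heavily it is being marked, and the numbers are illustrative rather than measured.

```python
def classic_ecn_cwnd(cwnd: float) -> float:
    """Classic TCP ECN: halve the congestion window on an ECE echo."""
    return cwnd / 2


def dctcp_cwnd(cwnd: float, alpha: float) -> float:
    """DCTCP: cut the window in proportion to the marked fraction alpha."""
    return cwnd * (1 - alpha / 2)


def dcqcn_rate(rate_gbps: float, alpha: float) -> float:
    """DCQCN: same shape of cut, but applied to a send rate, not a window."""
    return rate_gbps * (1 - alpha / 2)


# Light congestion (alpha = 0.1): classic ECN gives up half the window,
# while the proportional schemes give up about 5%.
print(classic_ecn_cwnd(100.0))
print(dctcp_cwnd(100.0, 0.1))
print(dcqcn_rate(400.0, 0.1))
```

The first two functions act on a window measured in packets or bytes; the third acts on a rate, which is the structural difference the table describes.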
How the switch marks
Most modern AI switches mark ECN with WRED (Weighted Random Early Detection), a form of AQM (Active Queue Management). Three parameters define the marking curve:
- Minimum threshold (`min_th`) — below this queue depth, no marking.
- Maximum threshold (`max_th`) — at or above this, mark every packet.
- Maximum probability (`max_p`) — the marking probability reached just below `max_th`; between `min_th` and `max_th` the probability ramps linearly from 0 up to `max_p`.
    marking
    probability
      1.0 |                   ____________
          |                  |
          |                  |
    max_p |..................|
          |                / :
          |            /     :
          |        /         :
      0.0 |____/_____________:____________  queue depth
            min_th        max_th
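The curve can be written as a short function. This is a sketch of the textbook WRED shape described above, not any vendor's implementation; real switches may use averaged queue depth or per-platform variations, and the function names are invented for this example.

```python
import random


def wred_mark_probability(queue_bytes: int, min_th: int, max_th: int, max_p: float) -> float:
    """Marking probability for the current queue depth."""
    if queue_bytes < min_th:
        return 0.0                      # below min_th: never mark
    if queue_bytes >= max_th:
        return 1.0                      # at or above max_th: mark every packet
    # Linear ramp from 0 at min_th up to max_p just below max_th.
    return max_p * (queue_bytes - min_th) / (max_th - min_th)


def should_mark(queue_bytes: int, min_th: int, max_th: int, max_p: float,
                rng=random.random) -> bool:
    """Per-packet decision: draw a random number and compare it to the curve."""
    return rng() < wred_mark_probability(queue_bytes, min_th, max_th, max_p)


# Illustrative thresholds: 50 KB, 1 MB, max_p = 0.1.
for depth in (10_000, 525_000, 2_000_000):
    print(depth, wred_mark_probability(depth, 50_000, 1_000_000, 0.1))
```

Passing `rng` as a parameter is only there to make the per-packet decision testable with a fixed value.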
For RoCE v2 + DCQCN, typical tunings:
| Parameter | Typical value (400G fabric) |
|---|---|
| `min_th` | 50–150 KB |
| `max_th` | 1–2 MB |
| `max_p` | 0.1–0.2 (10–20% marking probability at max_th) |
These values are per-vendor and per-buffer-size; the numbers above are starting points. Vendors publish reference values (NVIDIA Spectrum-X, Arista AI references) that you should read before tuning.
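One way to get a feel for these thresholds is to convert queue depth into the time the port needs to drain it at line rate. The helper below is plain arithmetic; the name `drain_time_us` is made up, and it assumes 1 KB = 1000 bytes and a single queue draining at the full port speed.

```python
def drain_time_us(queue_bytes: int, link_gbps: float) -> float:
    """Microseconds needed to drain queue_bytes at link_gbps (line rate, one queue)."""
    bits = queue_bytes * 8
    seconds = bits / (link_gbps * 1e9)
    return seconds * 1e6


# Thresholds from the table above, on a 400G port.
for label, depth in (("min_th 50 KB", 50_000), ("min_th 150 KB", 150_000),
                     ("max_th 1 MB", 1_000_000), ("max_th 2 MB", 2_000_000)):
    print(f"{label}: {drain_time_us(depth, 400):.1f} us of queueing")
```

Under those assumptions, 50 KB is about 1 µs of queueing at 400G and 1 MB is about 20 µs, which shows how early in the queue's growth the first marks appear.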
The ECN feedback loop
End-to-end:
- Sender NIC sends RoCE v2 packets with ECT(0) set.
- Switch at the congested egress marks ECN to CE on a fraction of those packets (per WRED/AQM curve).
- Receiver NIC sees the CE-marked packets, generates a CNP (Congestion Notification Packet — a one-packet RDMA control message) and sends it back to the sender.
- Sender NIC's DCQCN engine sees the CNP and dials down the sender's rate for that flow / QP.
- Sender keeps sending at the lower rate; queue at the switch drains.
- If no further CNPs arrive for a window, sender ramps the rate back up.
The whole loop is fully NIC-offloaded — no CPU on either end. CNPs are small (a few dozen bytes) and ride the RoCE v2 path in reverse.
The CNP, on the wire
The CNP is a real RoCE v2 packet, not a flag — so it's worth knowing exactly what it is, because you'll grep for it in NIC counters (np_cnp_sent, rp_cnp_handled) when a fabric misbehaves.
- BTH opcode `0x81` is the RoCE v2 CNP — a dedicated opcode that means nothing but "you, the source QP in this header, are causing congestion." It carries zero data payload.
- It's addressed to the source QP of the CE-marked packet, so the sender NIC knows precisely which of its thousands of queue pairs to throttle — not the whole link, one flow.
- The receiver NIC coalesces CNPs: at most one per congested QP per `cnp_min_period` (often ~4–50 µs, NIC-tunable). A storm of CE marks does not become a storm of CNPs — otherwise the feedback would itself congest the reverse path.
- CNPs usually travel on their own DSCP / priority (DSCP 48, priority 6 is the common convention), often in an expedited or separate lossless class, so the "slow down" message never gets stuck in the queue behind the very traffic it's reporting on. A CNP that arrives late is a CNP that didn't help.
Contrast this with TCP: there is no "CNP" in TCP because the ECE bit is already riding the ACK that TCP was going to send anyway. RoCE has to spend a whole packet — that's the price of putting reliability and congestion control in the NIC instead of a transport stack.
This is also why the end-to-end RoCE v2 transaction page matters: the CNP shares the wire and the QP machinery (PSN, ACK/NAK) with the data path, all in silicon.
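The coalescing rule from the CNP section (at most one CNP per congested QP per minimum period) can be sketched as a small state machine. This models the behavior in software purely for illustration; real NICs do this in hardware, and the class and method names here are hypothetical.

```python
class CnpCoalescer:
    """Toy model of the NP side: rate-limit CNPs per source QP."""

    def __init__(self, min_period_us: float = 4.0):
        self.min_period_us = min_period_us
        self._last_cnp_us: dict[int, float] = {}   # source QP -> time of last CNP

    def on_ce_packet(self, src_qp: int, now_us: float) -> bool:
        """Return True if a CNP should be sent for this CE-marked packet."""
        last = self._last_cnp_us.get(src_qp)
        if last is not None and now_us - last < self.min_period_us:
            return False                            # still inside the quiet period
        self._last_cnp_us[src_qp] = now_us
        return True


np_side = CnpCoalescer(min_period_us=50.0)
# 100 CE-marked packets from QP 7, one per microsecond: only two CNPs go out.
sent = sum(np_side.on_ce_packet(7, float(t)) for t in range(100))
print(sent)  # 2
```

State is kept per QP, so a second congested flow gets its own CNP immediately instead of waiting behind the first.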
DCQCN — the algorithm that uses ECN
DCQCN (Data Center Quantized Congestion Notification) is the rate-adjustment algorithm that lives on the sender's NIC. It's the canonical RoCE v2 congestion controller, published at SIGCOMM 2015 by Microsoft Research and now the default in nearly every production RoCE v2 fabric.
The papers and vendor docs name three roles in the loop — learn them, because every DCQCN knob is filed under one of them:
- CP — Congestion Point = the switch. It detects the queue building past the WRED thresholds and sets CE (the WRED marking above). It's stateless about flows; it just marks.
- NP — Notification Point = the receiver NIC. It sees CE, then generates and coalesces the CNP. NP-side knobs (`np_cnp_dscp`, `cnp_min_period`) control how the alarm is raised.
- RP — Reaction Point = the sender NIC. It receives CNPs and runs the rate controller. RP-side knobs (`α` update, Rate AI/HAI, fast recovery) control how the sender backs off and recovers.
DCQCN borrows from two places: the rate controller (RP) comes from QCN, and the signal (CP/NP) comes from ECN. It is quantized congestion notification carried by ECN marks in the IP header instead of QCN's L2 frames, so it survives being routed across an IP fabric.
Key behaviors:
- Rate-based, not window-based. Each QP has a target send rate that gets adjusted up and down.
- Multiplicative decrease on CNP arrival: `rate ← rate × (1 - α/2)`, where `α` is an EMA of CNP signal strength.
- Additive increase when no CNPs arrive: `rate ← rate + R_AI` per timer interval.
- Fast recovery — when CNPs stop, ramp up aggressively for the first few intervals to recover bandwidth.
DCQCN's tunables (per-NIC):
| Knob | Typical | What it controls |
|---|---|---|
| Kmin / Kmax (echoed thresholds) | match switch WRED | When to start reacting |
| α update interval | 50–500 μs | How quickly the rate-cut signal averages |
| Rate AI (additive increase) | 5 Mbps | How fast to ramp back up |
| Rate HAI (hyper-active increase) | 50 Mbps | How fast to recover after a long pause |
| Fast recovery threshold | 5 cycles | How long to wait before HAI kicks in |
The defaults from the NVIDIA driver work well for most fabrics. Tuning is needed only when you have a non-standard topology, oversubscription pattern, or buffer profile.
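The decrease and recovery behavior can be sketched as a toy reaction point. This is a deliberately simplified model, not the NIC's implementation: it uses one timer for both the α decay and the rate increase, omits the byte counter and the hyper-active increase stage, and treats the EMA gain `g = 1/256` as an assumed value. The class and method names are hypothetical.

```python
class ReactionPointSketch:
    """Simplified DCQCN-style rate controller for one QP (rates in Mbps)."""

    def __init__(self, line_rate_mbps: float, rate_ai_mbps: float = 5.0,
                 fast_recovery_steps: int = 5, g: float = 1 / 256):
        self.line_rate_mbps = line_rate_mbps
        self.rate_ai_mbps = rate_ai_mbps
        self.fast_recovery_steps = fast_recovery_steps
        self.g = g                      # assumed EMA gain for alpha
        self.alpha = 1.0                # congestion estimate, starts pessimistic
        self.current_mbps = line_rate_mbps
        self.target_mbps = line_rate_mbps
        self.quiet_steps = 0            # timer ticks since the last CNP

    def on_cnp(self) -> None:
        """Multiplicative decrease: remember the old rate, cut, raise alpha."""
        self.target_mbps = self.current_mbps
        self.current_mbps *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g
        self.quiet_steps = 0

    def on_quiet_timer(self) -> None:
        """One timer tick with no CNP: decay alpha and move the rate back up."""
        self.alpha *= 1 - self.g
        self.quiet_steps += 1
        if self.quiet_steps > self.fast_recovery_steps:
            # Additive increase: nudge the target up by Rate AI, capped at line rate.
            self.target_mbps = min(self.target_mbps + self.rate_ai_mbps,
                                   self.line_rate_mbps)
        # Fast recovery and additive increase both close half the gap to the target.
        self.current_mbps = (self.current_mbps + self.target_mbps) / 2


rp = ReactionPointSketch(line_rate_mbps=400_000.0)
rp.on_cnp()
print(rp.current_mbps)                  # 200000.0
for _ in range(5):
    rp.on_quiet_timer()
print(rp.current_mbps)                  # 393750.0
```

In this model a single CNP at line rate halves the rate because α starts at 1; as α decays during quiet periods, later cuts become gentler.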
How to actually configure
Watch the tune-and-verify loop on the rockynet lab simulator — inspect baseline Kmin/Kmax on the RoCE traffic class, drop Kmin from 150 KB to 80 KB so ECN marks earlier, then drive the same load and see ECN-marked-TX jump 27× while PFC PAUSE stays at zero:
End-to-end ECN setup involves three layers — switch, sender NIC, receiver NIC. The classic mistake is configuring only one of the three.
On the switch — WRED marking on queue 3 (the RoCE priority), between 50 KB and 1 MB of depth, max 10% probability. Same intent, four dialects:
Arista EOS:

    qos profile lossless
      queue 3 ecn min 51200 max 1024000 probability 0.1
    interface Ethernet1/1
      service-policy type qos input lossless

Cisco NX-OS:

    policy-map type queuing ROCE-OUT
      class type queuing c-out-8q-q3
        random-detect minimum-threshold 50 kbytes maximum-threshold 1000 kbytes drop-probability 7 weight 0 ecn
    interface Ethernet1/1
      service-policy type queuing output ROCE-OUT

The ecn keyword on random-detect is what turns WRED dropping into WRED marking — without it the switch drops instead of setting CE.

Juniper Junos:

    set class-of-service drop-profiles ROCE-ECN interpolate fill-level [ 20 80 ] drop-probability [ 0 10 ]
    set class-of-service schedulers ROCE-SCHED drop-profile-map loss-priority low protocol any drop-profile ROCE-ECN
    set class-of-service schedulers ROCE-SCHED explicit-congestion-notification

The drop-profile fill-level/probability interpolation is the WRED band; explicit-congestion-notification on the scheduler makes it mark CE instead of dropping.

NVIDIA Spectrum:

    nv set qos roce congestion-control ecn
    nv set qos roce ecn threshold-min 51200
    nv set qos roce ecn threshold-max 1024000
    nv set qos roce ecn probability 10
    nv config apply
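All four express the same intent: on queue 3, mark CE between roughly 50 KB and 1 MB of queue depth, with up to about 10% probability. The units differ by platform: the Junos drop profile expresses the band as fill-level percentages rather than bytes, and the NX-OS line above sets drop-probability to 7, so check each value against your platform's documentation before copying.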
On the sender NIC (Mellanox/NVIDIA, via mlnx_qos and the driver's sysfs entries):
mlnx_qos -i eth0 --trust dscp
echo 1 > /sys/class/net/eth0/ecn/roce_np/enable/3
echo 1 > /sys/class/net/eth0/ecn/roce_rp/enable/3
This enables both the NP (Notification Point) and RP (Reaction Point) sides of DCQCN on priority 3.
On the receiver NIC:
Same NP enable as the sender — receivers must be allowed to generate CNPs in response to CE-marked packets.
What can go wrong
The three most common ECN misconfigurations:
- DSCP not preserved across hops. A switch in the middle rewrites the ToS byte instead of preserving it, the ECN bits get reset, and the marking doesn't reach the receiver. Always trust DSCP end to end.
- NIC ECN disabled. Some default NIC images ship with ECN/DCQCN off. The fabric marks but the NIC ignores the marks. Verify with `ethtool --show-priv-flags` or driver-specific commands.
- WRED thresholds too high. A min threshold set close to max means ECN never marks at moderate congestion, so PFC has to handle it. Tune `min_th` down (within reason) so DCQCN gets early warnings.
The classic failure: configuring switch-side ECN only and forgetting the NIC side. The fabric dutifully marks CE bits, the receiver NIC ignores them, and the sender never sees a CNP, so congestion control is completely silent while PFC keeps firing. Verify all three layers the moment you turn ECN on: switch WRED is marking (counters going up), receiver NP is sending CNPs (ethtool stats), sender RP is throttling (NIC rate counters). One missing layer = no DCQCN.
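The three-layer check can be expressed as a small triage function over counter deltas taken during a load test. This is a sketch: the function name is made up, the inputs are whatever "CE marked", "CNP sent", and "CNP handled" counters your switch and NIC expose (for example the `np_cnp_sent` and `rp_cnp_handled` counters mentioned earlier), and collecting them is left out.

```python
def diagnose_ecn_loop(switch_ce_marks: int, np_cnp_sent: int, rp_cnp_handled: int) -> str:
    """Walk the loop in order and report the first layer that shows no activity."""
    if switch_ce_marks == 0:
        return "switch: no CE marks (ECN marking off, or no congestion during the test)"
    if np_cnp_sent == 0:
        return "receiver NP: CE marks seen in the fabric but no CNPs sent"
    if rp_cnp_handled == 0:
        return "sender RP: CNPs sent but none handled (RP disabled or CNPs lost)"
    return "ok: all three layers show activity"


# Counter deltas from a hypothetical test run: the switch marks, nothing else reacts.
print(diagnose_ecn_loop(switch_ce_marks=120_000, np_cnp_sent=0, rp_cnp_handled=0))
```

Checking in loop order matters because each layer depends on the one before it: a silent RP is only meaningful once the NP is known to be sending.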
💡 What you should remember
| # | | Concept | Why it matters |
|---|---|---|---|
| 1 | 🚨 | ECN marks packets at congested egress | Two bits in the IP TOS byte. CE (11) is set by the switch. Non-destructive — the packet still reaches the destination. |
| 2 | 📈 | WRED / AQM = probabilistic marking | Below min_th: no marks. Above max_th: every packet. In between: linear ramp to max_p. Tune from vendor reference, don't invent. |
| 3 | 📨 | CNP carries the bad news back | Receiver NIC sees CE → emits a tiny RDMA control packet → sender's DCQCN engine throttles. Loop is fully NIC-offloaded, no CPU. |
| 4 | 🎚️ | DCQCN is rate-based, not window-based | Multiplicative decrease on CNP arrival, additive increase when CNPs stop. Hyper-active increase for fast recovery. |
| 5 | 🔗 | Three layers must agree | Switch WRED · sender NIC RP · receiver NIC NP. Any one missing and the control loop is silent. |
| 6 | 🪢 | If PFC fires, ECN tuning is wrong | Lower min_th so DCQCN gets earlier warnings, or fix DSCP preservation. PFC is the safety net, not the primary signal. |
| 7 | 🧭 | It is not TCP's ECN | Marking is the same IP bits (L3, protocol-agnostic). The echo differs: TCP rides ECE/CWR on the ACK; RoCE runs on UDP with no ACK, so the receiver NIC mints a CNP (opcode 0x81). Window vs rate. |
Next: DCQCN — Buffer Profiles & Tuning at Scale → — putting it all together with real buffer profiles and field-tested values.