
ECN — Explicit Congestion Notification

ECN is the early warning system. Where PFC stops the sender after a buffer fills, ECN tells the sender to slow down before the buffer fills.

Per the standard (RFC 3168), ECN uses two bits in the IP header. The switch marks them on packets traversing congested egress; the receiver echoes back to the sender; the sender adjusts its rate.

DCQCN feedback loop. Sender NIC (RP, Reaction Point) transmits at high rate to a switch. Switch's queue fills, crosses the ECN threshold, switch sets the CE bit on packets. ECN-marked packets reach the receiver NIC (NP, Notification Point) which emits a CNP back to the sender. Sender's DCQCN logic reduces rate (multiplicative decrease). As ECN marks stop, CNPs stop, sender ramps rate back up. 6 numbered steps in two columns explain the cycle.
ECN fires *before* PFC by design. CNP carries the bad news back. DCQCN turns the bad news into a rate adjustment. Goal: keep queues short enough that PFC almost never has to fire.

This page covers the mechanics and the tuning that matters for AI fabrics.

After this page, you'll be able to
  1. Decode the ECN bits — Not-ECT / ECT(0) / ECT(1) / CE — and explain why the CE bit is non-destructive (the packet still arrives).
  2. Walk the WRED marking curve: min_th, max_th, max_p, and what each does to congestion signal strength.
  3. Trace the DCQCN loop end-to-end — switch marks → CNP back → multiplicative decrease → CNP gap → additive increase.
  4. Configure ECN on all three layers — switch WRED, sender NIC RP, receiver NIC NP. Know why missing any one breaks the loop.

The ECN bits

In the IP TOS / DSCP byte, the lowest two bits are ECN:

| Bits | Code Point | Meaning |
|------|------------|---------|
| 00 | Not-ECT | Not ECN-capable (legacy / unmarked) |
| 01 | ECT(1) | ECN-capable, "ECT one" |
| 10 | ECT(0) | ECN-capable, "ECT zero" |
| 11 | CE | Congestion Experienced (marked by switch) |

The sender sets ECT(0) or ECT(1) when emitting a packet (RoCE v2 NICs do ECT(0) by default). When a congested switch decides to mark, it rewrites those bits to CE (11). The receiver sees CE and signals the sender to slow down.

The marking is non-destructive — the packet still reaches the destination, just with a flag set. That's the key difference from drop-based congestion signaling.
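As a minimal sketch of the code points above, the following Python decodes the two ECN bits and models the switch-side rewrite. The function names are illustrative, not part of any NIC or switch API.

```python
# The four ECN code points, keyed by the low two bits of the ToS/DS byte.
ECN_NAMES = {0b00: "Not-ECT", 0b01: "ECT(1)", 0b10: "ECT(0)", 0b11: "CE"}
CE = 0b11


def ecn_of(tos: int) -> str:
    """Return the ECN code point name carried in a ToS/DS byte."""
    return ECN_NAMES[tos & 0b11]


def mark_ce(tos: int) -> int:
    """Model a congested switch marking a packet.

    Only the two ECN bits change; the DSCP bits and the packet itself are
    left alone, which is what makes the signal non-destructive.
    """
    if tos & 0b11 == 0b00:
        # Not-ECT traffic cannot be marked; RFC 3168 has the queue drop
        # such packets instead when it needs to signal congestion.
        return tos
    return tos | CE


tos = 0x6A                      # DSCP 26 with ECT(0)
print(ecn_of(tos))              # ECT(0)
print(ecn_of(mark_ce(tos)))     # CE
```

Note that `mark_ce` leaves the upper six bits untouched, so the packet stays in the same traffic class after marking.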

Exactly where the bits live

The two ECN bits are the low 2 bits of the second byte of the IPv4 header, the byte the IETF calls the Differentiated Services (DS) field (the old ToS byte). IPv6 carries the same 8 bits as its Traffic Class field:

The 8-bit IP DS byte. Bits 7..2 are DSCP (shown as 011010 = DSCP 26) — the switch reads these to pick the queue (26 maps to priority 3). Bits 1..0 are ECN — the switch writes these to signal congestion. The four ECN code points are listed: 00 Not-ECT, 01 ECT(1), 10 ECT(0) which RoCE senders set by default, and 11 CE which a congested switch writes to mark Congestion Experienced.
One byte, two jobs — and it lives in the IP header (L3). Identical whether the L4 above is TCP, UDP, or RoCE v2; the switch never reads L4 to mark.

One byte does double duty. The switch reads the top 6 bits (DSCP) on ingress to decide which queue the packet lands in, and — if that queue is congested past the WRED threshold — rewrites the bottom 2 bits (ECN) to CE on the way out. Same byte, two jobs. This is L3: it sits in the IP header, so it is identical whether the payload above is TCP, UDP, or a RoCE v2 packet. The switch never looks at L4 to mark.
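The split of the DS byte can be sketched in a few lines of Python. The DSCP-to-priority map below only contains the two mappings this page mentions (26 to priority 3, 48 to priority 6); a real switch holds a full, configurable table.

```python
# Illustrative DSCP -> priority map using only the values named on this page.
DSCP_TO_PRIORITY = {26: 3, 48: 6}


def split_ds(ds: int) -> tuple[int, int]:
    """Split the 8-bit DS byte into (dscp, ecn)."""
    return ds >> 2, ds & 0b11


def join_ds(dscp: int, ecn: int) -> int:
    """Rebuild the DS byte from a 6-bit DSCP and a 2-bit ECN field."""
    return ((dscp & 0x3F) << 2) | (ecn & 0b11)


dscp, ecn = split_ds(0b01101010)
print(dscp, ecn)                       # 26 2  (DSCP 26, ECT(0))
print(DSCP_TO_PRIORITY[dscp])          # 3
print(hex(join_ds(dscp, 0b11)))        # 0x6b  (same DSCP, now CE)
```

The read path (DSCP to queue) and the write path (ECN to CE) touch disjoint bits, which is why the two jobs never interfere.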


Is this TCP's ECN? No — and the difference is the whole point

ECN marking is the same everywhere (RFC 3168, the IP bits above). What differs — completely — is how the bad news gets back to the sender, and what the sender does with it. This is the question architects actually care about:

| | Classic TCP ECN | DCTCP | RoCE v2 (DCQCN) |
|---|---|---|---|
| L4 transport | TCP | TCP | UDP (port 4791) |
| How CE is echoed | receiver sets the ECE flag in the TCP header of its ACKs | receiver echoes every CE; sender estimates the marked fraction α | receiver NIC mints a separate CNP packet (no ACK to ride on; UDP has none) |
| Sender reaction | halve the congestion window (cwnd), set the CWR flag | cut window proportionally: cwnd × (1 − α/2) | cut the QP's send rate (hardware rate limiter), per DCQCN |
| Control variable | window | window | rate |
| Reaction lives in | the TCP stack (kernel) | the TCP stack (kernel) | NIC silicon (no CPU) |
| Granularity | ~1 reaction / RTT | proportional, per RTT | continuous rate, coalesced CNPs |

The key realization: RoCE v2 runs over UDP, and UDP has no ACKs and no header flags. TCP can piggyback congestion feedback (ECE/CWR are two bits in a header that's already flowing back as part of the reliable byte-stream). RoCE can't — so the receiving NIC has to manufacture a brand-new packet, the CNP, purely to carry the word "slow down" upstream. And because RDMA has no congestion window, the sender doesn't shrink a window — it turns a rate knob. Window-based vs rate-based is the deepest structural difference between TCP congestion control and RoCE congestion control.
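To make the window-versus-rate contrast concrete, here is a hedged sketch of the three sender reactions from the table. The DCTCP gain `g = 1/16` is the value commonly quoted for DCTCP and is an assumption here; the function names are illustrative.

```python
def tcp_ecn_react(cwnd: float) -> float:
    """Classic TCP ECN: halve the congestion window once per RTT with ECE."""
    return cwnd / 2


def dctcp_update_alpha(alpha: float, marked_fraction: float, g: float = 1 / 16) -> float:
    """DCTCP: running estimate of the fraction of packets that were CE-marked."""
    return (1 - g) * alpha + g * marked_fraction


def dctcp_react(cwnd: float, alpha: float) -> float:
    """DCTCP: cut the window in proportion to the marked fraction."""
    return cwnd * (1 - alpha / 2)


def dcqcn_react(rate_mbps: float, alpha: float) -> float:
    """DCQCN: same shape of cut, but applied to a send rate, not a window."""
    return rate_mbps * (1 - alpha / 2)


print(tcp_ecn_react(100))            # window of 100 segments halves
print(dctcp_react(100, 0.2))         # light marking, gentle window cut
print(dcqcn_react(400_000, 1.0))     # a 400G QP with alpha = 1 drops to half rate
```

The DCTCP and DCQCN cuts share the same formula; what differs is the variable being cut and where the code runs.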


How the switch marks

Most modern AI switches use WRED (Weighted Random Early Detection), or another AQM (Active Queue Management) scheme, for ECN marking:

  • Minimum threshold (min_th): below this queue depth, no marking.
  • Maximum threshold (max_th): at or above this, mark every packet.
  • Maximum probability (max_p): the marking probability reached as the queue approaches max_th. Between min_th and max_th, the probability rises linearly from 0 to max_p.
marking
probability
  1.0 |                  __________
      |                 |
      |                 |
max_p |.................|
      |             /   :
      |         /       :
  0.0 |_____/___________:__________ queue depth
         min_th      max_th
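The curve can be written as a small function. This is a sketch of the three-segment behaviour described in the list above, not any vendor's implementation; the `rng` parameter is only there so the random draw can be replaced in tests.

```python
import random


def mark_probability(depth: float, min_th: float, max_th: float, max_p: float) -> float:
    """Marking probability for a given queue depth (same units as the thresholds)."""
    if depth < min_th:
        return 0.0                      # below min_th: never mark
    if depth >= max_th:
        return 1.0                      # at or above max_th: mark every packet
    # Linear ramp from 0 at min_th up to max_p just below max_th.
    return max_p * (depth - min_th) / (max_th - min_th)


def should_mark(depth, min_th, max_th, max_p, rng=random.random) -> bool:
    """Per-packet decision: draw once and compare against the curve."""
    return rng() < mark_probability(depth, min_th, max_th, max_p)


# Thresholds in KB: min_th = 50, max_th = 1000, max_p = 0.1
for depth in (10, 50, 525, 999, 1000):
    print(depth, round(mark_probability(depth, 50, 1000, 0.1), 4))
```

Halfway up the ramp (525 KB here) the probability is half of max_p, which is why lowering min_th strengthens the signal at moderate queue depths.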

For RoCE v2 + DCQCN, typical tunings:

| Parameter | Typical value (400G fabric) |
|-----------|-----------------------------|
| min_th | 50–150 KB |
| max_th | 1–2 MB |
| max_p | 0.1–0.2 (10–20% marking probability at max_th) |

These values are per-vendor and per-buffer-size; the numbers above are starting points. Vendors publish reference values (NVIDIA Spectrum-X, Arista AI references) that you should read before tuning.


The ECN feedback loop

End-to-end:

  1. Sender NIC sends RoCE v2 packets with ECT(0) set.
  2. Switch at the congested egress marks ECN to CE on a fraction of those packets (per WRED/AQM curve).
  3. Receiver NIC sees the CE-marked packets, generates a CNP (Congestion Notification Packet — a one-packet RDMA control message) and sends it back to the sender.
  4. Sender NIC's DCQCN engine sees the CNP and dials down the sender's rate for that flow / QP.
  5. Sender keeps sending at the lower rate; queue at the switch drains.
  6. If no further CNPs arrive for a window, sender ramps the rate back up.

The whole loop is fully NIC-offloaded, with no CPU involvement on either end. CNPs are small (roughly 74 bytes on the wire, with no data payload) and ride the RoCE v2 path in reverse.
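The six steps can be watched in a deliberately crude toy model. Everything here is an assumption made for illustration: one sender, one queue, abstract units per tick, a hard marking threshold instead of a WRED ramp, a fixed 50% cut instead of DCQCN's α, and a CNP that arrives in the same tick.

```python
def simulate(steps=200, link=100.0, start_rate=150.0, mark_th=50.0,
             cut=0.5, ramp=2.0):
    """Toy closed loop: returns a list of (rate, queue_depth, cnp_sent) per tick."""
    rate, queue = start_rate, 0.0
    history = []
    for _ in range(steps):
        # Steps 1 and 5: sender transmits, switch drains at link speed.
        queue = max(0.0, queue + rate - link)
        # Step 2: switch marks CE once the queue is past the threshold.
        marked = queue > mark_th
        # Step 3: receiver turns the CE mark into a CNP.
        cnp = marked
        if cnp:
            rate *= (1 - cut)           # Step 4: sender cuts its rate
        else:
            rate += ramp                # Step 6: no CNP, ramp back up
        history.append((rate, queue, cnp))
    return history


hist = simulate()
print("peak queue:", max(q for _, q, _ in hist))
print("CNPs sent:", sum(1 for _, _, c in hist if c))
```

Even in this crude form the sawtooth is visible: the rate overshoots the link, the queue crosses the threshold, a CNP cuts the rate, the queue drains, and the ramp starts again.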


The CNP, on the wire

The CNP is a real RoCE v2 packet, not a flag — so it's worth knowing exactly what it is, because you'll grep for it in NIC counters (np_cnp_sent, rp_cnp_handled) when a fabric misbehaves.

The RoCE v2 CNP packet on the wire: Ethernet 14 B, IP 20 B, UDP 8 B (dst port 4791), BTH 12 B carrying opcode 0x81, a 16 B reserved field, and ICRC 4 B. The BTH with opcode 0x81 is the hero field; it has a zero-byte payload and is addressed to the source queue pair. The CNP travels from the receiver NIC back to the sender NIC, riding UDP 4791 in reverse on its own DSCP so it does not queue behind the congestion it reports.
TCP has no CNP — the ECE bit rides the ACK that was already flowing back. RoCE runs on UDP with no ACK, so it spends a whole packet just to carry 'slow down.'
  • BTH opcode 0x81 is the RoCE v2 CNP — a dedicated opcode that means nothing but "you, the source QP in this header, are causing congestion." It carries zero data payload.
  • It's addressed to the source QP of the CE-marked packet, so the sender NIC knows precisely which of its thousands of queue pairs to throttle — not the whole link, one flow.
  • The receiver NIC coalesces CNPs: at most one per congested QP per cnp_min_period (often ~ 4–50 µs, NIC-tunable). A storm of CE marks does not become a storm of CNPs — otherwise the feedback would itself congest the reverse path.
  • CNPs usually travel on their own DSCP / priority (DSCP 48, priority 6 is the common convention), often in an expedited or separate lossless class, so the "slow down" message never gets stuck in the queue behind the very traffic it's reporting on. A CNP that arrives late is a CNP that didn't help.
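The coalescing rule in the list above can be sketched as a per-QP timestamp check. The class name and the 50 µs default are illustrative; real NICs implement this in hardware and expose the period as a tunable.

```python
class CnpCoalescer:
    """At most one CNP per source QP per min_period_us."""

    def __init__(self, min_period_us: float = 50.0):
        self.min_period_us = min_period_us
        self._last_sent = {}            # source QP -> time of last CNP (µs)

    def on_ce_packet(self, src_qp: int, now_us: float) -> bool:
        """Called for every CE-marked packet; returns True if a CNP is emitted."""
        last = self._last_sent.get(src_qp)
        if last is not None and now_us - last < self.min_period_us:
            return False                # still inside the quiet period: suppress
        self._last_sent[src_qp] = now_us
        return True


np_side = CnpCoalescer(min_period_us=50.0)
arrivals = (0, 10, 49, 50, 60, 100)     # CE-marked packets from QP 7, in µs
print([np_side.on_ce_packet(7, t) for t in arrivals])
# [True, False, False, True, False, True]
```

Six CE marks become three CNPs, and a different QP keeps its own independent timer.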

Contrast this with TCP: there is no "CNP" in TCP because the ECE bit is already riding the ACK that TCP was going to send anyway. RoCE has to spend a whole packet — that's the price of putting reliability and congestion control in the NIC instead of a transport stack.
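For a feel of how little a CNP carries, the sketch below packs the 12-byte BTH plus the 16 reserved bytes and adds up the wire sizes quoted above. The field layout follows the standard InfiniBand BTH; the P_Key default, the zero flags byte and the zero PSN are assumptions for illustration, and the ICRC is not computed.

```python
import struct

CNP_OPCODE = 0x81

# Sizes as listed for the CNP on the wire (bytes).
HEADER_BYTES = {"ethernet": 14, "ip": 20, "udp": 8, "bth": 12,
                "reserved": 16, "icrc": 4}


def build_cnp_transport(dest_qp: int, pkey: int = 0xFFFF) -> bytes:
    """Return BTH (12 B) + reserved field (16 B) for a CNP aimed at dest_qp."""
    if not 0 <= dest_qp < (1 << 24):
        raise ValueError("QP numbers are 24 bits")
    bth = struct.pack(
        "!BBHII",
        CNP_OPCODE,     # opcode 0x81 = CNP
        0,              # SE / M / PadCnt / TVer, assumed zero here
        pkey,           # partition key, assumed default
        dest_qp,        # top byte reserved, low 24 bits = destination QP
        0,              # ack-request bit + 24-bit PSN, assumed zero
    )
    return bth + bytes(16)              # 16 reserved bytes, no data payload


pkt = build_cnp_transport(dest_qp=0x001234)
print(len(pkt), hex(pkt[0]))            # 28 0x81
print(sum(HEADER_BYTES.values()))       # 74
```

The destination QP in the BTH is what lets the sender NIC throttle exactly one flow rather than the whole port.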

This is also why the end-to-end RoCE v2 transaction page matters: the CNP shares the wire and the QP machinery (PSN, ACK/NAK) with the data path, all in silicon.


DCQCN — the algorithm that uses ECN

DCQCN (Data Center Quantized Congestion Notification) is the rate-adjustment algorithm that lives on the sender's NIC. It's the canonical RoCE v2 congestion controller, published at SIGCOMM 2015 by Microsoft Research and now the default in nearly every production RoCE v2 fabric.

The papers and vendor docs name three roles in the loop — learn them, because every DCQCN knob is filed under one of them:

  • CP — Congestion Point = the switch. It detects the full queue and sets CE (the WRED marking above). It's stateless about flows; it just marks.
  • NP — Notification Point = the receiver NIC. It sees CE, generates and coalesces the CNP. NP-side knobs (np_cnp_dscp, cnp_min_period) control how the alarm is raised.
  • RP — Reaction Point = the sender NIC. It receives CNPs and runs the rate controller. RP-side knobs (α update, Rate AI/HAI, fast-recovery) control how the sender backs off and recovers.

DCQCN is a hybrid: the rate controller (RP) comes from QCN, and the signal (CP/NP) comes from ECN. It is quantized congestion notification carried by ECN marks instead of QCN's L2 frames, so it survives being routed across an IP fabric.

Key behaviors:

  • Rate-based, not window-based. Each QP has a target send rate that gets adjusted up and down.
  • Multiplicative decrease on CNP arrival: rate ← rate × (1 - α/2), where α is an EMA of CNP signal strength.
  • Additive increase when no CNPs arrive: rate ← rate + R_AI per timer interval.
  • Fast recovery — when CNPs stop, ramp up aggressively for the first few intervals to recover bandwidth.
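The behaviors above can be combined into a simplified reaction-point model. This is a sketch under stated assumptions, not NIC firmware: a single "quiet interval" event stands in for DCQCN's separate timer and byte counter, the gain `g = 1/256` and the point at which the hyper stage starts (`2 × fast_recovery_steps`) are illustrative choices, and the AI/HAI rates reuse the typical values quoted on this page.

```python
class DcqcnReactionPoint:
    """Simplified per-QP DCQCN rate controller (rates in Mbps)."""

    def __init__(self, line_rate_mbps, g=1 / 256, rate_ai_mbps=5.0,
                 rate_hai_mbps=50.0, fast_recovery_steps=5):
        self.line_rate = line_rate_mbps
        self.g = g
        self.rate_ai = rate_ai_mbps
        self.rate_hai = rate_hai_mbps
        self.fast_recovery_steps = fast_recovery_steps
        self.alpha = 1.0                 # congestion estimate, starts pessimistic
        self.current = line_rate_mbps    # rate actually enforced
        self.target = line_rate_mbps     # rate to climb back toward
        self.quiet_steps = 0

    def on_cnp(self):
        """Multiplicative decrease: remember the old rate, then cut by alpha/2."""
        self.target = self.current
        self.current *= (1 - self.alpha / 2)
        self.alpha = (1 - self.g) * self.alpha + self.g
        self.quiet_steps = 0

    def on_quiet_interval(self):
        """One timer interval with no CNP: decay alpha and raise the rate."""
        self.alpha *= (1 - self.g)
        self.quiet_steps += 1
        if self.quiet_steps > 2 * self.fast_recovery_steps:
            self.target += self.rate_hai         # hyper stage (assumed trigger)
        elif self.quiet_steps > self.fast_recovery_steps:
            self.target += self.rate_ai          # additive increase
        # Fast recovery and later stages all move halfway toward the target.
        self.target = min(self.target, self.line_rate)
        self.current = min((self.current + self.target) / 2, self.line_rate)


rp = DcqcnReactionPoint(line_rate_mbps=400_000)
rp.on_cnp()
print(rp.current)                        # 200000.0 after the first cut
for _ in range(5):
    rp.on_quiet_interval()
print(rp.current)                        # 393750.0 after five recovery steps
```

Halving the gap to the target on each quiet interval is what makes recovery fast at first, while the additive and hyper stages only matter once the target itself needs to move.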

DCQCN's tunables (per-NIC):

| Knob | Typical | What it controls |
|------|---------|------------------|
| Kmin / Kmax (echoed thresholds) | match switch WRED | When to start reacting |
| α update interval | 50–500 μs | How quickly the rate-cut signal averages |
| Rate AI (additive increase) | 5 Mbps | How fast to ramp back up |
| Rate HAI (hyper-active increase) | 50 Mbps | How fast to recover after a long pause |
| Fast recovery threshold | 5 cycles | How long to wait before HAI kicks in |

The defaults from the NVIDIA driver work well for most fabrics. Tuning is needed only when you have a non-standard topology, oversubscription pattern, or buffer profile.


How to actually configure

Watch the tune-and-verify loop on the rockynet lab simulator — inspect baseline Kmin/Kmax on the RoCE traffic class, drop Kmin from 150 KB to 80 KB so ECN marks earlier, then drive the same load and see ECN-marked-TX jump 27× while PFC PAUSE stays at zero:

MODULE switch-qos · LAB 2: Watch the recording (every command, every counter, every output).

End-to-end ECN setup involves three layers — switch, sender NIC, receiver NIC. The classic mistake is configuring only one of the three.

On the switch: WRED marking on queue 3 (the RoCE priority), between 50 KB and 1 MB of depth, max 10% probability. The intent is the same in every vendor dialect; one representative form:

qos profile lossless
  queue 3 ecn min 51200 max 1024000 probability 0.1
interface Ethernet1/1
  service-policy type qos input lossless

Whatever the dialect, the configuration says the same thing: on queue 3, mark CE between 50 KB and 1 MB of queue depth, with up to 10% probability.
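When thresholds are kept in KB in a spreadsheet or inventory, a small helper can sanity-check them and render the queue line. This sketch targets only the generic syntax shown above; it is not a real vendor CLI generator, and `kb` uses 1 KB = 1024 bytes, which is how 50 KB becomes 51200.

```python
def kb(n: int) -> int:
    """Convert KB to bytes (1 KB = 1024 bytes, matching 50 KB -> 51200)."""
    return n * 1024


def render_wred(queue: int, min_bytes: int, max_bytes: int, max_p: float) -> str:
    """Validate WRED/ECN parameters and render the generic queue line."""
    if not 0 < min_bytes < max_bytes:
        raise ValueError("need 0 < min_th < max_th")
    if not 0.0 < max_p <= 1.0:
        raise ValueError("max_p must be in (0, 1]")
    return f"queue {queue} ecn min {min_bytes} max {max_bytes} probability {max_p}"


print(render_wred(3, kb(50), kb(1000), 0.1))
# queue 3 ecn min 51200 max 1024000 probability 0.1
```

Rejecting `min_th >= max_th` up front catches the "thresholds too close together" mistake before it reaches a switch.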

On the sender NIC (Mellanox/NVIDIA, using mlnx_qos and sysfs):

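# Classify traffic by DSCP (rather than PCP) on this port
mlnx_qos -i eth0 --trust dscp
# Enable the DCQCN Notification Point (CNP generation) on priority 3
echo 1 > /sys/class/net/eth0/ecn/roce_np/enable/3
# Enable the DCQCN Reaction Point (rate control on CNP arrival) on priority 3
echo 1 > /sys/class/net/eth0/ecn/roce_rp/enable/3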

This enables both the NP (Notification Point) and RP (Reaction Point) sides of DCQCN on priority 3.

On the receiver NIC:

Same NP enable as the sender — receivers must be allowed to generate CNPs in response to CE-marked packets.
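A read-only audit of those sysfs switches is easy to script. The sketch below assumes the sysfs layout shown in the commands above and only reads the files; the `root` parameter is a hypothetical hook so the function can be pointed at a test directory, and a missing file is reported as `None` rather than treated as "disabled".

```python
from pathlib import Path


def dcqcn_enable_paths(iface: str, priority: int, root: str = "/sys/class/net"):
    """Paths of the NP and RP enable switches for one interface and priority."""
    base = Path(root) / iface / "ecn"
    return {
        "np": base / "roce_np" / "enable" / str(priority),
        "rp": base / "roce_rp" / "enable" / str(priority),
    }


def audit_dcqcn(iface: str, priority: int, root: str = "/sys/class/net"):
    """Return {'np': True/False/None, 'rp': True/False/None}; None = unreadable."""
    result = {}
    for role, path in dcqcn_enable_paths(iface, priority, root).items():
        try:
            result[role] = path.read_text().strip() == "1"
        except OSError:
            result[role] = None          # no such NIC, driver, or permission
    return result


print(audit_dcqcn("eth0", 3))
```

Run it on both ends of a flow: the sender needs `rp` enabled, the receiver needs `np` enabled, and enabling both on every host keeps the roles symmetric.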


What can go wrong

The three most common ECN misconfigurations:

  1. DSCP not preserved across hops. A switch in the middle rewrites the ToS byte: the DSCP is cleared, so later hops put the packet in the wrong queue, and if the ECN bits are reset with it, the CE mark never reaches the receiver. Always trust DSCP end to end.
  2. NIC ECN disabled. Some default NIC images ship with ECN/DCQCN off. The fabric marks but the NIC ignores. Verify with ethtool --show-priv-flags or driver-specific commands.
  3. WRED thresholds too high. Min threshold close to max means ECN never marks at moderate congestion — PFC has to handle it. Tune min_th down (within reason) so DCQCN gets early warnings.
Anti-pattern

Configuring switch-side ECN only and forgetting the NIC side. The fabric dutifully marks CE bits, the receiver dutifully ignores them, and the sender never sees a CNP — congestion control is completely silent while PFC keeps firing. Verify all three layers the moment you turn ECN on: switch WRED is marking (counters going up), receiver NP is sending CNPs (ethtool stats), sender RP is throttling (NIC rate counters). One missing layer = no DCQCN.
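The three-layer check can be expressed as a short decision function over counter deltas taken while the fabric is under load. `np_cnp_sent` and `rp_cnp_handled` are the NIC counters named earlier on this page; `switch_ce_marked` is a placeholder for whatever ECN-marked counter the switch exposes.

```python
def diagnose(switch_ce_marked: int, np_cnp_sent: int, rp_cnp_handled: int) -> str:
    """Walk the loop in order and name the first layer that is silent.

    All arguments are counter increases over the same measurement window,
    taken while the path is actually congested.
    """
    if switch_ce_marked == 0:
        return "switch"          # WRED/ECN not marking (or no congestion yet)
    if np_cnp_sent == 0:
        return "receiver_np"     # CE marks arrive but no CNPs are generated
    if rp_cnp_handled == 0:
        return "sender_rp"       # CNPs are sent but the sender is not reacting
    return "ok"


print(diagnose(switch_ce_marked=12000, np_cnp_sent=0, rp_cnp_handled=0))
# receiver_np  (the anti-pattern above: switch marks, NIC side never enabled)
```

A result of "switch" with no load applied is expected, so the check is only meaningful during a congestion test.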


💡 What you should remember

| # | Concept | Why it matters |
|---|---------|----------------|
| 1 | 🚨 ECN marks packets at congested egress | Two bits in the IP TOS byte. CE (11) is set by the switch. Non-destructive: the packet still reaches the destination. |
| 2 | 📈 WRED / AQM = probabilistic marking | Below min_th: no marks. Above max_th: every packet. In between: linear ramp to max_p. Tune from vendor reference, don't invent. |
| 3 | 📨 CNP carries the bad news back | Receiver NIC sees CE → emits a tiny RDMA control packet → sender's DCQCN engine throttles. Loop is fully NIC-offloaded, no CPU. |
| 4 | 🎚️ DCQCN is rate-based, not window-based | Multiplicative decrease on CNP arrival, additive increase when CNPs stop. Hyper-active increase for fast recovery. |
| 5 | 🔗 Three layers must agree | Switch WRED · sender NIC RP · receiver NIC NP. Any one missing and the control loop is silent. |
| 6 | 🪢 If PFC fires, ECN tuning is wrong | Lower min_th so DCQCN gets earlier warnings, or fix DSCP preservation. PFC is the safety net, not the primary signal. |
| 7 | 🧭 It is not TCP's ECN | Marking is the same IP bits (L3, protocol-agnostic). The echo differs: TCP rides ECE/CWR on the ACK; RoCE runs on UDP with no ACK, so the receiver NIC mints a CNP (opcode 0x81). Window vs rate. |

Next: DCQCN — Buffer Profiles & Tuning at Scale → — putting it all together with real buffer profiles and field-tested values.