
ECN — Explicit Congestion Notification

ECN is the early warning system. Where PFC stops the sender after a buffer fills, ECN tells the sender to slow down before the buffer fills.

Per the standard (RFC 3168), ECN uses two bits in the IP header. The switch marks them on packets traversing congested egress; the receiver echoes back to the sender; the sender adjusts its rate.

[Diagram: the DCQCN feedback loop. The sender NIC (RP, Reaction Point) transmits at high rate toward a switch. The switch's queue fills, crosses the ECN threshold, and the switch sets the CE bit on packets. The marked packets reach the receiver NIC (NP, Notification Point), which emits a CNP back to the sender. The sender's DCQCN logic reduces its rate (multiplicative decrease). As ECN marks stop, CNPs stop, and the sender ramps its rate back up.]
ECN fires *before* PFC by design. CNP carries the bad news back. DCQCN turns the bad news into a rate adjustment. Goal: keep queues short enough that PFC almost never has to fire.

This page covers the mechanics and the tuning that matters for AI fabrics.


The ECN bits

In the IP TOS / DSCP byte, the lowest two bits are ECN:

Bits   Code point   Meaning
00     Not-ECT      Not ECN-capable (legacy / unmarked)
01     ECT(1)       ECN-capable, "ECT one"
10     ECT(0)       ECN-capable, "ECT zero"
11     CE           Congestion Experienced (marked by switch)

The sender sets ECT(0) or ECT(1) when emitting a packet (RoCE v2 NICs do ECT(0) by default). When a congested switch decides to mark, it rewrites those bits to CE (11). The receiver sees CE and signals the sender to slow down.

The marking is non-destructive — the packet still reaches the destination, just with a flag set. That's the key difference from drop-based congestion signaling.
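The ECT-to-CE rewrite can be sketched in a few lines. This is a minimal illustration (not any NIC or switch API), assuming the whole TOS byte is handled as an integer:

```python
# ECN code points live in the two least-significant bits of the TOS/DSCP byte.
NOT_ECT, ECT1, ECT0, CE = 0b00, 0b01, 0b10, 0b11

def set_ect0(tos: int) -> int:
    """Sender: declare the packet ECN-capable with ECT(0), keeping the DSCP bits."""
    return (tos & ~0b11) | ECT0

def mark_ce(tos: int) -> int:
    """Congested switch: rewrite ECT(0)/ECT(1) to CE; leave Not-ECT packets alone."""
    if (tos & 0b11) in (ECT0, ECT1):
        return (tos & ~0b11) | CE
    return tos

tos = set_ect0(26 << 2)   # DSCP 26, ECN field set to ECT(0)
marked = mark_ce(tos)     # DSCP survives; only the two ECN bits change
```

Note that `mark_ce` never touches a Not-ECT packet, which is why legacy traffic falls back to drop-based signaling.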


How the switch marks

Most modern AI switches use WRED (Weighted Random Early Detection), a form of AQM (Active Queue Management), for ECN marking:

  • Minimum threshold (min_th) — below this queue depth, no marking.
  • Maximum threshold (max_th) — at or above this, mark every packet.
  • Maximum marking probability (max_p) — between min_th and max_th, the marking probability ramps linearly from 0 up to max_p.
marking
probability

    1.0 |                 __________
        |                 |
  max_p |................/|
        |              /  :
        |            /    :
    0.0 |_________/_______:_________ queue depth
               min_th    max_th
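The marking curve reduces to a small function. An illustrative sketch with made-up default thresholds (tune per vendor guidance), assuming queue depth measured in KB:

```python
def wred_mark_probability(qdepth_kb: float,
                          min_th: float = 100.0,   # KB, illustrative only
                          max_th: float = 1500.0,  # KB, illustrative only
                          max_p: float = 0.1) -> float:
    """ECN marking probability for a given egress queue depth."""
    if qdepth_kb < min_th:
        return 0.0                                  # below min_th: never mark
    if qdepth_kb >= max_th:
        return 1.0                                  # at/above max_th: mark everything
    # linear ramp from 0 at min_th up toward max_p approaching max_th
    return max_p * (qdepth_kb - min_th) / (max_th - min_th)
```

With these assumed defaults, a queue halfway between the thresholds gets marked with probability max_p / 2 — the early, gentle signal that keeps DCQCN reacting before PFC has to.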

For RoCE v2 + DCQCN, typical tunings:

Parameter   Typical value (400G fabric)
min_th      50–150 KB
max_th      1–2 MB
max_p       0.1–0.2 (10–20% marking probability at max_th)

These values are per-vendor and per-buffer-size; the numbers above are starting points. Vendors publish reference values (NVIDIA Spectrum-X, Arista AI references) that you should read before tuning.


The ECN feedback loop

End-to-end:

  1. Sender NIC sends RoCE v2 packets with ECT(0) set.
  2. Switch at the congested egress marks ECN to CE on a fraction of those packets (per WRED/AQM curve).
  3. Receiver NIC sees the CE-marked packets, generates a CNP (Congestion Notification Packet — a one-packet RDMA control message) and sends it back to the sender.
  4. Sender NIC's DCQCN engine sees the CNP and dials down the sender's rate for that flow / QP.
  5. Sender keeps sending at the lower rate; queue at the switch drains.
  6. If no further CNPs arrive for a window, sender ramps the rate back up.

The whole loop is fully NIC-offloaded — no CPU on either end. CNPs are small (a few dozen bytes) and ride the RoCE v2 path in reverse.
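The six steps can be caricatured as a discrete-time loop. This is a toy sketch with invented constants (one iteration ≈ one control interval), not a model of real NIC behavior:

```python
line_rate, drain = 400.0, 350.0   # Gbps: sender's max rate vs. congested egress capacity
ecn_threshold = 100.0             # queue depth (arbitrary units) where marking begins
rate, queue, alpha = line_rate, 0.0, 0.0
peak_queue = 0.0

for _ in range(200):
    queue = max(0.0, queue + (rate - drain))  # steps 1-2: queue builds at the egress
    peak_queue = max(peak_queue, queue)
    if queue > ecn_threshold:                 # switch marks CE past its threshold
        alpha = 0.9 * alpha + 0.1             # steps 3-4: CNP arrives, cut rate
        rate *= 1 - alpha / 2                 #   multiplicative decrease
    else:                                     # steps 5-6: quiet interval, ramp back up
        alpha *= 0.9
        rate = min(line_rate, rate + 5.0)     #   additive increase
```

In this toy run the rate settles into an oscillation around the 350 Gbps drain rate and the queue stays shallow — the short-queue steady state the loop is designed to produce.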


DCQCN — the algorithm that uses ECN

DCQCN (Data Center Quantized Congestion Notification) is the rate-adjustment algorithm that lives on the sender's NIC. It's the canonical RoCE v2 congestion controller, published at SIGCOMM 2015 by Microsoft Research and now the default in nearly every production RoCE v2 fabric.

Key behaviors:

  • Rate-based, not window-based. Each QP has a target send rate that gets adjusted up and down.
  • Multiplicative decrease on CNP arrival: rate ← rate × (1 - α/2), where α is an exponentially weighted moving average of congestion — pushed up on each CNP, decayed while CNPs are absent.
  • Additive increase when no CNPs arrive: rate ← rate + R_AI per timer interval.
  • Fast recovery — when CNPs stop, ramp up aggressively for the first few intervals to recover bandwidth.
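The behaviors above fit in a small per-QP state machine. This is a simplified sketch of the published algorithm — the class name, constants, and recovery schedule are illustrative, not driver defaults:

```python
class DcqcnSender:
    """Per-QP rate state: multiplicative decrease on CNPs, staged increase otherwise."""

    def __init__(self, rate_gbps: float, g: float = 0.1,
                 rate_ai: float = 0.005, rate_hai: float = 0.05,
                 fast_recovery_th: int = 5):
        self.rate = self.target = rate_gbps
        self.alpha = 1.0          # congestion estimate; DCQCN initializes it high
        self.g = g                # EMA gain for alpha updates
        self.rate_ai = rate_ai    # additive increase per interval (Gbps)
        self.rate_hai = rate_hai  # hyper increase after a long quiet stretch (Gbps)
        self.fast_recovery_th = fast_recovery_th
        self.quiet_intervals = 0

    def on_cnp(self):
        """CNP arrived: remember the current rate as target, then cut multiplicatively."""
        self.alpha = (1 - self.g) * self.alpha + self.g
        self.target = self.rate
        self.rate *= 1 - self.alpha / 2
        self.quiet_intervals = 0

    def on_timer(self):
        """Timer fired with no CNP: decay alpha, close half the gap to the target."""
        self.alpha *= 1 - self.g
        self.quiet_intervals += 1
        if self.quiet_intervals > self.fast_recovery_th:
            # past fast recovery: start pushing the target itself upward
            long_quiet = self.quiet_intervals > 2 * self.fast_recovery_th
            self.target += self.rate_hai if long_quiet else self.rate_ai
        self.rate = (self.rate + self.target) / 2
```

One CNP against a 400 Gbps QP halves the rate (α starts at 1), and each quiet interval then closes half the remaining gap back to the pre-cut target — the fast-recovery shape described above.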

DCQCN's tunables (per-NIC):

Knob                               Typical             What it controls
Kmin / Kmax (echoed thresholds)    match switch WRED   When to start reacting
α update interval                  50–500 μs           How quickly the rate-cut signal averages
Rate AI (additive increase)        5 Mbps              How fast to ramp back up
Rate HAI (hyper-active increase)   50 Mbps             How fast to recover after a long pause
Fast recovery threshold            5 cycles            How long to wait before HAI kicks in

The defaults from the NVIDIA driver work well for most fabrics. Tuning is needed only when you have a non-standard topology, oversubscription pattern, or buffer profile.


How to actually configure

End-to-end ECN setup involves three layers — switch, sender NIC, receiver NIC. The classic mistake is configuring only one of the three.

On the switch:

! Arista example (conceptual)
qos profile lossless
   queue 3 ecn min 51200 max 1024000 probability 0.1
interface Ethernet1/1
   service-policy type qos input lossless

This says: on queue 3 (the RoCE priority), enable WRED marking between 50 KB and 1 MB with max 10% marking probability.

On the sender NIC (Mellanox/NVIDIA — mlnx_qos / sysctl):

mlnx_qos -i eth0 --trust dscp
echo 1 > /sys/class/net/eth0/ecn/roce_np/enable/3
echo 1 > /sys/class/net/eth0/ecn/roce_rp/enable/3

This enables both the NP (Notification Point) and RP (Reaction Point) sides of DCQCN on priority 3.

On the receiver NIC:

Same NP enable as the sender — receivers must be allowed to generate CNPs in response to CE-marked packets.
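A tiny sanity check for the three-layer rule can catch the classic mistake before traffic does. The function and dict keys below are invented for illustration — no vendor exposes this exact API:

```python
def ecn_config_problems(switch: dict, sender: dict, receiver: dict) -> list:
    """Return a list of mismatches across the three layers (empty list == consistent)."""
    problems = []
    if not switch.get("wred_enabled"):
        problems.append("switch: WRED/ECN marking disabled")
    if not sender.get("roce_rp_enabled"):
        problems.append("sender: DCQCN reaction point (RP) disabled")
    if not receiver.get("roce_np_enabled"):
        problems.append("receiver: CNP generation (NP) disabled")
    # DCQCN's echoed thresholds should match the switch WRED thresholds
    if sender.get("kmin_kb") != switch.get("min_th_kb"):
        problems.append("sender Kmin does not match switch min_th")
    return problems
```

Feeding it a config where only the switch was set up immediately surfaces the missing NIC enables.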


What can go wrong

The three most common ECN misconfigurations:

  1. DSCP not preserved across hops. A switch in the middle rewrites the TOS byte, losing the DSCP (and, on some hardware, the ECN bits with it); RoCE traffic lands in the wrong queue and the congestion signal never reaches the receiver. Trust DSCP end to end on every hop.
  2. NIC ECN disabled. Some default NIC images ship with ECN/DCQCN off. The fabric marks but the NIC ignores. Verify with ethtool --show-priv-flags or driver-specific commands.
  3. WRED thresholds too high. Min threshold close to max means ECN never marks at moderate congestion — PFC has to handle it. Tune min_th down (within reason) so DCQCN gets early warnings.

What you should remember

  • ECN marks packets at congested egress using the 2 ECN bits in the IP TOS byte (CE = 11).
  • WRED / AQM is the marking policy — probabilistic between min and max queue thresholds.
  • CNP (Congestion Notification Packet) is how the receiver tells the sender "you marked me, slow down."
  • DCQCN is the NIC-side algorithm that turns CNPs into rate adjustments. Rate-based, additive increase + multiplicative decrease.
  • End-to-end config has three layers — switch WRED, sender NIC, receiver NIC. All three must agree.
  • If PFC is firing, ECN tuning is wrong. PFC is the safety net; ECN should be the primary signal.

Next: DCQCN — Buffer Profiles & Tuning at Scale → — putting it all together with real buffer profiles and field-tested values.