ECN — Explicit Congestion Notification

ECN is the early warning system. Where PFC stops the sender after a buffer fills, ECN tells the sender to slow down before the buffer fills.

Per the standard (RFC 3168), ECN uses two bits in the IP header. The switch marks them on packets traversing congested egress; the receiver echoes back to the sender; the sender adjusts its rate.

Figure: DCQCN feedback loop. The sender NIC (RP, Reaction Point) transmits at a high rate into a switch. The switch queue fills and crosses the ECN threshold, and the switch sets CE on packets. The CE-marked packets reach the receiver NIC (NP, Notification Point), which sends a CNP back to the sender. The sender's DCQCN logic cuts its rate (multiplicative decrease). When the ECN marks stop, the CNPs stop and the sender ramps its rate back up. Caption: ECN fires *before* PFC by design. CNP carries the bad news back. DCQCN turns the bad news into a rate adjustment. Goal: keep queues short enough that PFC almost never has to fire.

This page covers the mechanics and the tuning that matters for AI fabrics.

The ECN bits

In the IP TOS / DSCP byte, the lowest two bits are ECN:

Bits	Code Point	Meaning
00	Not-ECT	Not ECN-capable (legacy / unmarked)
01	ECT(1)	ECN-capable, "ECT one"
10	ECT(0)	ECN-capable, "ECT zero"
11	CE	Congestion Experienced (marked by switch)

The sender sets ECT(0) or ECT(1) when emitting a packet (RoCE v2 NICs do ECT(0) by default). When a congested switch decides to mark, it rewrites those bits to CE (11). The receiver sees CE and signals the sender to slow down.
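To make the bit layout concrete, here is a minimal Python sketch of how the ECN field sits in the TOS byte and what marking does to it. The names `Ecn`, `split_tos`, and `mark_ce` are illustrative and not taken from any library, and DSCP 26 in the usage note is only an example value.

```python
from enum import IntEnum
from typing import Tuple


class Ecn(IntEnum):
    """The four ECN code points from the table above."""
    NOT_ECT = 0b00
    ECT_1 = 0b01
    ECT_0 = 0b10
    CE = 0b11


ECN_MASK = 0b11  # ECN is the lowest two bits of the TOS byte


def split_tos(tos: int) -> Tuple[int, Ecn]:
    """Split the 8-bit TOS byte into (DSCP, ECN code point)."""
    return tos >> 2, Ecn(tos & ECN_MASK)


def mark_ce(tos: int) -> int:
    """Return the TOS byte after a congested switch decides to mark.

    ECN-capable packets (ECT(0) or ECT(1)) become CE. Not-ECT packets
    are returned unchanged, because they cannot carry a mark.
    """
    _, ecn = split_tos(tos)
    if ecn == Ecn.NOT_ECT:
        return tos
    return tos | Ecn.CE
```

For example, a packet sent with DSCP 26 and ECT(0) has `tos = (26 << 2) | Ecn.ECT_0`. After `mark_ce`, `split_tos` still reports DSCP 26, and only the ECN field has changed to `Ecn.CE`.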

The marking is non-destructive — the packet still reaches the destination, just with a flag set. That's the key difference from drop-based congestion signaling.

How the switch marks

Most modern AI switches mark ECN with WRED (Weighted Random Early Detection), a form of AQM (Active Queue Management), controlled by three parameters:

Minimum threshold (min_th) — below this queue depth, no marking.
Maximum threshold (max_th) — at or above this, mark every packet.
Maximum probability (max_p) — between min and max, mark with probability that scales linearly with queue depth.

  marking
probability
    |             ______________
1.0 |            |
    |            |
max_p ........../|
    |          / :
    |         /  :
    |        /   :
0.0 |_______/____:______________  queue depth
         min_th  max_th
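The marking curve can also be sketched as a small function. This is a simplified illustration of the textbook WRED shape driven by instantaneous queue depth; real switch ASICs may average the queue or quantize the curve differently. The function names are hypothetical.

```python
import random


def mark_probability(queue_bytes: float, min_th: float,
                     max_th: float, max_p: float) -> float:
    """Probability of marking a packet at the given queue depth."""
    if queue_bytes < min_th:
        return 0.0   # below min_th: never mark
    if queue_bytes >= max_th:
        return 1.0   # at or above max_th: mark every packet
    # Between the thresholds: scale linearly from 0 up to max_p.
    return max_p * ((queue_bytes - min_th) / (max_th - min_th))


def should_mark(queue_bytes: float, min_th: float, max_th: float,
                max_p: float, rng=random.random) -> bool:
    """Randomly decide whether to mark one packet (rng returns [0, 1))."""
    return rng() < mark_probability(queue_bytes, min_th, max_th, max_p)
```

With `max_p = 0.1`, a queue sitting halfway between the two thresholds marks about 5% of packets, which is enough to generate a steady trickle of CNPs without marking every packet of every flow.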

For RoCE v2 + DCQCN, typical tunings:

Parameter	Typical value (400G fabric)
`min_th`	50–150 KB
`max_th`	1–2 MB
`max_p`	0.1–0.2 (marking probability reaches 10–20% just below max_th)

These values are per-vendor and per-buffer-size; the numbers above are starting points. Vendors publish reference values (NVIDIA Spectrum-X, Arista AI references) that you should read before tuning.
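It can help to read these thresholds as queueing delay rather than bytes. The sketch below converts a queue depth into the time a port needs to drain it, assuming decimal units (1 KB = 1000 bytes) and a port that drains at full line rate. `drain_time_us` is an illustrative helper, not a vendor tool.

```python
def drain_time_us(queue_bytes: float, link_gbps: float) -> float:
    """Microseconds needed to drain queue_bytes at the given line rate.

    1 Gb/s moves 1000 bits per microsecond, so the result is
    bits / (Gb/s * 1000).
    """
    return queue_bytes * 8 / (link_gbps * 1000.0)
```

On a 400 Gb/s port, a 50 KB `min_th` corresponds to roughly 1 μs of queue and a 1 MB `max_th` to roughly 20 μs. Marking therefore starts when the queue is adding about a microsecond of delay, well before the buffer is under real pressure.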

The ECN feedback loop

End-to-end:

1. Sender NIC sends RoCE v2 packets with ECT(0) set.
2. Switch at the congested egress marks ECN to CE on a fraction of those packets (per WRED/AQM curve).
3. Receiver NIC sees the CE-marked packets, generates a CNP (Congestion Notification Packet, a one-packet RDMA control message) and sends it back to the sender.
4. Sender NIC's DCQCN engine sees the CNP and dials down the sender's rate for that flow / QP.
5. Sender keeps sending at the lower rate; queue at the switch drains.
6. If no further CNPs arrive for a window, sender ramps the rate back up.

The whole loop is fully NIC-offloaded — no CPU on either end. CNPs are small (a few dozen bytes) and ride the RoCE v2 path in reverse.

DCQCN — the algorithm that uses ECN

DCQCN (Data Center Quantized Congestion Notification) is the rate-adjustment algorithm that lives on the sender's NIC. It's the canonical RoCE v2 congestion controller, published at SIGCOMM 2015 by Microsoft Research and now the default in nearly every production RoCE v2 fabric.

Key behaviors:

Rate-based, not window-based. Each QP has a target send rate that gets adjusted up and down.
Multiplicative decrease on CNP arrival: rate ← rate × (1 - α/2), where α is an EMA of CNP signal strength.
Additive increase when no CNPs arrive: rate ← rate + R_AI per timer interval.
Fast recovery — when CNPs stop, ramp up aggressively for the first few intervals to recover bandwidth.
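The behaviors above can be sketched as a small per-QP state machine. This is a simplified illustration, not NIC firmware: it uses a single timer for both the α decay and the rate increase, omits the byte counter and the hyper increase stage, and the class and method names are hypothetical. The gain `g = 1/256` and the five fast-recovery steps are illustrative defaults, and the target-rate bookkeeping follows the usual description of DCQCN rather than anything stated on this page.

```python
from dataclasses import dataclass, field


@dataclass
class DcqcnFlow:
    line_rate_mbps: float
    rate_ai_mbps: float = 5.0      # additive increase step (R_AI)
    g: float = 1 / 256             # EMA gain for alpha (assumed value)
    fast_recovery_steps: int = 5   # quiet intervals before additive increase
    alpha: float = 1.0             # congestion estimate, starts pessimistic
    current_mbps: float = field(init=False, default=0.0)
    target_mbps: float = field(init=False, default=0.0)
    quiet_intervals: int = field(init=False, default=0)

    def __post_init__(self) -> None:
        # A new flow starts at line rate.
        self.current_mbps = self.line_rate_mbps
        self.target_mbps = self.line_rate_mbps

    def on_cnp(self) -> None:
        """Multiplicative decrease: rate <- rate * (1 - alpha / 2)."""
        self.target_mbps = self.current_mbps   # remember the pre-cut rate
        self.current_mbps *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g
        self.quiet_intervals = 0

    def on_quiet_interval(self) -> None:
        """Call once per timer interval in which no CNP arrived."""
        self.alpha *= 1 - self.g               # congestion estimate decays
        self.quiet_intervals += 1
        if self.quiet_intervals > self.fast_recovery_steps:
            # Additive increase: raise the target by R_AI, capped at line rate.
            self.target_mbps = min(self.target_mbps + self.rate_ai_mbps,
                                   self.line_rate_mbps)
        # Fast recovery and additive increase both move halfway to the target.
        self.current_mbps = (self.current_mbps + self.target_mbps) / 2
```

Starting at 400 Gb/s with α = 1, the first CNP halves the rate to 200 Gb/s. Each quiet interval then closes half of the remaining gap to the pre-cut rate, which is why recovery is quick at first and slower once the flow is back near its old rate.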

DCQCN's tunables (per-NIC):

Knob	Typical	What it controls
`Kmin / Kmax` (switch-side marking thresholds)	set on the switch (WRED min_th / max_th)	Queue depth at which marking, and so the NIC's reaction, begins
`α update interval`	50–500 μs	How quickly the rate-cut signal averages
`Rate AI` (additive increase)	5 Mbps	How fast to ramp back up
`Rate HAI` (hyper-additive increase)	50 Mbps	How fast to ramp once the flow has gone a long time without CNPs
`Fast recovery threshold`	5 cycles	How long to wait before HAI kicks in

The defaults from the NVIDIA driver work well for most fabrics. Tuning is needed only when you have a non-standard topology, oversubscription pattern, or buffer profile.

How to actually configure

End-to-end ECN setup involves three layers — switch, sender NIC, receiver NIC. The classic mistake is configuring only one of the three.

On the switch:

! Arista example (conceptual)
qos profile lossless
   queue 3 ecn min 51200 max 1024000 probability 0.1
interface Ethernet1/1
   service-policy type qos input lossless

This says: on queue 3 (the RoCE priority), enable WRED marking between 50 KB and 1 MB with max 10% marking probability.

On the sender NIC (Mellanox/NVIDIA, via mlnx_qos and sysfs):

mlnx_qos -i eth0 --trust dscp
echo 1 > /sys/class/net/eth0/ecn/roce_np/enable/3
echo 1 > /sys/class/net/eth0/ecn/roce_rp/enable/3

This enables both the NP (Notification Point) and RP (Reaction Point) sides of DCQCN on priority 3.

On the receiver NIC:

Same NP enable as the sender — receivers must be allowed to generate CNPs in response to CE-marked packets.
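Because the NP and RP switches have to be on at both ends, it is worth checking them on every host rather than trusting the image. The sketch below reads the same sysfs knobs used in the commands above. It assumes the driver exposes them at those paths and that an enabled knob reads back as `1`; `dcqcn_enabled` is a hypothetical helper, and the `sysfs_root` parameter exists only so the function can be pointed at a test directory.

```python
from pathlib import Path
from typing import Dict, Optional


def dcqcn_enabled(iface: str, priority: int,
                  sysfs_root: str = "/sys/class/net") -> Dict[str, Optional[bool]]:
    """Report whether the NP and RP sides are enabled for one priority.

    Each value is True or False when the knob is readable, and None
    when it is missing or unreadable (for example, a different driver).
    """
    base = Path(sysfs_root) / iface / "ecn"
    status: Dict[str, Optional[bool]] = {}
    for role in ("roce_np", "roce_rp"):
        knob = base / role / "enable" / str(priority)
        try:
            status[role] = knob.read_text().strip() == "1"
        except OSError:
            status[role] = None
    return status
```

Calling `dcqcn_enabled("eth0", 3)` on each host and flagging any result that is not `True` for both roles catches the case where only one side of the loop was configured.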

What can go wrong

The three most common ECN misconfigurations:

DSCP not preserved across hops. If a switch in the middle rewrites the TOS byte, clearing DSCP and resetting the ECN bits along with it, the CE mark never reaches the receiver. Always trust DSCP end to end.
NIC ECN disabled. Some default NIC images ship with ECN/DCQCN off, so the fabric marks packets but the NIC ignores the marks. Verify with ethtool --show-priv-flags or driver-specific commands.
WRED thresholds too high. If the min threshold sits close to the max, ECN never marks at moderate congestion and PFC has to handle it. Tune min_th down (within reason) so DCQCN gets early warnings.

What you should remember

ECN marks packets at congested egress using the 2 ECN bits in the IP TOS byte (CE = 11).
WRED / AQM is the marking policy — probabilistic between min and max queue thresholds.
CNP (Congestion Notification Packet) is how the receiver tells the sender "you marked me, slow down."
DCQCN is the NIC-side algorithm that turns CNPs into rate adjustments. Rate-based, additive increase + multiplicative decrease.
End-to-end config has three layers — switch WRED, sender NIC, receiver NIC. All three must agree.
If PFC is firing regularly, ECN tuning is probably wrong. PFC is the safety net; ECN should be the primary signal.

Next: DCQCN — Buffer Profiles & Tuning at Scale → — putting it all together with real buffer profiles and field-tested values.
