ECN — Explicit Congestion Notification
ECN is the early warning system. Where PFC stops the sender after a buffer fills, ECN tells the sender to slow down before the buffer fills.
Per the standard (RFC 3168), ECN uses two bits in the IP header. The switch marks them on packets traversing congested egress; the receiver echoes back to the sender; the sender adjusts its rate.
This page covers the mechanics and the tuning that matters for AI fabrics.
The ECN bits
In the IP TOS / DSCP byte, the lowest two bits are ECN:
| Bits | Code Point | Meaning |
|---|---|---|
| 00 | Not-ECT | Not ECN-capable (legacy / unmarked) |
| 01 | ECT(1) | ECN-capable, "ECT one" |
| 10 | ECT(0) | ECN-capable, "ECT zero" |
| 11 | CE | Congestion Experienced (marked by switch) |
The sender sets ECT(0) or ECT(1) when emitting a packet (RoCE v2 NICs use ECT(0) by default). When a congested switch decides to mark, it rewrites those bits to CE (11). The receiver sees CE and signals the sender to slow down.
The marking is non-destructive — the packet still reaches the destination, just with a flag set. That's the key difference from drop-based congestion signaling.
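As a minimal sketch of the bit layout in the table above, the following Python splits a TOS byte into DSCP and ECN and applies a CE mark. The helper names (`split_tos`, `mark_ce`) and the DSCP value 26 are illustrative assumptions, not part of any NIC or switch API.

```python
# ECN codepoints from RFC 3168: the low two bits of the IP TOS byte.
NOT_ECT, ECT1, ECT0, CE = 0b00, 0b01, 0b10, 0b11
ECN_MASK = 0b11


def split_tos(tos: int) -> tuple[int, int]:
    """Return (dscp, ecn) from an 8-bit TOS value."""
    return tos >> 2, tos & ECN_MASK


def mark_ce(tos: int) -> int:
    """Set CE on an ECN-capable packet; leave Not-ECT packets untouched."""
    if tos & ECN_MASK == NOT_ECT:
        # Not ECN-capable: marking is not allowed (an AQM would typically drop instead).
        return tos
    return tos | CE


tos = (26 << 2) | ECT0            # DSCP 26 is an arbitrary example value
print(split_tos(tos))             # (26, 2)
print(split_tos(mark_ce(tos)))    # (26, 3)
```

Note that the DSCP bits are unchanged by the mark: only the two ECN bits are rewritten.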
How the switch marks
Most modern AI switches mark ECN using WRED (Weighted Random Early Detection), a form of AQM (Active Queue Management), driven by three parameters:
- Minimum threshold (min_th): below this queue depth, no marking.
- Maximum threshold (max_th): at or above this, mark every packet.
- Maximum probability (max_p): between min_th and max_th, mark with a probability that scales linearly with queue depth, reaching max_p at max_th.
Figure: marking probability versus queue depth. The probability is 0.0 below min_th, rises linearly to max_p at max_th, then jumps to 1.0 at and above max_th.
For RoCE v2 + DCQCN, typical tunings:
| Parameter | Typical value (400G fabric) |
|---|---|
| min_th | 50–150 KB |
| max_th | 1–2 MB |
| max_p | 0.1–0.2 (10–20% marking probability as the queue approaches max_th) |
These values are per-vendor and per-buffer-size; the numbers above are starting points. Vendors publish reference values (NVIDIA Spectrum-X, Arista AI references) that you should read before tuning.
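The marking curve described above can be sketched in a few lines of Python. This is an illustration of the min_th / max_th / max_p rule as this page states it, not any vendor's implementation; the function names and the example thresholds (100 KB, 1 MB, 0.1) are assumptions picked from the starting-point ranges in the table.

```python
import random


def mark_probability(queue_bytes: int, min_th: int, max_th: int, max_p: float) -> float:
    """WRED-style ECN marking probability for a given queue depth."""
    if queue_bytes < min_th:
        return 0.0                      # below min_th: never mark
    if queue_bytes >= max_th:
        return 1.0                      # at or above max_th: mark every packet
    # Linear ramp from 0 at min_th up to max_p at max_th.
    return max_p * (queue_bytes - min_th) / (max_th - min_th)


def should_mark(queue_bytes: int, min_th: int, max_th: int, max_p: float,
                rng=random.random) -> bool:
    """Draw one marking decision for a packet arriving at this queue depth."""
    return rng() < mark_probability(queue_bytes, min_th, max_th, max_p)


# Example thresholds (assumed values, not a recommendation).
MIN_TH, MAX_TH, MAX_P = 100_000, 1_000_000, 0.1
for depth in (50_000, 550_000, 1_000_000):
    print(depth, f"{mark_probability(depth, MIN_TH, MAX_TH, MAX_P):.3f}")
# 50000 0.000
# 550000 0.050
# 1000000 1.000
```

Halfway between the thresholds the probability is half of max_p, which is why a low max_p keeps marking gentle until the queue is nearly at max_th.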
The ECN feedback loop
End-to-end:
- Sender NIC sends RoCE v2 packets with ECT(0) set.
- Switch at the congested egress rewrites the ECN bits to CE on a fraction of those packets (per the WRED/AQM curve).
- Receiver NIC sees the CE-marked packets, generates a CNP (Congestion Notification Packet — a one-packet RDMA control message) and sends it back to the sender.
- Sender NIC's DCQCN engine sees the CNP and dials down the sender's rate for that flow / QP.
- Sender keeps sending at the lower rate; queue at the switch drains.
- If no further CNPs arrive for a window, sender ramps the rate back up.
The whole loop is fully NIC-offloaded — no CPU on either end. CNPs are small (a few dozen bytes) and ride the RoCE v2 path in reverse.
DCQCN — the algorithm that uses ECN
DCQCN (Data Center Quantized Congestion Notification) is the rate-adjustment algorithm that lives on the sender's NIC. It's the canonical RoCE v2 congestion controller, published at SIGCOMM 2015 by Microsoft Research and now the default in nearly every production RoCE v2 fabric.
Key behaviors:
- Rate-based, not window-based. Each QP has a target send rate that gets adjusted up and down.
- Multiplicative decrease on CNP arrival: rate ← rate × (1 - α/2), where α is an EMA of CNP signal strength.
- Additive increase when no CNPs arrive: rate ← rate + R_AI per timer interval.
- Fast recovery: when CNPs stop, ramp up aggressively for the first few intervals to recover bandwidth.
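The behaviors above can be sketched as a small per-QP state machine. This is a simplified model, not the NIC firmware: it assumes a single timer for both the α decay and the rate increase, omits the byte counter and the hyper-increase stage, and uses g = 1/256 for the EMA gain as in the DCQCN paper. The class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DcqcnFlow:
    """Toy DCQCN reaction point for one QP. All rates are in Mbps."""
    line_rate: float
    rate: float                    # current send rate
    target: float                  # rate to recover toward after a cut
    alpha: float = 1.0             # EMA of CNP signal strength
    g: float = 1 / 256             # EMA gain (assumed, from the paper)
    rate_ai: float = 5.0           # additive increase step
    fast_recovery_steps: int = 5
    quiet_intervals: int = 0       # timer ticks since the last CNP

    def on_cnp(self) -> None:
        """Multiplicative decrease: remember the old rate, then cut."""
        self.target = self.rate
        self.rate *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g
        self.quiet_intervals = 0

    def on_quiet_interval(self) -> None:
        """One timer tick with no CNP: decay alpha and recover rate."""
        self.alpha *= 1 - self.g
        self.quiet_intervals += 1
        if self.quiet_intervals > self.fast_recovery_steps:
            # Past fast recovery: additive increase of the target.
            self.target = min(self.target + self.rate_ai, self.line_rate)
        # Move halfway toward the target each tick.
        self.rate = (self.rate + self.target) / 2


flow = DcqcnFlow(line_rate=400_000, rate=400_000, target=400_000)
flow.on_cnp()
print(flow.rate)                   # 200000.0
for _ in range(5):
    flow.on_quiet_interval()
    print(flow.rate)               # 300000.0, 350000.0, 375000.0, 387500.0, 393750.0
```

With α at its initial value of 1, the first CNP halves the rate; fast recovery then closes half of the remaining gap to the pre-cut rate on each quiet tick.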
DCQCN's tunables (per-NIC):
| Knob | Typical | What it controls |
|---|---|---|
| Kmin / Kmax (echoed thresholds) | match switch WRED | When to start reacting |
| α update interval | 50–500 μs | How quickly the rate-cut signal averages |
| Rate AI (additive increase) | 5 Mbps | How fast to ramp back up |
| Rate HAI (hyper increase) | 50 Mbps | How fast to recover after a long pause |
| Fast recovery threshold | 5 cycles | How long to wait before HAI kicks in |
The defaults from the NVIDIA driver work well for most fabrics. Tuning is needed only when you have a non-standard topology, oversubscription pattern, or buffer profile.
How to actually configure
End-to-end ECN setup involves three layers — switch, sender NIC, receiver NIC. The classic mistake is configuring only one of the three.
On the switch:
! Arista example (conceptual)
qos profile lossless
queue 3 ecn min 51200 max 1024000 probability 0.1
interface Ethernet1/1
service-policy type qos input lossless
This says: on queue 3 (the RoCE priority), enable WRED marking between 50 KB and 1 MB with max 10% marking probability.
On the sender NIC (Mellanox/NVIDIA, using mlnx_qos and sysfs):
mlnx_qos -i eth0 --trust dscp
echo 1 > /sys/class/net/eth0/ecn/roce_np/enable/3
echo 1 > /sys/class/net/eth0/ecn/roce_rp/enable/3
This enables both the NP (Notification Point) and RP (Reaction Point) sides of DCQCN on priority 3.
On the receiver NIC:
Same NP enable as the sender — receivers must be allowed to generate CNPs in response to CE-marked packets.
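Because the classic mistake is configuring only one side, a quick check of both enables on every host can help. The sketch below reads the same sysfs paths used in the commands above; the function name is hypothetical, and it assumes the files contain `1` when enabled, which may differ by driver version.

```python
from pathlib import Path
from typing import Optional


def dcqcn_status(iface: str, priority: int,
                 sysfs_root: str = "/sys/class/net") -> dict[str, Optional[bool]]:
    """Report whether the NP and RP sides are enabled for one priority.

    Returns True/False per role, or None if the file is missing or unreadable
    (for example, a driver that does not expose these paths).
    """
    base = Path(sysfs_root) / iface / "ecn"
    status: dict[str, Optional[bool]] = {}
    for role in ("roce_np", "roce_rp"):
        path = base / role / "enable" / str(priority)
        try:
            # Assumption: the file holds "1" when the role is enabled.
            status[role] = path.read_text().strip() == "1"
        except OSError:
            status[role] = None
    return status


if __name__ == "__main__":
    # eth0 and priority 3 mirror the example commands above.
    print(dcqcn_status("eth0", 3))
```

Running this on both the sender and the receiver makes a missing NP enable on the receiver easy to spot.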
What can go wrong
The three most common ECN misconfigurations:
- DSCP not preserved across hops. A switch in the middle clears DSCP and resets the ECN bits, so the marking never reaches the receiver. Always trust DSCP end to end.
- NIC ECN disabled. Some default NIC images ship with ECN/DCQCN off. The fabric marks but the NIC ignores. Verify with ethtool --show-priv-flags or driver-specific commands.
- WRED thresholds too high. Min threshold close to max means ECN never marks at moderate congestion, so PFC has to handle it. Tune min_th down (within reason) so DCQCN gets early warnings.
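The third misconfiguration can be caught with a simple sanity check before a profile is pushed to a switch. The sketch below is a heuristic only: the function name is hypothetical, and the `close_ratio` cutoff of 0.5 is an assumed value for "min threshold close to max", not a vendor recommendation.

```python
def lint_wred(min_th: int, max_th: int, max_p: float,
              close_ratio: float = 0.5) -> list[str]:
    """Return a list of warnings for a WRED/ECN profile (empty if none)."""
    problems = []
    if not 0 < max_p <= 1:
        problems.append("max_p must be in (0, 1]")
    if min_th >= max_th:
        problems.append("min_th must be below max_th")
    elif min_th > close_ratio * max_th:
        # Assumed heuristic: a narrow marking band gives DCQCN little warning.
        problems.append("min_th is close to max_th; marking starts late")
    return problems


print(lint_wred(51_200, 1_024_000, 0.1))      # []
print(lint_wred(900_000, 1_000_000, 0.1))     # ['min_th is close to max_th; marking starts late']
```

The first call uses the thresholds from the switch example on this page; the second shows a profile where marking would begin only when the queue is nearly full.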
What you should remember
- ECN marks packets at congested egress using the 2 ECN bits in the IP TOS byte (CE = 11).
- WRED / AQM is the marking policy: probabilistic between min and max queue thresholds.
- CNP (Congestion Notification Packet) is how the receiver tells the sender "your packets were marked, slow down."
- DCQCN is the NIC-side algorithm that turns CNPs into rate adjustments. Rate-based, additive increase + multiplicative decrease.
- End-to-end config has three layers — switch WRED, sender NIC, receiver NIC. All three must agree.
- If PFC fires routinely, ECN tuning is probably wrong. PFC is the safety net; ECN should be the primary signal.