PFC + ECN + DCQCN — the Lossless Trick
You know that RoCE v2 is IB's transport on Ethernet. The catch: Ethernet wasn't designed to be lossless. TCP exists because Ethernet drops, and TCP retransmits. RDMA's go-back-N retransmit logic is far cruder than TCP's — a single drop on a RoCE v2 fabric can kill throughput.
So you can't run RDMA on raw Ethernet. You need to make Ethernet lossless enough that drops are extremely rare. That's what PFC, ECN, and DCQCN do together — three mechanisms that combine to approximate InfiniBand's credit-based guarantee.
Worth saying upfront: this is the trickiest part of operating a RoCE v2 fabric. Tuning the three knobs correctly is the difference between a fabric that hits line rate and one that PFC-storms itself into a corner.
The three layers
| Mechanism | Layer | When it fires | What it does |
|---|---|---|---|
| ECN (Explicit Congestion Notification) | IP (L3) | Queue depth crosses an early threshold | Marks packets with the CE bit; receiver sends a CNP back; sender slows down. |
| DCQCN (Data Center QCN) | NIC (algorithm) | Sender receives CNPs | Reduces transmit rate; ramps back up when CNPs stop. |
| PFC (Priority Flow Control) | Ethernet (L2) | Queue depth crosses a late (XOFF) threshold | Sends a PAUSE frame upstream; sender stops on that priority class. |
The intended order: ECN fires first (mild signal), DCQCN brings rates down, PFC is the last-resort safety net so the buffer doesn't actually overflow.
If ECN + DCQCN are tuned right, PFC almost never fires. PFC firing in production is a yellow light — your fabric is closer to dropping than you wanted.
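The intended escalation order can be sketched in a few lines. This is a hedged illustration, not switch firmware: the threshold values are made up for the example, and real silicon evaluates these per queue in hardware.

```python
# Illustrative sketch of the three-layer escalation for one switch queue.
# Thresholds are example numbers, not vendor defaults.

ECN_THRESHOLD = 0.25   # early: mark packets (fraction of buffer depth)
XOFF_THRESHOLD = 0.80  # late: PFC PAUSE as a last-resort safety net

def congestion_actions(queue_fill: float) -> list[str]:
    """Return which mechanisms fire at a given buffer fill level (0.0-1.0)."""
    actions = []
    if queue_fill >= ECN_THRESHOLD:
        actions.append("ECN: mark CE -> receiver emits CNP -> DCQCN slows sender")
    if queue_fill >= XOFF_THRESHOLD:
        actions.append("PFC: send PAUSE upstream on the RoCE priority class")
    return actions

print(congestion_actions(0.10))  # healthy: nothing fires
print(congestion_actions(0.40))  # ECN only: DCQCN has time to react
print(congestion_actions(0.90))  # both: DCQCN didn't slow the sender in time
```

Note the ordering baked into the constants: ECN fires well before XOFF, so in a well-tuned fabric the second branch almost never executes.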
1. PFC — the last-resort backpressure
PFC (IEEE 802.1Qbb) lets a switch send a PAUSE frame upstream telling the sender to stop transmitting on a specific priority class. The sender stops for the advertised pause time, expressed in quanta of 512 bit-times (in practice a few hundred microseconds), then resumes.
The mechanics:
- Each port has per-priority buffers. RoCE v2 traffic typically lands on priority 3 (DSCP 26 → TC3).
- When the buffer for that priority crosses the XOFF watermark, the switch sends a PFC PAUSE frame upstream.
- The upstream sender stops transmitting for that priority only — other traffic classes keep flowing.
- When the buffer drains below the XON watermark, the switch sends a PAUSE-zero (resume) frame.
- The sender resumes.
The catch — headroom:
Between the moment the switch decides "I need to pause" and the moment the upstream sender actually stops, a bunch of bytes are already in the air. At 400 Gbps over a 100m cable, you can have ~50 KB in flight. That's data the switch has to absorb anyway. If the buffer doesn't have headroom reserved for those bytes, they get dropped despite the PAUSE. So PFC needs careful sizing: headroom ≥ pause_response_time × line_rate.
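The headroom formula above is easy to sanity-check numerically. A minimal sketch, assuming ~5 ns/m signal propagation and ignoring the extra terms a real sizing must include (the MTU-sized frame already serializing, switch pipeline latency, the peer NIC's PFC response time):

```python
# Back-of-the-envelope PFC headroom sizing:
# headroom >= pause_response_time x line_rate.
# Here pause_response_time is approximated by the cable's round-trip
# propagation delay only; real configs add NIC/switch response latency.

def pfc_headroom_bytes(line_rate_gbps: float, cable_m: float,
                       prop_ns_per_m: float = 5.0) -> float:
    """Minimum bytes the switch must absorb after deciding to pause."""
    rtt_s = 2 * cable_m * prop_ns_per_m * 1e-9   # round-trip propagation
    return line_rate_gbps * 1e9 / 8 * rtt_s      # bytes still in flight

# 400 Gbps over a 100 m cable -> ~50 KB in flight, as in the text.
print(f"{pfc_headroom_bytes(400, 100) / 1024:.1f} KiB")
```

Doubling the cable length or the line rate doubles the headroom you must carve out of the shared buffer, which is why long-cable, high-speed fabrics are the ones that get bitten by undersized headroom.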
The really nasty catch — PFC storms:
PAUSE frames are received by switches upstream and propagate further back. If a slow consumer causes its leaf to pause its spine, the spine then pauses its other leaves, and so on. A single misbehaving GPU can backpressure the entire fabric — a PFC storm. Symptoms: training step time goes from 200ms to 8 seconds, monitoring shows pause counts climbing on every switch. The fix is usually to isolate the slow consumer; the diagnosis is non-trivial.
2. ECN — the warning that fires earlier
ECN (RFC 3168) is a 2-bit field in the IP header. Switches mark the CE (Congestion Experienced) bit on packets when they detect congestion — before the queue is full enough to drop packets or trigger PFC.
The interesting trick: ECN doesn't drop packets, it just marks them. The bytes still flow. But the receiver sees the mark and sends a signal back — that's the CNP.
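Concretely, "marking" is a two-bit rewrite in the IP TOS/Traffic Class byte: the sender signals ECN capability with an ECT codepoint, and a congested switch flips it to CE. A small sketch of the bit manipulation (the DSCP 26 value follows the RoCE convention mentioned earlier; the function name is ours):

```python
# ECN codepoints live in the two low-order bits of the IP TOS byte (RFC 3168).
# A switch "marks" a packet by rewriting ECT -> CE; nothing is dropped.

NOT_ECT, ECT_1, ECT_0, CE = 0b00, 0b01, 0b10, 0b11

def mark_ce(tos: int) -> int:
    """Set the CE codepoint, preserving the DSCP bits (upper 6)."""
    if tos & 0b11 == NOT_ECT:
        # Sender never opted into ECN; a real switch would have to drop
        # (or PFC) instead of marking.
        raise ValueError("packet is not ECN-capable; cannot mark")
    return (tos & ~0b11) | CE

# RoCE v2 on DSCP 26 with ECT(0): TOS = (26 << 2) | ECT_0
tos = (26 << 2) | ECT_0
marked = mark_ce(tos)
print(hex(tos), "->", hex(marked))  # DSCP unchanged, ECN bits now 11 (CE)
```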
Why ECN is the smart signal and PFC is the dumb one:
| | ECN | PFC |
|---|---|---|
| Granularity | Per-flow (per QP/PSN) | Per-priority class on a whole port |
| Reaction time | One RTT (CNP round trip) | Microseconds |
| Side effects | None (it's just a bit) | Stops all traffic on that priority — even unrelated flows |
| Risk | None | Storms, head-of-line blocking, deadlock |
| Tuning surface | One watermark per queue | Watermarks + headroom + pause-time + buffer carving |
ECN is gentle and surgical. PFC is a sledgehammer. You want the sledgehammer in reserve.
3. DCQCN — the closed-loop glue
ECN gives the network a way to mark packets. CNPs give the receiver a way to notify the sender. But there has to be an algorithm that decides what to do with those signals — how much to slow down, when to speed back up. That's DCQCN.
The algorithm at high level:
- Sender NIC = RP (Reaction Point). Receives CNPs. Adjusts rate per QP.
- Receiver NIC = NP (Notification Point). Sees ECN-marked packets. Emits CNPs back (usually rate-limited — one CNP per RTT typically, not one per marked packet).
- On CNP: RP does a multiplicative decrease — the cut is scaled by an α parameter, sometimes 50% or smaller per CNP. Similar shape to TCP CUBIC's reaction to loss.
- No CNP for K cycles: RP does an additive increase, gradually ramping rate back up.
- No CNP for longer: hyper-increase — aggressive ramp to recover faster.
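The reaction-point behavior above can be sketched as a tiny rate state machine. This is a hedged illustration, not the paper's algorithm: the α update gain `g`, the timer period `K`, and the step sizes are placeholder values, and the real DCQCN recovery phases (fast recovery, additive increase, hyper increase) are simplified to match the three bullets.

```python
# Simplified DCQCN reaction point (sender NIC side), per QP.
# Constants are illustrative, not the SIGCOMM 2015 defaults.

class ReactionPoint:
    def __init__(self, line_rate_gbps: float):
        self.rate = line_rate_gbps   # current sending rate
        self.alpha = 1.0             # congestion estimate in [0, 1]
        self.g = 1 / 16              # alpha update gain
        self.quiet = 0               # timer cycles since the last CNP

    def on_cnp(self):
        """Multiplicative decrease: cut rate by alpha/2 (at most 50% per CNP)."""
        self.rate *= (1 - self.alpha / 2)
        self.alpha = (1 - self.g) * self.alpha + self.g  # push alpha toward 1
        self.quiet = 0

    def on_timer(self, K: int = 5, add_gbps: float = 5.0,
                 hyper_gbps: float = 50.0):
        """No CNP this cycle: decay alpha, then ramp back up."""
        self.alpha *= (1 - self.g)   # congestion estimate fades
        self.quiet += 1
        if self.quiet >= 2 * K:
            self.rate += hyper_gbps  # quiet for a long time: hyper-increase
        elif self.quiet >= K:
            self.rate += add_gbps    # quiet for K cycles: additive increase

rp = ReactionPoint(400.0)
rp.on_cnp()
print(f"after one CNP: {rp.rate:.0f} Gbps")  # alpha=1 -> 50% cut -> 200
```

The key property to notice: decreases are per-CNP and multiplicative (fast to back off), while increases are timer-driven and additive (slow to recover) — the same asymmetry every congestion-control loop relies on for stability.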
Microsoft published the original DCQCN paper at SIGCOMM 2015. Most production RoCE v2 fabrics in 2026 still use a DCQCN-family algorithm with vendor-specific parameter tuning.
The watermark stack that makes it work
The make-or-break tuning decision: where to put the ECN threshold relative to the PFC threshold.
```
buffer fill ─────────────────────────────────────────►
│ max
│
│ XOFF ──→ PFC PAUSE fires (late)
│
│ ECN  ──→ packets get marked (early)
│
0
```
- ECN threshold should be substantially below the PFC threshold
- ECN gives DCQCN time to react and slow the sender before the queue fills
- PFC is only there to catch the case where DCQCN didn't react fast enough (bursty traffic, sudden topology change, etc.)
If ECN threshold is too high → ECN doesn't fire early enough → queue fills → PFC fires → storms.
If ECN threshold is too low → DCQCN keeps backing off unnecessarily → throughput suffers, training step time goes up.
Most production fabrics target the ECN watermark around 20–30% of buffer depth and PFC XOFF around 80%, but exact numbers depend on switch silicon, queue size, RTT, and workload.
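Those ordering rules are mechanical enough to lint. A minimal sketch, assuming the 20–30% / ~80% targets from the text and treating headroom as carved from the same buffer (a simplification — real switches carve shared buffer per port and priority):

```python
# Sanity-check a watermark configuration against the ordering the text
# describes. Thresholds and the example numbers are illustrative.

def check_watermarks(buffer_kib: int, ecn_kib: int, xoff_kib: int,
                     headroom_kib: int) -> list[str]:
    problems = []
    if not ecn_kib < xoff_kib:
        problems.append("ECN must fire before PFC (ecn < xoff)")
    if ecn_kib > 0.30 * buffer_kib:
        problems.append("ECN threshold high: queue may hit XOFF before DCQCN reacts")
    if xoff_kib + headroom_kib > buffer_kib:
        problems.append("not enough headroom above XOFF: drops despite PAUSE")
    return problems

# 12 MiB buffer, ECN at ~25%, XOFF at ~80%, 50 KiB headroom
print(check_watermarks(12288, 3072, 9830, 50))  # -> [] (no problems)
```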
Confirm RoCE v2 is actually on the wire
Theory aside, verify on a live host — with ibv_devinfo, tcpdump 'udp port 4791', and ethtool — that RoCE v2 traffic is actually flowing and DCQCN is doing its job.
What to look for: a GID table entry (e.g. GID[3]) showing RoCE v2 derived from IPv4, tcpdump capturing line-rate packets on UDP 4791, and NIC counters showing CNPs (DCQCN actively rate-limiting) and zero adp_retrans (no drops).
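The counter check lends itself to automation. A hedged sketch of the triage logic: the counter names below follow common RoCE NIC conventions (CNP send/handle counters, an adaptive-retransmission counter), but exact names vary by vendor and driver — treat them as assumptions to adapt to your `ethtool -S` output.

```python
# Triage NIC congestion counters in the spirit of the ethtool check above.
# Counter names are assumptions; map them to your NIC's actual names.

def diagnose(counters: dict[str, int]) -> str:
    cnps = counters.get("np_cnp_sent", 0) + counters.get("rp_cnp_handled", 0)
    retrans = counters.get("roce_adp_retrans", 0)
    if retrans > 0:
        # Packets were lost despite PFC: suspect headroom or drop config.
        return "drops on the fabric: check PFC headroom and switch drop counters"
    if cnps > 0:
        # ECN marks are flowing and DCQCN is adjusting rates: healthy under load.
        return "DCQCN active: ECN marking works, rates being adjusted"
    return "no congestion signals: either idle or ECN marking is disabled"

sample = {"np_cnp_sent": 120843, "rp_cnp_handled": 98211, "roce_adp_retrans": 0}
print(diagnose(sample))
```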
Failure modes operators actually hit
| Symptom | Likely cause | First thing to check |
|---|---|---|
| Training step time spikes from 200ms to seconds | PFC storm — slow consumer backpressuring the fabric | PFC counters on every switch in the path; find the link with the highest pause count |
| Throughput dropped from 380 Gbps to 200 Gbps, no PFC pauses visible | ECN over-tuned (firing too aggressively) → DCQCN too conservative | ECN counter rate, CNP rate at NICs; compare to baseline |
| Random IBV_WC_RETRY_EXC_ERR in NCCL logs | Buffer overflowed despite PFC — headroom too small for cable/RTT | Switch drop counters on the RoCE priority; check headroom config |
| Half of GPUs in a job are slow | Hash polarization on ECMP (not a CC issue, but presents like one) | ECMP load distribution per leaf-spine link |
| Training hangs mid-step | PFC deadlock — cyclic dependency between buffers | Look for "victim" flow patterns; consider buffer reorganization |
What you should remember
- Three layers, one job: keep Ethernet lossless enough for RDMA.
- PFC = sledgehammer. Per-priority pause frames. Last-resort safety net. Firing often = problem.
- ECN = scalpel. Marks packets, doesn't drop. Triggers DCQCN. Gentler, smarter signal.
- DCQCN = the closed-loop algorithm that turns CNPs into rate adjustments. NIC-side hardware.
- Tuning the ECN watermark below the PFC watermark is the make-or-break operational decision.
- PFC storms are the #1 failure mode at scale — a single slow consumer can wedge a whole fabric.
- ECN over-tuning is the #2 — throughput dies quietly with no error.
Next: How a RoCE v2 Transaction Actually Flows → connect the dots: which layers come from IB vs Ethernet, how a 1 MB WRITE chunks into 256 packets, how the IB transport carries reliability over UDP/IP/Ethernet underneath.