
PFC + ECN + DCQCN — the Lossless Trick

You know that RoCE v2 is IB's transport on Ethernet. The catch: Ethernet wasn't designed to be lossless. TCP exists because Ethernet drops, and TCP retransmits. RDMA's go-back-N retransmit logic is far cruder than TCP's — a single drop on a RoCE v2 fabric can kill throughput.

So you can't run RDMA on raw Ethernet. You need to make Ethernet lossless enough that drops are extremely rare. That's what PFC, ECN, and DCQCN do together — three mechanisms that combine to approximate InfiniBand's credit-based guarantee.

Worth saying upfront: this is the trickiest part of operating a RoCE v2 fabric. Tuning the three knobs correctly is the difference between a fabric that hits line rate and one that PFC-storms itself into a corner.


The three layers

| Mechanism | Layer | When it fires | What it does |
|---|---|---|---|
| ECN (Explicit Congestion Notification) | IP (L3) | Queue depth crosses an early threshold | Marks packets with the CE bit; receiver sends a CNP back; sender slows down. |
| DCQCN (Data Center QCN) | NIC (algorithm) | Sender receives CNPs | Reduces transmit rate; ramps back up when CNPs stop. |
| PFC (Priority Flow Control) | Ethernet (L2) | Queue depth crosses a late (XOFF) threshold | Sends a PAUSE frame upstream; sender stops on that priority class. |

The intended order: ECN fires first (mild signal), DCQCN brings rates down, PFC is the last-resort safety net so the buffer doesn't actually overflow.

If ECN + DCQCN are tuned right, PFC almost never fires. PFC firing in production is a yellow light — your fabric is closer to dropping than you wanted.


1. PFC — the last-resort backpressure

PFC (IEEE 802.1Qbb) lets a switch send a PAUSE frame upstream telling the sender to stop transmitting on a specific priority class. The sender stops for pause_quanta (typically a few hundred microseconds), then resumes.
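
The pause duration in a PFC frame is expressed in quanta, where one quantum is 512 bit times, so the same quanta value means less wall-clock pause on a faster link. A minimal sketch of that conversion (function name and printout are illustrative, not from any tool):

```python
# Hedged sketch: convert a PFC pause_quanta value into wall-clock pause time.
# One pause quantum is 512 bit times (IEEE 802.3 definition), so the same
# quanta value pauses for less time on a faster link.

def pause_time_us(pause_quanta: int, line_rate_gbps: float) -> float:
    """Pause duration in microseconds for a given quanta value and link speed."""
    bit_time_s = 1.0 / (line_rate_gbps * 1e9)        # seconds per bit
    return pause_quanta * 512 * bit_time_s * 1e6     # quanta -> bits -> microseconds

# 65535 is the maximum quanta value a single PAUSE frame can request.
for rate in (100, 200, 400):
    print(f"{rate} Gbps: max pause ≈ {pause_time_us(65535, rate):.1f} µs")
# At 400 Gbps even the maximum pause is under 100 µs, which is why switches
# keep re-sending PAUSE frames while the buffer stays above XOFF.
```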

[Figure: PFC mechanics across three nodes. The sender NIC transmits to a middle switch, which transmits to a receiver NIC. The receiver consumes slowly, so the switch's egress buffer fills; when it crosses the XOFF watermark, the switch sends a PFC PAUSE frame back upstream, telling the sender to stop on priority 3 (the RoCE class). The PAUSE arrives after congestion has already started, so the bytes in flight during that round trip must fit in the reserved headroom or they get dropped anyway.]
PFC pauses *after* congestion has started. The bytes already in flight need headroom buffer to land in; sizing that headroom for line rate and round-trip propagation over the cable is the operator's problem.

The mechanics:

  1. Each port has per-priority buffers. RoCE v2 traffic typically lands on priority 3 (DSCP 26 → TC3).
  2. When the buffer for that priority crosses the XOFF watermark, the switch sends a PFC PAUSE frame upstream.
  3. The upstream sender stops transmitting for that priority only — other traffic classes keep flowing.
  4. When the buffer drains below the XON watermark, the switch sends a PAUSE-zero (resume) frame.
  5. The sender resumes.
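
The XOFF/XON pair is a hysteresis band: pause when the queue rises past XOFF, resume only once it has drained below XON, so the switch doesn't flap. A toy sketch of that decision logic, with made-up class and threshold names:

```python
# Toy sketch of the XOFF/XON hysteresis described above. The class and
# thresholds are illustrative, not a real switch API.

class PriorityQueueState:
    def __init__(self, xoff_bytes: int, xon_bytes: int):
        assert xon_bytes < xoff_bytes, "XON must sit below XOFF or the port oscillates"
        self.xoff = xoff_bytes
        self.xon = xon_bytes
        self.paused = False   # have we told the upstream sender to stop?

    def on_depth_change(self, depth_bytes: int):
        # Crossing XOFF upward: ask upstream to stop this priority only.
        if not self.paused and depth_bytes >= self.xoff:
            self.paused = True
            return "send PFC PAUSE (quanta > 0) for this priority"
        # Draining below XON: send a zero-quanta PAUSE to resume.
        if self.paused and depth_bytes <= self.xon:
            self.paused = False
            return "send PFC PAUSE (quanta = 0) to resume"
        return None
```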

The catch — headroom:

Between the moment the switch decides "I need to pause" and the moment the upstream sender actually stops, a bunch of bytes are already in the air. At 400 Gbps over a 100m cable, you can have ~50 KB in flight. That's data the switch has to absorb anyway. If the buffer doesn't have headroom reserved for those bytes, they get dropped despite the PAUSE. So PFC needs careful sizing: headroom ≥ pause_response_time × line_rate.
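
A back-of-envelope version of that sizing, assuming roughly 5 ns/m propagation in fiber and a 1 µs pause response time (both illustrative numbers, not vendor guidance):

```python
# Hedged back-of-envelope for the headroom formula above. The defaults are
# illustrative; real deployments use the switch vendor's headroom calculator.

def headroom_bytes(line_rate_gbps: float,
                   cable_m: float,
                   response_time_us: float = 1.0,
                   prop_ns_per_m: float = 5.0) -> float:
    """Bytes that can arrive between 'decide to pause' and 'sender actually stops'."""
    # Round trip on the cable: the PAUSE travels upstream, the last bytes travel back down.
    rtt_s = 2 * cable_m * prop_ns_per_m * 1e-9
    # Plus the time the upstream sender needs to react to the PAUSE frame.
    exposure_s = rtt_s + response_time_us * 1e-6
    return exposure_s * line_rate_gbps * 1e9 / 8

print(f"{headroom_bytes(400, 100) / 1024:.0f} KiB")   # 400 Gbps, 100 m cable
```

For 400 Gbps over 100 m this lands near 100 KB: the ~50 KB of round-trip flight bytes quoted above, plus what keeps arriving while the sender reacts.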

The really nasty catch — PFC storms:

PAUSE frames act hop by hop: a paused upstream switch fills its own buffers and then sends PAUSE frames further back. If a slow consumer causes its leaf to pause its spine, the spine then pauses its other leaves, and so on. A single misbehaving GPU can backpressure the entire fabric — a PFC storm. Symptoms: training step time goes from 200ms to 8 seconds, monitoring shows pause counts climbing on every switch. The fix is usually to isolate the slow consumer; the diagnosis is non-trivial.


2. ECN — the warning that fires earlier

ECN (RFC 3168) is a 2-bit field in the IP header. Switches set the CE (Congestion Experienced) mark on packets when they detect congestion — before the queue is full enough to drop packets or trigger PFC.

The interesting trick: ECN doesn't drop packets, it just marks them. The bytes still flow. But the receiver sees the mark and sends a signal back — that's the CNP.
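
For concreteness, here is where those bits live: DSCP occupies the top 6 bits of the IPv4 TOS byte and ECN the bottom 2, with the standard RFC 3168 codepoints. A small sketch, using the DSCP 26 RoCE convention mentioned earlier as the example:

```python
# Hedged sketch: where ECN lives in the IPv4 TOS byte (RFC 3168). DSCP is the
# top 6 bits, ECN the bottom 2. The codepoint values below are standard.

NOT_ECT, ECT_1, ECT_0, CE = 0b00, 0b01, 0b10, 0b11

def describe_tos(tos: int) -> str:
    dscp, ecn = tos >> 2, tos & 0b11
    names = {NOT_ECT: "Not-ECT", ECT_1: "ECT(1)", ECT_0: "ECT(0)",
             CE: "CE (congestion experienced)"}
    return f"DSCP {dscp}, ECN {names[ecn]}"

# RoCE v2 traffic sent with DSCP 26 and ECT(0), then remarked CE by a congested switch:
print(describe_tos((26 << 2) | ECT_0))   # DSCP 26, ECN ECT(0)
print(describe_tos((26 << 2) | CE))      # DSCP 26, ECN CE (congestion experienced)
```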

Why ECN is the smart signal and PFC is the dumb one:

|  | ECN | PFC |
|---|---|---|
| Granularity | Per-flow (per QP) | Per-priority class on a whole port |
| Reaction time | One RTT (CNP round trip) | Microseconds |
| Side effects | None (it's just a bit) | Stops all traffic on that priority — even unrelated flows |
| Risk | None | Storms, head-of-line blocking, deadlock |
| Tuning surface | One watermark per queue | Watermarks + headroom + pause time + buffer carving |

ECN is gentle and surgical. PFC is a sledgehammer. You want the sledgehammer in reserve.


3. DCQCN — the closed-loop glue

ECN gives the network a way to mark packets. CNPs give the receiver a way to notify the sender. But there has to be an algorithm that decides what to do with those signals — how much to slow down, when to speed back up. That's DCQCN.

[Figure: the DCQCN feedback loop. The sender NIC (RP, Reaction Point) transmits at a high rate toward a switch. When the switch's queue crosses the ECN threshold, the switch marks packets with the CE bit. The marked packets reach the receiver NIC (NP, Notification Point), which emits a Congestion Notification Packet (CNP) back to the sender. The sender's DCQCN logic cuts its transmit rate (multiplicative decrease); once marks and CNPs stop, it ramps the rate back up (additive increase, then hyper-increase after enough quiet cycles). The whole loop runs in NIC hardware without app or kernel involvement.]
ECN fires *before* PFC by design. The CNP carries the bad news back. The sender slows down. Goal: keep queue depths low enough that PFC almost never has to fire.

The algorithm at a high level (a toy sketch follows the list):

  • Sender NIC = RP (Reaction Point). Receives CNPs. Adjusts rate per QP.
  • Receiver NIC = NP (Notification Point). Sees ECN-marked packets. Emits CNPs back — usually rate-limited to at most one CNP per flow per short time window (tens of microseconds), not one per marked packet.
  • On CNP: RP does a multiplicative decrease, scaled by a running α parameter that tracks how often CNPs have been arriving — the more CNPs, the deeper the cut.
  • No CNP for K cycles: RP does an additive increase, gradually ramping rate back up.
  • No CNP for longer: hyper-increase — aggressive ramp to recover faster.
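
A toy model of that reaction-point behavior (the class name, constants, and simplified increase schedule are illustrative; the SIGCOMM 2015 paper mentioned below describes the full state machine):

```python
# Toy sketch of the DCQCN reaction-point rate update. Constants and structure
# are illustrative; real NICs implement this in hardware with vendor-tuned
# parameters.

class ReactionPoint:
    def __init__(self, line_rate_gbps: float, g: float = 1 / 256):
        self.rc = line_rate_gbps      # current transmit rate
        self.rt = line_rate_gbps      # target rate to recover toward
        self.alpha = 1.0              # running estimate of congestion severity
        self.g = g                    # EWMA gain for alpha
        self.quiet_periods = 0        # update periods with no CNP seen

    def on_cnp(self):
        # Multiplicative decrease, scaled by alpha: frequent CNPs -> deeper cuts.
        self.rt = self.rc
        self.rc = self.rc * (1 - self.alpha / 2)
        self.alpha = (1 - self.g) * self.alpha + self.g
        self.quiet_periods = 0

    def on_timer_no_cnp(self, line_rate_gbps: float, ai_gbps: float = 5.0):
        # No CNPs this period: alpha decays and the rate recovers toward the target.
        self.alpha = (1 - self.g) * self.alpha
        self.quiet_periods += 1
        if self.quiet_periods > 5:             # hyper-increase: raise the target faster
            self.rt = min(line_rate_gbps, self.rt + 5 * ai_gbps)
        else:                                   # additive increase
            self.rt = min(line_rate_gbps, self.rt + ai_gbps)
        self.rc = (self.rc + self.rt) / 2       # recover toward the target

rp = ReactionPoint(line_rate_gbps=400)
rp.on_cnp(); rp.on_cnp()
print(f"after 2 CNPs: {rp.rc:.0f} Gbps")        # cut hard by back-to-back CNPs
for _ in range(10):
    rp.on_timer_no_cnp(line_rate_gbps=400)
print(f"after a quiet stretch: {rp.rc:.0f} Gbps")  # climbing back toward line rate
```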

Microsoft published the original DCQCN paper at SIGCOMM 2015. Most production RoCE v2 fabrics in 2026 still use a DCQCN-family algorithm with vendor-specific parameter tuning.


The watermark stack that makes it work

The make-or-break tuning decision: where to put the ECN threshold relative to the PFC threshold.

buffer fill
   ▲
   │ max
   │
   │ XOFF (late)   ──→ PFC PAUSE fires
   │
   │ ECN  (early)  ──→ packets get marked
   │
   0

  • ECN threshold should be substantially below the PFC threshold
  • ECN gives DCQCN time to react and slow the sender before the queue fills
  • PFC is only there to catch the case where DCQCN didn't react fast enough (bursty traffic, sudden topology change, etc.)

If ECN threshold is too high → ECN doesn't fire early enough → queue fills → PFC fires → storms.

If ECN threshold is too low → DCQCN keeps backing off unnecessarily → throughput suffers, training step time goes up.

Most production fabrics target the ECN watermark around 20–30% of buffer depth and PFC XOFF around 80%, but exact numbers depend on switch silicon, queue size, RTT, and workload.
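
A sketch of that rule of thumb as arithmetic (the fractions mirror the text; a real deployment derives them from switch silicon, shared-buffer carving, and measured RTT):

```python
# Hedged sketch of the watermark rule of thumb above. Percentages and the
# example queue size are illustrative, not vendor guidance.

def watermarks(buffer_bytes: int,
               ecn_frac: float = 0.25,      # ECN mark point, ~20-30% of the queue
               xoff_frac: float = 0.80):    # PFC XOFF, ~80% of the queue
    ecn = int(buffer_bytes * ecn_frac)
    xoff = int(buffer_bytes * xoff_frac)
    headroom = buffer_bytes - xoff           # what's left to absorb in-flight bytes
    assert ecn < xoff, "ECN must fire well before PFC or you get storms"
    return ecn, xoff, headroom

ecn, xoff, headroom = watermarks(2 * 1024 * 1024)   # e.g. a 2 MiB per-priority queue
print(f"ECN at {ecn // 1024} KiB, XOFF at {xoff // 1024} KiB, {headroom // 1024} KiB left above XOFF")
```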


Confirm RoCE v2 is actually on the wire

The other side of the theory: verifying with ibv_devinfo, tcpdump 'udp port 4791', and ethtool that RoCE v2 traffic is flowing and DCQCN is doing its job:

MODULE roce-v2 · LAB 1: watch the recording or run the real environment in your browser via Codespaces.

Highlights: GID[3] showing RoCE v2 derived from IPv4, tcpdump capturing line-rate packets on UDP 4791, NIC counters showing CNPs (DCQCN actively rate-limiting) and zero adp_retrans (no drops).
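
If you want to watch the DCQCN side from the host, mlx5-family NICs typically expose the relevant hardware counters under sysfs. The exact names and path below are examples and vary by driver and firmware; treat this as a sketch, not a guaranteed interface:

```python
# Hedged sketch: poll DCQCN-related hardware counters that mlx5 NICs typically
# expose under sysfs. Device path and counter names vary by driver/firmware.
from pathlib import Path

DEV, PORT = "mlx5_0", "1"                      # adjust to your RDMA device
CTRS = Path(f"/sys/class/infiniband/{DEV}/ports/{PORT}/hw_counters")

def read(name: str) -> int:
    p = CTRS / name
    return int(p.read_text()) if p.exists() else -1   # -1: counter not exposed here

# Receiver side (NP): packets that arrived CE-marked, CNPs sent back.
# Sender side (RP): CNPs acted on. A rising out_of_sequence count means real drops.
for name in ("np_ecn_marked_roce_packets", "np_cnp_sent",
             "rp_cnp_handled", "out_of_sequence"):
    print(f"{name:30s} {read(name)}")
```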


Failure modes operators actually hit

| Symptom | Likely cause | First thing to check |
|---|---|---|
| Training step time spikes from 200ms to seconds | PFC storm — slow consumer backpressuring the fabric | PFC counters on every switch in the path; find the link with the highest pause count |
| Throughput dropped from 380 Gbps to 200 Gbps, no PFC pauses visible | ECN over-tuned (firing too aggressively) → DCQCN too conservative | ECN mark rate and CNP rate at the NICs; compare to baseline |
| Random IBV_WC_RETRY_EXC_ERR in NCCL logs | Buffer overflowed despite PFC — headroom too small for cable/RTT | Switch drop counters on the RoCE priority; check headroom config |
| Half of the GPUs in a job are slow | Hash polarization on ECMP (not a CC issue, but presents like one) | ECMP load distribution per leaf-spine link |
| Training hangs mid-step | PFC deadlock — cyclic dependency between buffers | Look for "victim" flow patterns; consider buffer reorganization |

What you should remember

  • Three layers, one job: keep Ethernet lossless enough for RDMA.
  • PFC = sledgehammer. Per-priority pause frames. Last-resort safety net. Firing often = problem.
  • ECN = scalpel. Marks packets, doesn't drop. Triggers DCQCN. Gentler, smarter signal.
  • DCQCN = the closed-loop algorithm that turns CNPs into rate adjustments. NIC-side hardware.
  • Tuning the ECN watermark below the PFC watermark is the make-or-break operational decision.
  • PFC storms are the #1 failure mode at scale — a single slow consumer can wedge a whole fabric.
  • ECN over-tuning is the #2 — throughput dies quietly with no error.

Next: How a RoCE v2 Transaction Actually Flows → — connect the dots: which layers come from IB vs Ethernet, how a 1 MB WRITE chunks into 256 packets, how the IB transport carries reliability over UDP/IP/Ethernet underneath.