
PFC + ECN + DCQCN — the Lossless Trick

You know that RoCE v2 is IB's transport on Ethernet. The catch: Ethernet was never designed to be lossless. TCP copes because it expects drops and recovers from them gracefully. RDMA's go-back-N retransmit logic is far cruder than TCP's: a single drop on a RoCE v2 fabric can tank throughput, because losing one packet forces retransmission of everything sent after it.
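To make the cost of go-back-N concrete, here is a minimal sketch that counts retransmitted packets after a single drop. It assumes a simplified model in which the sender learns of the loss only after `sent_before_nak` packets have already left the NIC. The function name and its parameters are illustrative and not part of any RDMA API.

```python
def resent_packets(lost: int, sent_before_nak: int, go_back_n: bool) -> int:
    """Count packets retransmitted after one drop (simplified model).

    lost            -- 0-based index of the dropped packet
    sent_before_nak -- packets already sent when the sender learns of the loss
    go_back_n       -- True: resend the lost packet and everything after it
                       False: resend only the lost packet (selective retransmit)
    """
    if not 0 <= lost < sent_before_nak:
        raise ValueError("lost must index a packet that was actually sent")
    return sent_before_nak - lost if go_back_n else 1


# Illustrative numbers: a 256-packet message, packet 10 dropped,
# and the whole message already on the wire before the NAK arrives.
print(resent_packets(10, 256, go_back_n=True))   # 246
print(resent_packets(10, 256, go_back_n=False))  # 1
```

The gap between 246 and 1 is why a drop rate that TCP would shrug off is enough to stall an RDMA transfer.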

So you can't run RDMA on raw Ethernet. You have to make it behave lossless enough that drops are extremely rare — which is exactly the PFC + ECN + DCQCN stack you configured back in Switch QoS. This page is the RoCE-transport view of that stack: why it's the thing standing between your fabric and a throughput collapse, the one tuning rule that decides whether it works, and how to confirm it's actually running on the wire.

The mechanics live in Switch QoS — this is the recap

The frame formats, headroom sizing, ECN/WRED marking curves, the DCQCN rate-control formula, and the vendor CLI were all covered in Switch QoS (ch.06), back in Phase 3. This page doesn't re-derive them — it frames them from the RoCE side and links back wherever you want the depth.

After this page, you'll be able to
  1. Say why RoCE specifically needs the lossless stack: go-back-N retransmit is cruder than TCP, so one drop stalls the whole transfer.
  2. Place the three layers in firing order: ECN marks early (L3) → DCQCN reacts at the NIC → PFC PAUSE is the last-resort L2 net that should almost never fire.
  3. State the one tuning rule that matters: keep the ECN threshold well below the PFC XOFF threshold (the field-tested numbers are in Switch QoS).
  4. Confirm RoCE v2 + DCQCN on the wire: tcpdump 'udp port 4791', GID[3], and the NIC CNP / adp_retrans counters.

The three layers, from the RoCE side

Three mechanisms combine to approximate InfiniBand's credit-based guarantee on commodity Ethernet. RoCE v2 needs all three precisely because none of them is as clean as IB's link-layer credits — you're approximating losslessness with feedback loops instead of one hardware guarantee.

| Mechanism | Layer | Role | Mechanics & config |
| --- | --- | --- | --- |
| ECN | IP (L3) | The scalpel: marks packets (CE codepoint) at an early queue threshold. Per-flow, gentle, no side effects. | 6.2 — ECN |
| DCQCN | NIC (algorithm) | The glue: turns CE marks → CNP → a NIC-side rate cut, then ramps back up when the CNPs stop. | 6.2 — ECN · 6.3 — Tuning |
| PFC | Ethernet (L2) | The sledgehammer: a PAUSE frame at a late threshold stops the whole priority class. Last-resort safety net. | 6.1 — PFC |
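The DCQCN row ("CE marks → CNP → rate cut, then ramp back up") can be sketched as a small sender-side state machine. This is a simplified illustration rather than the full algorithm: the additive and hyper increase stages are left out, and a single timer stands in for the separate alpha and rate-increase timers. The class name, method names, and the default `g = 1/256` are assumptions made for the example; the exact formula and parameters are in Switch QoS.

```python
class ReactionPoint:
    """Sender-side (NIC) rate state in a simplified DCQCN loop."""

    def __init__(self, line_rate_gbps: float, g: float = 1 / 256):
        self.line_rate = line_rate_gbps
        self.g = g                      # alpha gain; illustrative value only
        self.alpha = 1.0                # congestion estimate, starts pessimistic
        self.current = line_rate_gbps   # rate the NIC is sending at
        self.target = line_rate_gbps    # rate to recover toward

    def on_cnp(self) -> None:
        """A CNP arrived: remember the old rate, cut, then raise alpha."""
        self.target = self.current
        self.current *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_quiet_timer(self) -> None:
        """No CNP in the last timer window: decay alpha, recover halfway."""
        self.alpha *= 1 - self.g
        self.current = min(self.line_rate, (self.current + self.target) / 2)


rp = ReactionPoint(400.0)   # hypothetical 400 Gbps port
rp.on_cnp()
print(rp.current)           # 200.0
rp.on_quiet_timer()
print(rp.current)           # 300.0
```

The first CNP halves the rate because `alpha` starts at 1; once CNPs stop, each quiet window moves the rate halfway back toward where it was before the cut.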

The intended order: ECN fires first (mild signal), DCQCN brings rates down, PFC catches only what slips through.

If ECN + DCQCN are tuned right, PFC almost never fires. PFC firing in steady state is a yellow light — your fabric is closer to dropping than you wanted.
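The firing order can be pictured as a function of queue depth on one switch port. The sketch below is a deliberate simplification: real ECN/WRED marking is probabilistic between a minimum and maximum threshold, while here a single threshold is used. The function name and all threshold values are hypothetical.

```python
def queue_action(depth_kb: int, ecn_kb: int, xoff_kb: int, limit_kb: int) -> str:
    """Return which mechanism reacts at a given queue depth.

    ecn_kb   -- depth at which the switch starts marking CE
    xoff_kb  -- depth at which the switch sends a PFC PAUSE upstream
    limit_kb -- XOFF plus headroom; beyond this the buffer overflows
    """
    if not ecn_kb < xoff_kb < limit_kb:
        raise ValueError("expected ecn_kb < xoff_kb < limit_kb")
    if depth_kb >= limit_kb:
        return "drop"        # headroom exhausted: the outcome the stack exists to prevent
    if depth_kb >= xoff_kb:
        return "pfc_pause"   # last-resort L2 backpressure
    if depth_kb >= ecn_kb:
        return "ecn_mark"    # early, per-flow signal that feeds DCQCN
    return "forward"


# Made-up thresholds, chosen only to show the ordering.
for depth in (100, 300, 850, 1100):
    print(depth, queue_action(depth, ecn_kb=250, xoff_kb=800, limit_kb=1000))
```

In a healthy fabric the queue spends its time in the `forward` and `ecn_mark` bands; reaching `pfc_pause` in steady state is the yellow light described above.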


The one tuning rule that decides everything

Whether this stack hits line rate or storms itself into a corner comes down to one relationship: the ECN marking threshold must sit well below the PFC XOFF threshold.

  • ECN too high → it doesn't warn early enough → the queue fills → PFC fires → storms.
  • ECN too low → DCQCN backs off constantly → throughput quietly bleeds, with no error to point at.

Most fabrics land the ECN watermark at roughly 20–30% of buffer depth and PFC XOFF at roughly 80%, but the exact watermarks, buffer profiles, and field-tested starting values are switch-specific and live in 6.3 — DCQCN, Buffer Profiles & Tuning.
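As a worked example of the rule, the sketch below turns a buffer depth into starting watermarks and refuses any configuration where ECN does not sit below XOFF. The default percentages come from the rough ranges above and are only starting points; the function name and the 1000 KB buffer are hypothetical, and real values are switch-specific.

```python
def starting_watermarks(buffer_kb: int, ecn_pct: int = 25, xoff_pct: int = 80) -> dict:
    """Derive rough ECN and PFC XOFF watermarks from a buffer depth.

    Raises ValueError when the ECN watermark is not below the XOFF
    watermark, because PFC would then fire before ECN can warn.
    """
    if not 0 < ecn_pct < xoff_pct <= 100:
        raise ValueError("need 0 < ecn_pct < xoff_pct <= 100")
    ecn_kb = buffer_kb * ecn_pct // 100
    xoff_kb = buffer_kb * xoff_pct // 100
    # gap_kb is the room DCQCN has to slow senders before PFC kicks in
    return {"ecn_kb": ecn_kb, "xoff_kb": xoff_kb, "gap_kb": xoff_kb - ecn_kb}


print(starting_watermarks(1000))
# {'ecn_kb': 250, 'xoff_kb': 800, 'gap_kb': 550}
```

The `gap_kb` value is the quantity to watch: it is the buffer DCQCN gets to work with between the first CE mark and the first PAUSE frame.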


Confirm RoCE v2 is actually on the wire

The other side of the theory: verifying with ibv_devinfo, tcpdump 'udp port 4791', and ethtool that RoCE v2 traffic is flowing and DCQCN is doing its job:

MODULE roce-v2 · LAB 1: Watch the recording (every command, every counter, every output).

Highlights: GID[3] showing RoCE v2 derived from IPv4, tcpdump capturing line-rate packets on UDP 4791, NIC counters showing CNPs (DCQCN actively rate-limiting) and zero adp_retrans (no drops triggered the IB retransmit safety net).
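The counter check in the highlights can be sketched as a comparison of two snapshots taken a few seconds apart. Counter names vary by NIC and driver; `rp_cnp_handled` and `roce_adp_retrans` are assumed here, so check what your device actually exposes. The snapshot values are made up for illustration and are not real output.

```python
def parse_counters(text: str) -> dict[str, int]:
    """Parse 'name: value' lines into a dict, skipping anything else."""
    counters = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            counters[name.strip()] = int(value)
    return counters


def roce_health(before: dict, after: dict,
                cnp_key: str = "rp_cnp_handled",         # assumed counter name
                retrans_key: str = "roce_adp_retrans"    # assumed counter name
                ) -> dict:
    """Compare two snapshots: rising CNPs mean DCQCN is reacting,
    rising retransmits mean drops reached the transport."""
    cnp_delta = after.get(cnp_key, 0) - before.get(cnp_key, 0)
    retrans_delta = after.get(retrans_key, 0) - before.get(retrans_key, 0)
    return {
        "cnp_delta": cnp_delta,
        "retrans_delta": retrans_delta,
        "dcqcn_active": cnp_delta > 0,
        "retransmits_seen": retrans_delta > 0,
    }


# Made-up snapshots, for illustration only.
BEFORE = "rp_cnp_handled: 1200\nroce_adp_retrans: 0\n"
AFTER = "rp_cnp_handled: 4800\nroce_adp_retrans: 0\n"
print(roce_health(parse_counters(BEFORE), parse_counters(AFTER)))
```

CNPs climbing while retransmits stay flat is the healthy pattern: congestion is being signaled and handled before anything is dropped.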


Failure modes operators actually hit

When this stack is misconfigured, here's how it shows up on a RoCE v2 fabric — and where to look. (The switch-side diagnosis for each lives in Switch QoS and Production Operations.)

| Symptom | Likely cause | First thing to check |
| --- | --- | --- |
| Training step time spikes from 200 ms to seconds | PFC storm: slow consumer backpressuring the fabric | PFC counters on every switch in the path; find the link with the highest pause count |
| Throughput dropped from 380 → 200 Gbps, no PFC pauses visible | ECN over-tuned (firing too aggressively) → DCQCN too conservative | ECN counter rate, CNP rate at NICs; compare to baseline |
| Random IBV_WC_RETRY_EXC_ERR in NCCL logs | Buffer overflowed despite PFC: headroom too small for cable/RTT | Switch drop counters on the RoCE priority; check headroom config |
| Half of GPUs in a job are slow | Hash polarization on ECMP (not a CC issue, but presents like one) | ECMP load distribution per leaf-spine link |
| Training hangs mid-step | PFC deadlock: cyclic dependency between buffers | Look for "victim" flow patterns; consider buffer reorganization |

💡 What you should remember

| # | Concept | Why it matters |
| --- | --- | --- |
| 1 | 🧩 Three layers, one job | Keep Ethernet lossless enough for RDMA's drop-intolerant go-back-N. |
| 2 | 🔨 PFC = sledgehammer | Per-priority pause frames. Last-resort safety net. Firing often = problem. |
| 3 | 🔪 ECN = scalpel | Marks packets, doesn't drop. Triggers DCQCN. Gentler, smarter signal. |
| 4 | 🔄 DCQCN = the closed-loop algorithm that turns CNPs into rate adjustments | NIC-side hardware, no CPU. |
| 5 | 🎚️ ECN watermark below the PFC watermark | The make-or-break tuning call; full numbers in Switch QoS. |
| 6 | ⚠️ PFC storms are the #1 failure mode at scale | A single slow consumer can wedge a whole fabric. |
| 7 | 📉 ECN over-tuning is the #2 | Throughput dies quietly with no error. |

Next: How a RoCE v2 Transaction Actually Flows → — connect the dots: which layers come from IB vs Ethernet, how a 1 MB WRITE chunks into 256 packets, how the IB transport carries reliability over UDP/IP/Ethernet underneath.