PFC + ECN + DCQCN — the Lossless Trick
You know that RoCE v2 is IB's transport on Ethernet. The catch: Ethernet wasn't designed to be lossless. It drops frames under congestion, and TCP copes by retransmitting what was lost. RDMA's go-back-N retransmit logic is far cruder than TCP's: a single drop on a RoCE v2 fabric can tank throughput, because the loss of one packet forces retransmission of everything sent after it.
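To put a number on that cost, here is a minimal sketch that counts how many packets go back on the wire after one loss. The 256-packet in-flight window is a hypothetical figure chosen for illustration, and the "selective" function stands in for a SACK-style sender that resends only the missing packet.

```python
def packets_resent_go_back_n(in_flight: int, lost_index: int) -> int:
    """Go-back-N: resend the lost packet plus everything sent after it."""
    if not 0 <= lost_index < in_flight:
        raise ValueError("lost_index must fall inside the in-flight window")
    return in_flight - lost_index


def packets_resent_selective(in_flight: int, lost_index: int) -> int:
    """Selective retransmit: resend only the packet that went missing."""
    if not 0 <= lost_index < in_flight:
        raise ValueError("lost_index must fall inside the in-flight window")
    return 1


IN_FLIGHT = 256  # hypothetical number of packets already sent

for lost in (0, 128, 255):
    gbn = packets_resent_go_back_n(IN_FLIGHT, lost)
    sel = packets_resent_selective(IN_FLIGHT, lost)
    print(f"lost packet {lost}: go-back-N resends {gbn}, selective resends {sel}")
```

The earlier in the window the drop lands, the more work go-back-N throws away, which is why the fabric has to make drops extremely rare.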
So you can't run RDMA on raw Ethernet. You have to make it behave lossless enough that drops are extremely rare — which is exactly the PFC + ECN + DCQCN stack you configured back in Switch QoS. This page is the RoCE-transport view of that stack: why it's the thing standing between your fabric and a throughput collapse, the one tuning rule that decides whether it works, and how to confirm it's actually running on the wire.
The frame formats, headroom sizing, ECN/WRED marking curves, the DCQCN rate-control formula, and the vendor CLI were all covered in Switch QoS (ch.06), back in Phase 3. This page doesn't re-derive them — it frames them from the RoCE side and links back wherever you want the depth.
- Say why RoCE specifically needs the lossless stack: go-back-N retransmit is cruder than TCP, so one drop stalls the whole transfer.
- Place the three layers in firing order: ECN marks early (L3) → DCQCN reacts at the NIC → PFC PAUSE is the last-resort L2 net that should almost never fire.
- State the one tuning rule that matters: keep the ECN threshold well below the PFC XOFF threshold (the field-tested numbers are in Switch QoS).
- Confirm RoCE v2 + DCQCN on the wire: tcpdump 'udp port 4791', GID[3], and the NIC CNP / adp_retrans counters.
The three layers, from the RoCE side
Three mechanisms combine to approximate InfiniBand's credit-based guarantee on commodity Ethernet. RoCE v2 needs all three precisely because none of them is as clean as IB's link-layer credits — you're approximating losslessness with feedback loops instead of one hardware guarantee.
| Mechanism | Layer | Role | Mechanics & config |
|---|---|---|---|
| ECN | IP (L3) | The scalpel — marks packets (CE bit) at an early queue threshold. Per-flow, gentle, no side effects. | 6.2 — ECN |
| DCQCN | NIC (algorithm) | The glue — turns CE marks → CNP → a NIC-side rate cut, then ramps back up when the CNPs stop. | 6.2 — ECN · 6.3 — Tuning |
| PFC | Ethernet (L2) | The sledgehammer — a PAUSE frame at a late threshold stops the whole priority class. Last-resort safety net. | 6.1 — PFC |
The intended order: ECN fires first (mild signal), DCQCN brings rates down, PFC catches only what slips through.
If ECN + DCQCN are tuned right, PFC almost never fires. PFC firing in steady state is a yellow light — your fabric is closer to dropping than you wanted.
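The DCQCN step in that chain (CNP arrives, rate is cut, rate recovers when CNPs stop) can be sketched as follows. This is a simplified reading of the reaction-point rule from the published DCQCN description, not the full formula from Switch QoS: timers, byte counters, and the later additive-increase stages are left out, and the gain `g` and starting rate are assumed values.

```python
from dataclasses import dataclass


@dataclass
class ReactionPoint:
    """Toy NIC-side sender state. Units are arbitrary (think Gbps)."""
    rate: float          # current sending rate
    target: float        # rate the sender recovers toward
    alpha: float = 1.0   # running estimate of congestion severity
    g: float = 1 / 16    # assumed gain, picked so the effect is visible

    def on_cnp(self) -> None:
        # A CNP arrived: remember the old rate, then cut multiplicatively.
        self.target = self.rate
        self.rate *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_quiet_period(self) -> None:
        # No CNP for one period: decay alpha and move halfway back to target.
        self.alpha *= 1 - self.g
        self.rate = (self.target + self.rate) / 2


nic = ReactionPoint(rate=100.0, target=100.0)
nic.on_cnp()
print(f"after one CNP: {nic.rate:.2f}")
for period in range(1, 6):
    nic.on_quiet_period()
    print(f"quiet period {period}: {nic.rate:.2f}")
```

The point of the sketch is the shape: one sharp cut per CNP, then a quick climb back once the marks stop, all without touching the host CPU.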
The one tuning rule that decides everything
Whether this stack hits line rate or storms itself into a corner comes down to one relationship: the ECN marking threshold must sit well below the PFC XOFF threshold.
- ECN too high → it doesn't warn early enough → the queue fills → PFC fires → storms.
- ECN too low → DCQCN backs off constantly → throughput quietly bleeds, with no error to point at.
Most fabrics land the ECN watermark at roughly 20–30% of buffer depth and PFC XOFF at roughly 80%, but the exact watermarks, buffer profiles, and field-tested starting values are switch-specific and live in 6.3 — DCQCN, Buffer Profiles & Tuning.
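A toy queue model makes the rule concrete. Everything here is an assumption for illustration: a 1000-unit buffer, a sender offering 150 units per tick into a port that drains 100, a halve-on-mark and add-5-otherwise rate rule, and XON treated as "back below XOFF". It is a sketch of the threshold ordering, not of any real switch.

```python
def simulate(ecn_threshold: float, xoff_threshold: float, ticks: int = 200,
             line_rate: float = 150.0, drain_rate: float = 100.0) -> dict:
    """Run one sender against one egress queue and count which signal fires."""
    queue, rate = 0.0, line_rate
    paused = False
    cnps = pfc_pauses = 0
    for _ in range(ticks):
        sent = 0.0 if paused else rate          # PFC PAUSE stops the sender
        queue = max(0.0, queue + sent - drain_rate)
        paused = queue >= xoff_threshold        # late threshold: sledgehammer
        if paused:
            pfc_pauses += 1
        if queue >= ecn_threshold:              # early threshold: scalpel
            cnps += 1
            rate = max(1.0, rate * 0.5)         # toy rate cut on a mark
        else:
            rate = min(line_rate, rate + 5.0)   # toy recovery
    return {"cnps": cnps, "pfc_pauses": pfc_pauses}


# ECN at 25% of the buffer, XOFF at 80%: the ordering the rule asks for.
print("ECN below XOFF:", simulate(ecn_threshold=250, xoff_threshold=800))
# ECN at 90%, above XOFF: the early warning never gets a chance to fire.
print("ECN above XOFF:", simulate(ecn_threshold=900, xoff_threshold=800))
```

In the first run the rate cuts keep the queue far from XOFF, so PFC stays silent. In the second, the queue reaches XOFF before any mark is generated, so PFC is the only thing holding the line.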
Confirm RoCE v2 is actually on the wire
The other side of the theory is verification: use ibv_devinfo, tcpdump 'udp port 4791', and ethtool to confirm that RoCE v2 traffic is flowing and DCQCN is doing its job.
What to look for: GID[3] showing a RoCE v2 entry derived from IPv4, tcpdump capturing line-rate packets on UDP 4791, NIC counters showing CNPs (DCQCN actively rate-limiting), and zero adp_retrans (no drops triggered the IB retransmit safety net).
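The counter check is easy to script. The sketch below parses "name: value" text of the kind a NIC statistics dump produces and flags any counter whose name contains "cnp" or "adp_retrans". The counter names in the sample are hypothetical; real names differ by driver and firmware, so treat the substring match as an assumption to adapt.

```python
def parse_counters(text: str) -> dict[str, int]:
    """Parse lines shaped like '   some_counter: 123' into a dict."""
    counters = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        if sep and value.lstrip("-").isdigit():
            counters[name.strip()] = int(value)
    return counters


def roce_health(counters: dict[str, int]) -> dict[str, bool]:
    """Summarise the two signals this page cares about."""
    cnp = [v for n, v in counters.items() if "cnp" in n.lower()]
    retrans = [v for n, v in counters.items() if "adp_retrans" in n.lower()]
    return {
        "cnp_seen": any(v > 0 for v in cnp),              # DCQCN is reacting
        "retransmits_seen": any(v > 0 for v in retrans),  # drops happened
    }


# Hypothetical sample text, not captured from a real NIC.
SAMPLE = """\
NIC statistics:
     example_cnp_handled: 1042
     example_adp_retrans: 0
     rx_packets: 99
"""

print(roce_health(parse_counters(SAMPLE)))
```

Feed it the text of your own counter dump. A healthy fabric under load would be expected to show CNPs without retransmits; retransmits climbing means drops are getting past the lossless stack.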
Failure modes operators actually hit
When this stack is misconfigured, here's how it shows up on a RoCE v2 fabric — and where to look. (The switch-side diagnosis for each lives in Switch QoS and Production Operations.)
| Symptom | Likely cause | First thing to check |
|---|---|---|
| Training step time spikes from 200 ms to seconds | PFC storm — slow consumer backpressuring the fabric | PFC counters on every switch in the path; find the link with the highest pause count |
| Throughput dropped from 380 → 200 Gbps, no PFC pauses visible | ECN over-tuned (firing too aggressively) → DCQCN too conservative | ECN counter rate, CNP rate at NICs; compare to baseline |
| Random IBV_WC_RETRY_EXC_ERR in NCCL logs | Buffer overflowed despite PFC — headroom too small for cable/RTT | Switch drop counters on the RoCE priority; check headroom config |
| Half of GPUs in a job are slow | Hash polarization on ECMP (not a CC issue, but presents like one) | ECMP load distribution per leaf-spine link |
| Training hangs mid-step | PFC deadlock — cyclic dependency between buffers | Look for "victim" flow patterns; consider buffer reorganization |
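For the PFC storm row, "find the link with the highest pause count" usually means comparing two counter snapshots rather than reading lifetime totals. A minimal sketch, assuming you have already collected per-link pause counters into dictionaries; the link names and values are hypothetical, and how you collect them is switch-specific.

```python
def rank_links_by_pause_delta(before: dict[str, int],
                              after: dict[str, int]) -> list[tuple[str, int]]:
    """Return (link, pause frames since the first snapshot), busiest first."""
    deltas = {link: after[link] - before.get(link, 0) for link in after}
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)


# Hypothetical snapshots taken some seconds apart.
snapshot_t0 = {"leaf1:eth1": 10, "leaf1:eth2": 500, "spine1:eth7": 0}
snapshot_t1 = {"leaf1:eth1": 12, "leaf1:eth2": 505, "spine1:eth7": 9000}

for link, delta in rank_links_by_pause_delta(snapshot_t0, snapshot_t1):
    print(f"{link}: +{delta} pause frames")
```

Ranking by delta keeps a link with a large but stale total from hiding the one that is pausing right now.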
💡 What you should remember
| # | | Concept | Why it matters |
|---|---|---|---|
| 1 | 🧩 | Three layers, one job | Keep Ethernet lossless enough for RDMA's drop-intolerant go-back-N. |
| 2 | 🔨 | PFC = sledgehammer | Per-priority pause frames. Last-resort safety net. Firing often = problem. |
| 3 | 🔪 | ECN = scalpel | Marks packets, doesn't drop. Triggers DCQCN. Gentler, smarter signal. |
| 4 | 🔄 | DCQCN = glue | The closed-loop algorithm that turns CNPs into rate adjustments. NIC-side hardware, no CPU. |
| 5 | 🎚️ | ECN watermark below the PFC watermark | The make-or-break tuning call — full numbers in Switch QoS. |
| 6 | ⚠️ | PFC storms are the #1 failure mode at scale | A single slow consumer can wedge a whole fabric. |
| 7 | 📉 | ECN over-tuning is the #2 failure mode | Throughput dies quietly with no error. |
Next: How a RoCE v2 Transaction Actually Flows → — connect the dots: which layers come from IB vs Ethernet, how a 1 MB WRITE chunks into 256 packets, how the IB transport carries reliability over UDP/IP/Ethernet underneath.