PFC — Priority Flow Control

PFC (IEEE 802.1Qbb) is the link-level mechanism that makes Ethernet "lossless." When a switch's ingress buffer fills up on a particular priority class, it sends a PAUSE frame upstream telling the sender to stop sending on that class. The sender stops and the buffer drains. Traffic resumes when the switch sends an "unpause" (a PAUSE frame with zero quanta) or when the pause timer expires.

The whole point: no drops under congestion on the priority class carrying RoCE v2 traffic. RDMA's go-back-N retransmit model doesn't tolerate drops gracefully, so we engineer the fabric not to drop packets on that class.

Figure: PFC mechanics across three nodes (sender NIC, middle switch, receiver NIC), with traffic flowing left to right. When the middle switch's buffer crosses the XOFF watermark on priority 3, it sends a PFC PAUSE frame upstream on that priority only, and the sender stops on that class for the pause quanta. Bytes in flight during the round trip must fit in reserved headroom or they drop anyway.

Caption: PFC pauses *after* congestion starts. The bytes already in flight need headroom buffer to land in; that's the operator's sizing problem.

This page explains how PFC actually works and how to configure it without creating worse problems.

The PAUSE frame

A PFC PAUSE frame is a special Ethernet frame:

EtherType 0x8808 (MAC Control)
Opcode 0x0101 (PFC, not the older 802.3x global pause)
Vector: a 16-bit field whose low 8 bits map one per priority class (0–7). Tells the receiver which priorities to pause.
Quanta: 8 × 16-bit values, one per priority. Each quantum is 512 bit times of pause duration. At 400G, one quantum ≈ 1.28 ns; a full pause field (65,535 quanta) can stop a sender for up to ~84 μs.

The sender's NIC sees the PAUSE frame and immediately stops transmitting on the indicated priorities. It does not stop other priorities — that's the "P" in PFC.
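The field layout above can be sketched in Python. This is a minimal illustration, not a reference encoder: the function names are hypothetical, the destination address 01:80:C2:00:00:01 is the reserved MAC Control multicast address (not stated in the text above), and the frame is built without its trailing FCS.

```python
import struct

PFC_DEST_MAC = bytes.fromhex("0180c2000001")  # reserved MAC Control multicast address
ETHERTYPE_MAC_CONTROL = 0x8808
OPCODE_PFC = 0x0101
QUANTUM_BITS = 512  # one pause quantum = 512 bit times


def build_pfc_frame(src_mac: bytes, quanta_by_priority: dict) -> bytes:
    """Build a PFC frame (without FCS) for the given {priority: quanta} map."""
    if len(src_mac) != 6:
        raise ValueError("src_mac must be 6 bytes")
    enable_vector = 0
    quanta = [0] * 8
    for priority, value in quanta_by_priority.items():
        if not 0 <= priority <= 7:
            raise ValueError("priority must be 0..7")
        if not 0 <= value <= 0xFFFF:
            raise ValueError("quanta must fit in 16 bits")
        enable_vector |= 1 << priority  # mark this priority's timer as valid
        quanta[priority] = value
    # Opcode, 16-bit enable vector, then eight 16-bit timers, all big-endian.
    payload = struct.pack("!HH8H", OPCODE_PFC, enable_vector, *quanta)
    header = PFC_DEST_MAC + src_mac + struct.pack("!H", ETHERTYPE_MAC_CONTROL)
    return (header + payload).ljust(60, b"\x00")  # pad to minimum frame size


def pause_duration_ns(quanta: int, link_bps: float) -> float:
    """Convert a quanta count into nanoseconds at the given link speed."""
    return quanta * QUANTUM_BITS / link_bps * 1e9


frame = build_pfc_frame(bytes.fromhex("020000000001"), {3: 0xFFFF})
print(frame[:26].hex())
print(pause_duration_ns(0xFFFF, 400e9))
```

Passing a quanta value of 0 for an enabled priority produces the "unpause" form of the same frame.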

Per-priority — why this matters

The older 802.3x pause was global: one PAUSE frame stopped all traffic on the link. That is useless in modern DCs, because management, storage, and control traffic would freeze whenever bulk data congested the link.

PFC is per-priority — you assign each traffic class (CoS / DSCP) to a separate priority queue at the egress, and PFC pauses only the priority that's full. RoCE v2 traffic typically lives on priority 3 by convention (sometimes 4 — depends on the operator).

Practical implication: every switch needs a QoS classification that maps RoCE v2 traffic into the lossless priority, and PFC must be enabled for that priority.

ingress: classify RoCE v2 → priority 3
queue: priority 3 → lossless queue (PFC enabled)
egress: priority 3 → lossless queue (PFC monitored)
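The classification step can be modelled in a few lines of Python. This is only a sketch of the mapping logic, assuming the DSCP 26 → priority 3 convention used on this page; the table and function names are hypothetical, and a real switch does this in hardware from its QoS maps.

```python
# Assumed fabric-wide convention: DSCP 26 carries RoCE v2 and maps to priority 3.
DSCP_TO_PRIORITY = {26: 3}
LOSSLESS_PRIORITIES = {3}
DEFAULT_PRIORITY = 0  # everything unclassified lands in a lossy queue


def classify(tos_byte: int) -> tuple:
    """Map an IPv4 ToS / IPv6 Traffic Class byte to (priority, is_lossless)."""
    dscp = (tos_byte >> 2) & 0x3F  # DSCP is the upper 6 bits; the low 2 are ECN
    priority = DSCP_TO_PRIORITY.get(dscp, DEFAULT_PRIORITY)
    return priority, priority in LOSSLESS_PRIORITIES


# DSCP 26 with ECN bits 0b10 is ToS byte 0x6A.
print(classify(0x6A))
print(classify(0x00))
```

The point of the sketch is that the lossless decision follows from the priority, so a packet marked with the wrong DSCP silently ends up in a queue where PFC does not protect it.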

Buffers and headroom

PFC works by stopping the sender before the buffer overflows. That requires the switch to send the PAUSE frame early enough that even in-flight bytes can fit when they arrive.

xoff threshold — buffer occupancy at which the switch starts sending PAUSEs.
headroom — buffer reserved above xoff, sized to absorb the in-flight bytes that arrive after the PAUSE is sent but before the sender stops.
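The interaction between the two thresholds can be sketched as a small state machine. This is a toy model under stated assumptions: the class name is hypothetical, an XON (resume) threshold is added even though the text above only names xoff, and real switches account for buffers in cells rather than bytes.

```python
from dataclasses import dataclass


@dataclass
class LosslessQueue:
    xoff_bytes: int       # occupancy at which a PAUSE is sent
    xon_bytes: int        # assumed resume threshold, below xoff
    headroom_bytes: int   # space reserved above xoff for in-flight bytes
    occupancy: int = 0
    paused_upstream: bool = False
    dropped: int = 0

    def enqueue(self, nbytes: int) -> bool:
        """Accept arriving bytes; return True if this arrival triggers a PAUSE."""
        if self.occupancy + nbytes > self.xoff_bytes + self.headroom_bytes:
            self.dropped += nbytes  # headroom exhausted: "lossless" is broken
            return False
        self.occupancy += nbytes
        if not self.paused_upstream and self.occupancy >= self.xoff_bytes:
            self.paused_upstream = True
            return True
        return False

    def dequeue(self, nbytes: int) -> bool:
        """Drain bytes; return True if this drain triggers a resume."""
        self.occupancy = max(0, self.occupancy - nbytes)
        if self.paused_upstream and self.occupancy <= self.xon_bytes:
            self.paused_upstream = False
            return True
        return False


q = LosslessQueue(xoff_bytes=10_000, xon_bytes=5_000, headroom_bytes=1_500)
print(q.enqueue(10_000))  # crosses xoff
q.enqueue(1_500)          # in-flight bytes land in headroom
print(q.dropped)
```

Anything that arrives after the headroom is full is counted in `dropped`, which is the failure the sizing rule is meant to prevent.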

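To a first approximation, headroom is 2 × cable_propagation_delay × link_bandwidth, the data that fits in one round trip over the cable. Real sizing also adds a maximum-size frame that may already be mid-transmission at each end and the sender's PAUSE response time, so treat the figures below as the cable-only term: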

Cable type	Length	One-way prop delay	Headroom @ 400G
DAC / passive copper	3 m	~15 ns	~1.5 KB
AOC / fiber	30 m	~150 ns	~15 KB
Fiber across rack	100 m	~500 ns	~50 KB

Get headroom wrong → drops happen anyway → "lossless" isn't. Most modern switches calculate headroom dynamically based on cable length detection. Older switches require manual configuration.
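The cable-only headroom figures in the table can be reproduced with a short calculation. This is a sketch under the table's own assumptions: roughly 5 ns of one-way delay per metre (which is what the three rows imply), with hypothetical function names. It does not include frame-size or response-time terms.

```python
NS_PER_METER = 5  # implied by the table rows: 3 m -> ~15 ns, 100 m -> ~500 ns


def prop_delay_ns(length_m: float) -> float:
    """Approximate one-way propagation delay for a cable of the given length."""
    return NS_PER_METER * length_m


def headroom_bytes(one_way_delay_ns: float, link_gbps: float) -> float:
    """Bytes in flight during one cable round trip at the given link speed."""
    round_trip_bits = 2 * one_way_delay_ns * link_gbps  # ns x Gbit/s = bits
    return round_trip_bits / 8


for name, length_m in [("DAC", 3), ("AOC", 30), ("Fiber", 100)]:
    delay = prop_delay_ns(length_m)
    print(name, length_m, "m:", headroom_bytes(delay, 400), "bytes")
```

Changing `link_gbps` shows why headroom has to be revisited on every speed upgrade: the same cable needs twice the reserve at twice the rate.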

Head-of-line blocking — PFC's biggest sin

PFC works on priority, not on destination. When PFC pauses priority 3 on an ingress port, all priority-3 traffic from that ingress is paused — even if it's destined for a port that's not congested.

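So if Server A's NIC is sending PFC PAUSEs to its ToR (because Server A's receive buffer is full under incast), the ToR stops transmitting priority 3 toward A, and traffic destined for A backs up in the ToR's buffers. Once that backlog crosses XOFF on the ToR's ingress ports, the ToR pauses those ports too, uplinks included. Other servers connected to that ToR then can't send their priority-3 traffic to anyone until A's pause clears. Head-of-line blocking on the ingress.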

This compounds across the fabric — the ToR PAUSEs the spine, the spine PAUSEs other ToRs, etc. PFC storms propagate backward through the network.

The mitigation:

ECN does the real work — by the time PFC fires, you've already lost. DCQCN should have dialed back rates before PFC was needed. Aim for "PFC fires rarely."
Per-port PFC counters — monitor pause frames sent and pause frames received per port. Sudden spikes = a hot spot somewhere upstream.
PFC watchdog — most switches can detect "stuck" PAUSE conditions and forcibly clear them after a timeout (typically 100 ms). Saves the fabric from a runaway storm.
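The watchdog idea in the last item can be sketched as a timer over periodic observations. This is an illustrative model only: the class and parameter names are hypothetical, the 100 ms default is the typical value quoted above, and real watchdogs run in switch silicon or firmware with vendor-specific detection and recovery actions.

```python
from typing import Optional


class PfcWatchdog:
    """Flags a queue that stays paused without draining for a full timeout."""

    def __init__(self, timeout_ms: float = 100.0) -> None:
        self.timeout_ms = timeout_ms
        self._stuck_since: Optional[float] = None

    def observe(self, now_ms: float, paused: bool, made_progress: bool) -> bool:
        """Feed one poll sample; return True when the queue counts as stuck."""
        if not paused or made_progress:
            self._stuck_since = None  # healthy sample resets the timer
            return False
        if self._stuck_since is None:
            self._stuck_since = now_ms  # first stuck sample starts the timer
            return False
        return now_ms - self._stuck_since >= self.timeout_ms


wd = PfcWatchdog(timeout_ms=100)
for t in (0, 50, 100):
    print(t, wd.observe(t, paused=True, made_progress=False))
```

When `observe` returns True, the caller would apply whatever recovery the platform supports, such as dropping the queue's traffic or ignoring PAUSE for a while.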

The PFC deadlock failure mode

In some topologies (especially with cyclic dependencies — uncommon but real), PFC can produce a deadlock: A pauses B, B pauses C, C pauses A. Nothing progresses. PFC watchdog is the only thing that breaks it.
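The A → B → C → A condition is a cycle in the "who is pausing whom" graph, so it can be found with an ordinary depth-first search. The sketch below assumes a hypothetical `pauses` mapping collected from the fabric (for example from per-port pause state); the function name is also an assumption.

```python
from typing import Dict, List, Optional


def find_pause_cycle(pauses: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one cycle of nodes pausing each other, or None if there is none.

    pauses[x] lists the nodes that x is currently pausing.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, fully explored
    color: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GREY
        path.append(node)
        for nxt in pauses.get(node, []):
            state = color.get(nxt, WHITE)
            if state == GREY:  # back edge: nxt is already on the current path
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for start in pauses:
        if color.get(start, WHITE) == WHITE:
            cycle = visit(start)
            if cycle:
                return cycle
    return None


print(find_pause_cycle({"A": ["B"], "B": ["C"], "C": ["A"]}))
print(find_pause_cycle({"leaf1": ["spine"], "leaf2": ["spine"], "spine": []}))
```

A cycle-free pause graph is what a loop-free spine-leaf routing design gives you; a cycle in this graph is the deadlock only the watchdog can break.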

In a properly engineered spine-leaf, deadlocks shouldn't happen. But it's worth knowing they exist when reading vendor docs and tuning.

How to configure (Arista / NVIDIA Spectrum / Cisco)

Per-vendor syntax differs, but the conceptual outline is the same on all of them:

Define a QoS map that marks DSCP 26 (or whatever you've standardized on) → priority 3 → lossless queue.
Enable PFC on priority 3 on every interface participating in the fabric.
Set buffer headroom explicitly if not auto-detected by the silicon.
Enable PFC watchdog with a sensible timeout (100 ms typical).
Verify with show priority-flow-control and counters.

If you have switch CLI access, try:

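! Arista-style example. Exact keywords vary by platform and software release,
! so check the vendor configuration guide before applying.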
qos map dscp 26 to traffic-class 3
interface Ethernet1/1
   priority-flow-control on
   priority-flow-control priority 3 no-drop
   priority-flow-control mode auto
   priority-flow-control watchdog action drop timer 100

What you should remember

PFC = link-level PAUSE per priority. Tells the sender to stop on the indicated priority class.
RoCE v2 typically runs on priority 3 (CoS 3 / DSCP 26 by convention). Configure once across the whole fabric.
Headroom buffer must cover in-flight bytes after the PAUSE is sent. Get this wrong, drops happen.
PFC causes head-of-line blocking — pause is per-priority, not per-flow. ECN should mostly prevent PFC from firing in the first place.
PFC storms propagate backward. Enable PFC watchdog with a 100 ms timeout.
PFC is the safety net. If it's firing frequently, your ECN/DCQCN tuning is wrong.

Next: ECN (Explicit Congestion Notification), the early-warning system that should mean PFC almost never fires.
