PFC — Priority Flow Control
PFC (IEEE 802.1Qbb) is the link-level mechanism that makes Ethernet "lossless." When a switch's buffer for a particular priority class fills past a threshold, it sends a PFC PAUSE frame upstream telling the sender to stop sending on that class. The sender stops, the buffer drains, and traffic resumes when the switch sends an "unpause" (a PAUSE with a zero timer) or the pause timer expires.
The whole point: no drops under congestion on the priority class carrying RoCE v2 traffic. RDMA's go-back-N retransmit model doesn't tolerate drops gracefully, so we engineer the fabric not to drop packets on that class.
This page explains how PFC actually works and how to configure it without creating worse problems.
The PAUSE frame
A PFC PAUSE frame is a special Ethernet frame:
- EtherType 0x8808 (MAC Control)
- Opcode 0x0101 (PFC, not the older 802.3x global pause)
- Vector: a 16-bit class-enable field whose low 8 bits map one-to-one to priority classes 0–7. Tells the receiver which priorities to pause.
- Quanta: 8 × 16-bit values, one per priority. Each quantum is 512 bit times of pause duration. At 400G, one quantum ≈ 1.28 ns; a full pause field (65,535 quanta) can stop a sender for up to ~84 μs.
The sender's NIC (or upstream switch port) sees the PAUSE frame and stops transmitting on the indicated priorities as soon as any frame already being serialized completes. It does not stop other priorities: that's the "P" in PFC.
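To make the field layout concrete, here is a minimal Python sketch that packs a PFC PAUSE frame and converts quanta to wall-clock time. The function names are illustrative, not from any library; the destination address is the reserved MAC Control multicast address, and the frame is built without its FCS. Sending it on a real interface (raw sockets, privileges) is out of scope.

```python
import struct

PFC_DST_MAC = bytes.fromhex("0180c2000001")  # reserved MAC Control multicast address
ETHERTYPE_MAC_CONTROL = 0x8808
OPCODE_PFC = 0x0101
QUANTUM_BITS = 512  # one pause quantum = 512 bit times


def build_pfc_frame(src_mac: bytes, quanta_by_priority: dict) -> bytes:
    """Build a PFC PAUSE frame (without FCS) for the given {priority: quanta} map.

    A priority listed with quanta 0 is an explicit "unpause" for that class.
    """
    if len(src_mac) != 6:
        raise ValueError("src_mac must be 6 bytes")
    vector = 0
    quanta = [0] * 8
    for prio, q in quanta_by_priority.items():
        if not 0 <= prio <= 7:
            raise ValueError(f"priority {prio} out of range 0-7")
        if not 0 <= q <= 0xFFFF:
            raise ValueError(f"quanta {q} does not fit in 16 bits")
        vector |= 1 << prio  # set the enable bit for this priority
        quanta[prio] = q
    # Opcode (16 bits), class-enable vector (16 bits), eight 16-bit timers.
    payload = struct.pack("!HH8H", OPCODE_PFC, vector, *quanta)
    header = PFC_DST_MAC + src_mac + struct.pack("!H", ETHERTYPE_MAC_CONTROL)
    # Pad to the 60-byte Ethernet minimum (64 bytes once the FCS is appended).
    return (header + payload).ljust(60, b"\x00")


def pause_duration_ns(quanta: int, link_bps: float) -> float:
    """Pause time in nanoseconds for a given quanta value and link speed."""
    return quanta * QUANTUM_BITS / link_bps * 1e9


frame = build_pfc_frame(bytes(6), {3: 0xFFFF})  # pause priority 3 for the maximum time
print(frame[:34].hex())
print(pause_duration_ns(0xFFFF, 400e9) / 1000, "microseconds at 400G")
```

At 400G the maximum timer works out to roughly 84 μs, matching the figure above; in practice switches refresh or cancel the pause well before the timer runs out.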
Per-priority — why this matters
Old 802.3x pause was global: one PAUSE frame stopped all traffic on the link. That is useless in modern DCs, because management, storage, and control traffic would freeze whenever bulk data congested the link.
PFC is per-priority — you assign each traffic class (CoS / DSCP) to a separate priority queue at the egress, and PFC pauses only the priority that's full. RoCE v2 traffic typically lives on priority 3 by convention (sometimes 4 — depends on the operator).
Practical implication: every switch needs a QoS classification that maps RoCE v2 traffic into the lossless priority, and PFC must be enabled for that priority on every switch as well.
ingress: classify RoCE v2 → priority 3
queue: priority 3 → lossless queue (PFC enabled)
egress: priority 3 → lossless queue (PFC monitored)
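The classification step can be sketched in a few lines of Python. This is only a model of the lookup, not switch code: it assumes RoCE v2 is marked with DSCP 26 and mapped to priority 3 (the convention this page uses), and that everything else falls into a lossy default priority 0. The names are illustrative.

```python
LOSSLESS_DSCP = 26       # assumption: DSCP used for RoCE v2 in this fabric
LOSSLESS_PRIORITY = 3    # assumption: the PFC-enabled priority
DEFAULT_PRIORITY = 0     # assumption: everything else is best-effort
PFC_ENABLED_PRIORITIES = {LOSSLESS_PRIORITY}


def dscp_from_tos(tos: int) -> int:
    """DSCP is the upper 6 bits of the IPv4 ToS byte; the lower 2 bits are ECN."""
    return (tos >> 2) & 0x3F


def classify(tos: int) -> int:
    """Map a packet's ToS byte to a priority queue."""
    if dscp_from_tos(tos) == LOSSLESS_DSCP:
        return LOSSLESS_PRIORITY
    return DEFAULT_PRIORITY


def is_lossless(priority: int) -> bool:
    """True if PFC protects this priority."""
    return priority in PFC_ENABLED_PRIORITIES


# 0x68 = DSCP 26 with ECN bits 00; 0x6A = DSCP 26 with ECT(0) set.
for tos in (0x68, 0x6A, 0x00):
    prio = classify(tos)
    print(hex(tos), "-> priority", prio, "lossless" if is_lossless(prio) else "lossy")
```

Note that the ECN bits do not affect classification, which matters because RoCE v2 traffic under DCQCN is normally sent ECN-capable.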
Buffers and headroom
PFC works by stopping the sender before the buffer overflows. That requires the switch to send the PAUSE frame early enough that even in-flight bytes can fit when they arrive.
- xoff threshold: buffer occupancy at which the switch starts sending PAUSEs.
- headroom: buffer reserved above xoff, sized to absorb the in-flight bytes that arrive after the PAUSE is sent but before the sender stops.
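To first order, headroom is 2 × cable_propagation_delay × link_bandwidth, the round trip a PAUSE needs to take effect. Production sizing adds margin for the sender's response time and for maximum-size frames already in transmission at each end: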
| Cable type | Length | One-way prop delay | Headroom @ 400G |
|---|---|---|---|
| DAC / passive copper | 3 m | ~15 ns | ~1.5 KB |
| AOC / fiber | 30 m | ~150 ns | ~15 KB |
| Fiber across rack | 100 m | ~500 ns | ~50 KB |
Get headroom wrong → drops happen anyway → "lossless" isn't. Most modern switches calculate headroom dynamically based on cable length detection. Older switches require manual configuration.
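The cable-only headroom arithmetic is easy to check with a short Python sketch. It implements just the 2 × delay × bandwidth term from the formula above and deliberately ignores sender response time and in-transmission frames, so treat the results as a floor rather than a recommended setting. The function name is illustrative.

```python
def headroom_bytes(one_way_delay_ns: float, link_gbps: float) -> float:
    """Cable-only headroom: 2 x one-way propagation delay x link bandwidth."""
    round_trip_ns = 2 * one_way_delay_ns
    bits_in_flight = round_trip_ns * link_gbps  # ns x Gb/s = bits
    return bits_in_flight / 8


cables = [
    ("DAC / passive copper, 3 m", 15),
    ("AOC / fiber, 30 m", 150),
    ("Fiber, 100 m", 500),
]
for name, delay_ns in cables:
    kb = headroom_bytes(delay_ns, 400) / 1000
    print(f"{name}: {kb:.1f} KB at 400G")
```

The three results reproduce the table (1.5 KB, 15 KB, 50 KB). Changing `link_gbps` shows how headroom scales linearly with link speed for the same cable.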
Head-of-line blocking — PFC's biggest sin
PFC works on priority, not on destination. When PFC pauses priority 3 on an ingress port, all priority-3 traffic from that ingress is paused — even if it's destined for a port that's not congested.
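So if Server A's NIC is sending PFC PAUSEs to its ToR (because Server A's receive buffer is full under incast), the ToR stops sending priority-3 traffic to A. Frames destined for A then pile up in the ToR's own buffers, and the ToR in turn pauses the ports feeding it, uplinks and server ports alike. Other servers connected to that ToR can't send their priority-3 traffic to anyone until A's pause clears, even toward destinations that are not congested. That is head-of-line blocking on the ingress.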
This compounds across the fabric — the ToR PAUSEs the spine, the spine PAUSEs other ToRs, etc. PFC storms propagate backward through the network.
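A toy Python model can show how far a single pause can spread. It is a worst-case sketch: it assumes congestion persists long enough for every buffer along the way to fill, and the `feeds` map (which devices send priority-3 traffic to which) is a made-up example topology, not a measurement.

```python
from collections import deque


def paused_upstream(feeds: dict, origin: str) -> dict:
    """Return {device: hops} for every device that can end up paused
    if `origin` keeps asserting PFC. `feeds[x]` lists devices sending to x."""
    hops = {origin: 0}
    queue = deque([origin])
    while queue:
        node = queue.popleft()
        for upstream in feeds.get(node, []):
            if upstream not in hops:  # first time the pause reaches this device
                hops[upstream] = hops[node] + 1
                queue.append(upstream)
    del hops[origin]  # the origin sends PAUSE; it is not itself paused
    return hops


# Hypothetical topology: serverA is the incast victim.
feeds = {
    "serverA": ["tor1"],
    "tor1": ["spine1", "serverB"],
    "spine1": ["tor2"],
    "tor2": ["serverC"],
}
print(paused_upstream(feeds, "serverA"))
```

In this example serverB and serverC end up paused even though neither may be sending to serverA, which is exactly the collateral damage described above.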
The mitigation:
- ECN does the real work: by the time PFC fires, you've already lost. DCQCN should have dialed back rates before PFC was needed. Aim for "PFC fires rarely."
- Per-port PFC counters: monitor pause frames sent and pause frames received per port. Sudden spikes = a hot spot at or downstream of that port.
- PFC watchdog: most switches can detect "stuck" PAUSE conditions and forcibly clear them after a timeout (typically 100 ms). Saves the fabric from a runaway storm.
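Counter monitoring can be as simple as turning cumulative pause-frame counters into rates and flagging outliers. The Python sketch below assumes you already collect `(timestamp_seconds, counter)` samples per port by some means (SNMP, gNMI, or CLI scraping, none of which is shown), and the `factor` and `floor` thresholds are arbitrary starting points to tune.

```python
from statistics import median


def pause_rates(samples: list) -> list:
    """Convert cumulative counter samples [(t_seconds, count), ...]
    into pause frames per second for each interval."""
    return [
        (c1 - c0) / (t1 - t0)
        for (t0, c0), (t1, c1) in zip(samples, samples[1:])
    ]


def spike_intervals(rates: list, factor: float = 10.0, floor: float = 100.0) -> list:
    """Indices of intervals whose rate exceeds both an absolute floor
    and `factor` times the median rate."""
    baseline = median(rates)
    return [i for i, r in enumerate(rates) if r > floor and r > factor * baseline]


# Hypothetical samples for one port, polled every 10 s.
samples = [(0, 0), (10, 50), (20, 100), (30, 90100), (40, 90150)]
rates = pause_rates(samples)
print(rates)
print("spike in intervals:", spike_intervals(rates))
```

Using the median as the baseline keeps a single burst from inflating the reference level, and the absolute floor avoids alerting on ports that are almost idle.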
The PFC deadlock failure mode
In some topologies (especially with cyclic dependencies — uncommon but real), PFC can produce a deadlock: A pauses B, B pauses C, C pauses A. Nothing progresses. PFC watchdog is the only thing that breaks it.
In a properly engineered spine-leaf, deadlocks shouldn't happen. But it's worth knowing they exist when reading vendor docs and tuning.
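One way to reason about deadlock risk is to write down which device can pause which and look for a cycle. The Python sketch below does a depth-first search over such a "pauses" graph; the graph itself is something you would derive by hand from your topology and routing, and the device names here are hypothetical.

```python
from typing import Optional


def find_pause_cycle(pauses: dict) -> Optional[list]:
    """Return one cycle in the 'X can pause Y' graph as a list of devices
    (first device repeated at the end), or None if the graph is acyclic."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, fully explored
    nodes = set(pauses) | {n for targets in pauses.values() for n in targets}
    colour = {n: WHITE for n in nodes}
    path = []

    def visit(node):
        colour[node] = GREY
        path.append(node)
        for nxt in pauses.get(node, []):
            if colour[nxt] == GREY:  # back edge: nxt is already on the path
                return path[path.index(nxt):] + [nxt]
            if colour[nxt] == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        colour[node] = BLACK
        return None

    for n in sorted(nodes):
        if colour[n] == WHITE:
            cycle = visit(n)
            if cycle:
                return cycle
    return None


print(find_pause_cycle({"A": ["B"], "B": ["C"], "C": ["A"]}))        # cyclic dependency
print(find_pause_cycle({"leaf1": ["spine1"], "leaf2": ["spine1"]}))  # no cycle
```

The first call reports the A, B, C loop from the example above; the second returns `None`. The recursive search is fine for graphs of a few hundred devices, which is all a hand-built dependency map is likely to contain.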
How to configure (Arista / NVIDIA Spectrum / Cisco)
Per-vendor specifics differ, but the conceptual outline is the same on all of them:
- Define a QoS map that marks DSCP 26 (or whatever you've standardized on) → priority 3 → lossless queue.
- Enable PFC on priority 3 on every interface participating in the fabric.
- Set buffer headroom explicitly if not auto-detected by the silicon.
- Enable PFC watchdog with a sensible timeout (100 ms typical).
- Verify with show priority-flow-control and counters.
If you have switch CLI access, try:
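! Arista-style example. Illustrative only: exact keywords (notably for PFC mode and
! the watchdog) vary by platform and software release, so check the vendor reference.
qos map dscp 26 to traffic-class 3
interface Ethernet1/1
   priority-flow-control on
   priority-flow-control priority 3 no-drop
   priority-flow-control mode auto
   priority-flow-control watchdog action drop timer 100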
What you should remember
- PFC = link-level PAUSE per priority. Tells the sender to stop on the indicated priority class.
- RoCE v2 typically runs on priority 3 (CoS 3 / DSCP 26 by convention). Configure once across the whole fabric.
- Headroom buffer must cover in-flight bytes after the PAUSE is sent. Get this wrong, drops happen.
- PFC causes head-of-line blocking — pause is per-priority, not per-flow. ECN should mostly prevent PFC from firing in the first place.
- PFC storms propagate backward. Enable PFC watchdog with a 100 ms timeout.
- PFC is the safety net. If it's firing frequently, your ECN/DCQCN tuning is wrong.
Next: ECN (Explicit Congestion Notification), the early-warning system that should mean PFC almost never fires.