PFC — Priority Flow Control
PFC (IEEE 802.1Qbb) is the link-level mechanism that makes Ethernet "lossless." When a switch's ingress buffer for a particular priority class fills past a threshold, the switch sends a PAUSE frame upstream telling the sender to stop sending on that class. The sender stops, the buffer drains, the switch sends an "unpause" (or the pause timer expires), and traffic resumes.
The whole point: no drops under congestion on the priority class carrying RoCE v2 traffic. RDMA's go-back-N retransmit model doesn't tolerate drops gracefully, so we engineer the fabric not to drop packets on that class.
This page explains how PFC actually works and how to configure it without creating worse problems.
- Decode a PFC PAUSE frame — EtherType, opcode, the 8-bit priority vector, and the quanta math (1 quantum ≈ 1.28 ns at 400G).
- Size the headroom buffer: 2 × propagation × bandwidth; know why 100 m fiber needs ~50 KB at 400G and why DAC needs ~1.5 KB.
- Explain head-of-line blocking and PFC storms, and why "PFC fires rarely" is the entire goal of your ECN tuning.
- Configure PFC end-to-end — DSCP-to-priority mapping, per-interface enable, watchdog with 100 ms timeout.
The PAUSE frame
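A PFC PAUSE frame is a fixed 64-byte Ethernet frame, the minimum legal frame size. Its layout, byte by byte:

| Bytes | Field | Value |
|---|---|---|
| 0–5 | Destination MAC | 01:80:C2:00:00:01 (reserved, link-local) |
| 6–11 | Source MAC | the sending port's MAC |
| 12–13 | EtherType | 0x8808 (MAC Control) |
| 14–15 | Opcode | 0x0101 (PFC) |
| 16–17 | Priority-enable vector | low 8 bits, one bit per priority |
| 18–33 | time[0] … time[7] | eight 16-bit pause durations, in quanta |
| 34–59 | Padding | zeros |
| 60–63 | FCS | CRC-32 |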
Two fields carry all the meaning:
- Priority-enable vector: a bitmask. Bit i set means "the time[i] value below is valid; act on priority i." RoCE v2 sets bit 3.
- time[i] quanta: how long to pause priority i. One quantum = 512 bit-times. At 400G that's ≈ 1.28 ns, so the max value (65535) stops a sender for ≈ 84 μs. time[i] = 0 is the un-pause (XON) signal: resume immediately. While congestion persists the switch sends refresh PAUSE frames before the timer expires; the moment the queue drains it sends time = 0.
The receiving port's MAC acts on this directly — it stops dequeuing the named priority for the named duration. No software runs. There is no ACK, no sequence number, no retransmit, no connection state. It's the closest thing Ethernet has to a hardware reflex.
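To make the two fields concrete, here is a minimal Python sketch that builds and parses a PFC frame and converts quanta to wall-clock time. The opcode 0x0101 and the field offsets follow the standard MAC Control layout; the function names and the source MAC are made up for illustration, and the 4-byte FCS is left out because the MAC hardware appends it.

```python
import struct

PFC_DST_MAC = bytes.fromhex("0180c2000001")  # reserved link-local address
MAC_CONTROL_ETHERTYPE = 0x8808
PFC_OPCODE = 0x0101
QUANTUM_BITS = 512  # one pause quantum = 512 bit-times


def build_pfc_frame(src_mac: bytes, pause_quanta: dict[int, int]) -> bytes:
    """Build a PFC frame without the trailing 4-byte FCS (60 bytes)."""
    vector = 0
    times = [0] * 8
    for priority, quanta in pause_quanta.items():
        vector |= 1 << priority      # mark time[priority] as valid
        times[priority] = quanta     # 0 means XON (resume immediately)
    header = PFC_DST_MAC + src_mac + struct.pack(
        "!HHH", MAC_CONTROL_ETHERTYPE, PFC_OPCODE, vector
    )
    body = struct.pack("!8H", *times)
    return (header + body).ljust(60, b"\x00")  # zero padding up to 60 bytes


def parse_pfc_frame(frame: bytes) -> dict[int, int]:
    """Return {priority: quanta} for every priority whose enable bit is set."""
    ethertype, opcode, vector = struct.unpack_from("!HHH", frame, 12)
    if (frame[:6] != PFC_DST_MAC
            or ethertype != MAC_CONTROL_ETHERTYPE
            or opcode != PFC_OPCODE):
        raise ValueError("not a PFC frame")
    times = struct.unpack_from("!8H", frame, 18)
    return {p: times[p] for p in range(8) if vector & (1 << p)}


def pause_seconds(quanta: int, link_bps: float) -> float:
    """Convert pause quanta to seconds on a link of the given speed."""
    return quanta * QUANTUM_BITS / link_bps


# Hypothetical source MAC; pause priority 3 for the maximum duration.
frame = build_pfc_frame(bytes.fromhex("020000000001"), {3: 0xFFFF})
paused = parse_pfc_frame(frame)
max_pause_us = pause_seconds(paused[3], 400e9) * 1e6
```

Passing `{3: 0}` instead produces the XON frame: bit 3 stays set in the vector and time[3] is zero.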
What layer is PFC? (not the one you reach for)
Network engineers reflexively file "flow control" under TCP. PFC is nowhere near TCP — it isn't even IP.
PFC is a Layer 2 mechanism: IEEE 802.1Qbb, living in the Ethernet MAC Control sublayer. Look back at that frame — there is no IP header, no UDP or TCP header, no port, no sequence number, no connection. The destination MAC 01:80:C2:00:00:01 is a reserved address that the directly-attached link partner consumes and never forwards. A PAUSE frame physically cannot be routed; it dies at the other end of the wire it was sent on.
That single-hop scope is the whole mental model:
| | TCP flow control (rwnd) | ECN / DCQCN | PFC |
|---|---|---|---|
| Layer | L4 (transport) | L3 mark + NIC reaction | L2 (MAC control) |
| Scope | end-to-end, per connection | end-to-end, per QP/flow | one hop, link-local |
| Granularity | per TCP connection | per flow / QP | per priority class (0–7) |
| Carries | window size in the ACK | CE bit, then a CNP | a pause duration in quanta |
| Who reacts | the sending TCP stack | the sender NIC's rate limiter | the upstream port's MAC |
| Reaction time | ~1 RTT | ~1 RTT | microseconds, same link |
So when RoCE congestion builds, the backpressure walks the fabric hop by hop: the congested switch PAUSEs its immediate upstream, that switch's buffer then fills and it PAUSEs its upstream, and so on. Each link makes its own local decision with zero knowledge of flows, connections, or endpoints. That hop-by-hop propagation is exactly why a single hot port can become a fabric-wide PFC storm (below). It is also why PFC is enabled only on the RoCE priority, where its job is to prevent drops so that the NIC's end-to-end go-back-N retransmit almost never has to fire.
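The hop-by-hop walk can be illustrated with a toy model. The sketch below is a deliberately simplified simulation, not a model of any real switch: buffers are counted in abstract cells, every un-paused link moves one cell per tick, and a hop asserts PAUSE whenever its own occupancy reaches a hypothetical `xoff` value.

```python
def simulate_backpressure(hops: int, xoff: int, ticks: int,
                          drain_per_tick: int = 0) -> list[int]:
    """Return, per hop, the tick at which it first paused its upstream.

    Index 0 is the hop nearest the sender; the last index feeds the
    congested port. A value of -1 means that hop never sent PAUSE.
    """
    occupancy = [0] * hops
    first_pause = [-1] * hops
    for tick in range(ticks):
        # Each hop decides locally, from its own occupancy only.
        pausing = [occ >= xoff for occ in occupancy]
        for i, is_pausing in enumerate(pausing):
            if is_pausing and first_pause[i] == -1:
                first_pause[i] = tick
        # The congested egress drains slowly, or not at all.
        occupancy[-1] = max(0, occupancy[-1] - drain_per_tick)
        # Move one cell across each link, last link first, honoring PAUSE.
        for i in range(hops - 2, -1, -1):
            if occupancy[i] > 0 and not pausing[i + 1]:
                occupancy[i] -= 1
                occupancy[i + 1] += 1
        # The sender injects one cell unless hop 0 has paused it.
        if not pausing[0]:
            occupancy[0] += 1
    return first_pause


stuck = simulate_backpressure(hops=3, xoff=4, ticks=40)
healthy = simulate_backpressure(hops=3, xoff=4, ticks=40, drain_per_tick=1)
```

With a fully stuck egress, the hop next to the congested port pauses first and each hop further upstream follows a few ticks later. When the egress drains as fast as traffic arrives, no hop ever pauses.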
Per-priority — why this matters
Old 802.3x pause was global — one PAUSE frame stopped all traffic on the link. Useless in modern DCs because management / storage / control would freeze when bulk data congested.
PFC is per-priority — you assign each traffic class (CoS / DSCP) to a separate priority queue at the egress, and PFC pauses only the priority that's full. RoCE v2 traffic typically lives on priority 3 by convention (sometimes 4 — depends on the operator).
Practical implication: every switch needs a QoS classification that maps RoCE v2 traffic into the lossless priority, and it needs PFC enabled for that priority.
ingress: classify RoCE v2 → priority 3
queue: priority 3 → lossless queue (PFC enabled)
egress: priority 3 → lossless queue (PFC monitored)
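As a rough illustration of that pipeline, the Python sketch below classifies a packet by DSCP and reports whether it lands in the lossless class. The table and function names are hypothetical; DSCP 26 and priority 3 are simply the convention this page uses, not something the standard mandates.

```python
LOSSLESS_PRIORITY = 3
ROCE_DSCP = 26  # the DSCP value this page uses for RoCE v2

# Hypothetical classifier table: DSCP -> priority.
# Anything not listed falls into best-effort priority 0.
DSCP_TO_PRIORITY = {ROCE_DSCP: LOSSLESS_PRIORITY}
PFC_ENABLED = {LOSSLESS_PRIORITY}  # only one lossless priority


def classify(dscp: int) -> tuple[int, bool]:
    """Return (priority, is_lossless) for a packet's DSCP value."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP is a 6-bit field")
    priority = DSCP_TO_PRIORITY.get(dscp, 0)
    return priority, priority in PFC_ENABLED


roce = classify(26)         # RoCE v2 traffic
other = classify(0)         # default traffic
dscp_bits = format(ROCE_DSCP, "06b")
```

DSCP 26 is `011010` in binary, and its top three bits (`011`) equal 3, which is one reason the DSCP 26 / priority 3 pairing is easy to keep consistent across devices.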
Buffers and headroom
PFC works by stopping the sender before the buffer overflows. That requires the switch to send the PAUSE frame early enough that even in-flight bytes can fit when they arrive.
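- xoff threshold: buffer occupancy at which the switch starts sending PAUSEs.
- headroom: buffer reserved above xoff, sized to absorb the in-flight bytes that arrive after the PAUSE is sent but before the sender stops.

To first order, headroom is 2 × cable_propagation_delay × link_bandwidth: the bytes in flight over one cable round trip. Production sizing adds a maximum-size frame already in transmission at each end plus the sender's PAUSE response time, so treat the figures below as the cable term only: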
| Cable type | Length | One-way prop delay | Headroom @ 400G |
|---|---|---|---|
| DAC / passive copper | 3 m | ~15 ns | ~1.5 KB |
| AOC / fiber | 30 m | ~150 ns | ~15 KB |
| Fiber across rack | 100 m | ~500 ns | ~50 KB |
Get headroom wrong → drops happen anyway → "lossless" isn't. Most modern switches calculate headroom dynamically based on cable length detection. Older switches require manual configuration.
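The table values can be reproduced with a few lines of Python. This is a sketch of the cable term only, assuming roughly 5 ns of propagation delay per metre (the figure implied by the table); the function names are made up for illustration.

```python
def prop_delay_ns(length_m: float, ns_per_m: float = 5.0) -> float:
    """One-way propagation delay, assuming about 5 ns per metre of cable."""
    return length_m * ns_per_m


def headroom_bytes(one_way_delay_ns: float, link_gbps: float) -> float:
    """First-order headroom: bytes in flight over one cable round trip."""
    round_trip_bits = 2 * one_way_delay_ns * link_gbps  # ns x Gb/s = bits
    return round_trip_bits / 8


dac = headroom_bytes(prop_delay_ns(3), 400)      # 3 m DAC at 400G
aoc = headroom_bytes(prop_delay_ns(30), 400)     # 30 m AOC at 400G
fiber = headroom_bytes(prop_delay_ns(100), 400)  # 100 m fiber at 400G
```

Because the result scales linearly with both length and speed, doubling either one doubles the headroom each lossless priority needs on that port.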
This headroom math exists only because Ethernet has no credit mechanism. InfiniBand's credit-based flow control never lets the sender oversubscribe the receiver in the first place — so there's no overflow to catch and no headroom to size. PFC is Ethernet reacting to congestion after it starts; credits are IB preventing it before a single byte is sent. Same goal, opposite philosophy.
Head-of-line blocking — PFC's biggest sin
PFC works on priority, not on destination. When PFC pauses priority 3 on an ingress port, all priority-3 traffic from that ingress is paused — even if it's destined for a port that's not congested.
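So if Server A's NIC is sending PFC PAUSEs to its ToR (because A's receive buffer is full under incast), the ToR stops transmitting priority 3 toward A. Traffic for A backs up inside the ToR until its ingress buffers cross xoff, and the ToR in turn PAUSEs the ports feeding it. Other servers on that ToR whose ports get paused can't send their priority-3 traffic to anyone until A's pause clears, even to destinations that are idle. That is head-of-line blocking on the ingress.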
This compounds across the fabric — the ToR PAUSEs the spine, the spine PAUSEs other ToRs, etc. PFC storms propagate backward through the network.
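A small sketch makes the "innocent victim" effect visible. The flow names, port names, and the `Flow` type below are all hypothetical; the point is only that the pause decision sees a (port, priority) pair and nothing about destinations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    name: str
    ingress_port: str
    priority: int
    dest: str


def blocked_flows(flows, paused, congested_dests):
    """Split flows stopped by PFC into culprits and innocent victims.

    `paused` is a set of (ingress_port, priority) pairs, which is all
    the information PFC acts on.
    """
    culprits, victims = [], []
    for flow in flows:
        if (flow.ingress_port, flow.priority) in paused:
            if flow.dest in congested_dests:
                culprits.append(flow.name)
            else:
                victims.append(flow.name)
    return culprits, victims


flows = [
    Flow("B->A", "uplink1", 3, "serverA"),  # feeds the incast
    Flow("B->C", "uplink1", 3, "serverC"),  # same port and priority, idle destination
    Flow("mgmt", "uplink1", 0, "serverC"),  # different priority, keeps flowing
]
culprits, victims = blocked_flows(flows, {("uplink1", 3)}, {"serverA"})
```

Here the flow to serverC is frozen purely because it shares a port and priority with the flow feeding the incast, while the priority-0 flow on the same port is untouched.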
The mitigation:
- ECN does the real work — by the time PFC fires, you've already lost. DCQCN should have dialed back rates before PFC was needed. Aim for "PFC fires rarely."
- Per-port PFC counters: monitor pause frames sent and pause frames received per port. Sudden spikes = a hot spot somewhere upstream.
- PFC watchdog: most switches can detect "stuck" PAUSE conditions and forcibly clear them after a timeout (typically 100 ms). Saves the fabric from a runaway storm.
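Counter monitoring usually comes down to turning two snapshots into a rate and flagging outliers. The sketch below assumes you already have per-port PAUSE counters as plain dictionaries (how you collect them is vendor-specific and not shown); the port names, counter values, and threshold are invented for the example, and counter wrap is ignored.

```python
def pause_rates(before: dict[str, int], after: dict[str, int],
                interval_s: float) -> dict[str, float]:
    """PAUSE frames per second for each port, from two counter snapshots."""
    return {
        port: (after[port] - before.get(port, 0)) / interval_s
        for port in after
    }


def hot_ports(rates: dict[str, float], threshold_per_s: float) -> list[str]:
    """Ports whose PAUSE rate exceeds the threshold, worst first."""
    over = [port for port, rate in rates.items() if rate > threshold_per_s]
    return sorted(over, key=lambda port: rates[port], reverse=True)


# Invented snapshots taken 10 seconds apart.
before = {"Ethernet1/1": 100, "Ethernet1/2": 40}
after = {"Ethernet1/1": 1440, "Ethernet1/2": 42}
rates = pause_rates(before, after, interval_s=10)
suspects = hot_ports(rates, threshold_per_s=10)
```

A sensible threshold depends on your fabric's baseline; since the goal is for PFC to fire rarely, even a low sustained rate on one port is worth a look.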
Common mistake: enabling PFC on multiple priorities "just to be safe." Every extra lossless priority is another deadlock surface and another headroom budget you have to size. Only the priority that carries RoCE v2 traffic should be PFC-enabled; that's priority 3 (CoS 3 / DSCP 26) in most fabrics. Storage, management, control: leave them lossy. PFC on all 8 priorities is a 2008-era HPC config that nobody runs in production today.
The PFC deadlock failure mode
In some topologies (especially with cyclic dependencies — uncommon but real), PFC can produce a deadlock: A pauses B, B pauses C, C pauses A. Nothing progresses. PFC watchdog is the only thing that breaks it.
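The A-pauses-B-pauses-C-pauses-A condition is a cycle in a directed "who pauses whom" graph, so a standard depth-first search can find it. The sketch below is a generic cycle detector, not a description of how any switch watchdog works; the graph is a hypothetical dictionary you would have to build from your own topology and pause state.

```python
from typing import Optional


def find_pause_cycle(pauses: dict[str, list[str]]) -> Optional[list[str]]:
    """Return one cycle in the 'X pauses Y' graph, or None if it is acyclic."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    color = {node: WHITE for node in pauses}
    path: list[str] = []

    def visit(node: str) -> Optional[list[str]]:
        color[node] = GREY
        path.append(node)
        for nxt in pauses.get(node, []):
            state = color.get(nxt, WHITE)
            if state == GREY:
                # nxt is already on the current path: that closes a cycle.
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in list(pauses):
        if color[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


deadlock = find_pause_cycle({"A": ["B"], "B": ["C"], "C": ["A"]})
spine_leaf = find_pause_cycle({"leaf1": ["spine1"], "spine1": ["leaf2"], "leaf2": []})
```

The first call returns the cycle through A, B, and C. The second returns `None`, which matches the expectation that a clean spine-leaf pause chain has no circular dependency.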
In a properly engineered spine-leaf, deadlocks shouldn't happen. But it's worth knowing they exist when reading vendor docs and tuning.
How to configure (Arista / Cisco / Juniper / NVIDIA)
Per-vendor syntax differs, but the conceptual config is the same on all of them:
- Define a QoS map that maps DSCP 26 (or whatever you've standardized on) → priority 3 → lossless queue.
- Enable PFC on priority 3 on every interface participating in the fabric.
- Set buffer headroom explicitly if not auto-detected by the silicon.
- Enable PFC watchdog with a sensible timeout (100 ms typical).
- Verify with show priority-flow-control and counters.
Here's that recipe on all four major stacks. Pick your vendor — the keywords change, the priority-3 no-drop intent does not:
- Arista EOS
- Cisco NX-OS
- Juniper Junos
- NVIDIA Spectrum
Arista EOS:

qos map dscp 26 to traffic-class 3
interface Ethernet1/1
   priority-flow-control on
   priority-flow-control priority 3 no-drop
   priority-flow-control mode auto
priority-flow-control watchdog action drop timer 100
Cisco NX-OS:

class-map type qos match-any ROCE
  match dscp 26
policy-map type qos ROCE-CLASSIFY
  class ROCE
    set qos-group 3
!
class-map type network-qos ROCE-NQ
  match qos-group 3
policy-map type network-qos ROCE-NQ-POLICY
  class type network-qos ROCE-NQ
    pause pfc-cos 3
system qos
  service-policy type network-qos ROCE-NQ-POLICY
!
interface Ethernet1/1
  service-policy type qos input ROCE-CLASSIFY
  priority-flow-control mode on

Use mode on, not auto: auto-negotiation is inconsistent across NX-OS firmware. PFC no-drop is declared with pause pfc-cos 3 in the network-qos policy.
Juniper Junos:

# DSCP 26 = code-point 011010 → no-loss forwarding-class (queue 3)
set class-of-service classifiers dscp ROCE forwarding-class no-loss loss-priority low code-points 011010
set class-of-service forwarding-classes class no-loss queue-num 3
set class-of-service interfaces et-0/0/0 unit 0 classifiers dscp ROCE
# PFC on 802.1p code-point 011 (priority 3)
set class-of-service congestion-notification-profile ROCE-PFC input ieee-802.1 code-point 011 pfc
set class-of-service interfaces et-0/0/0 congestion-notification-profile ROCE-PFC

Junos hangs PFC off the 802.1p code-point, so make sure your DSCP classifier and the congestion-notification-profile both land on priority 3.
NVIDIA Spectrum:

nv set qos roce mode lossless
nv set qos roce pfc priority 3
nv set qos roce pfc watchdog timer 100
nv config apply

nv set qos roce configures the whole DSCP-26 → priority-3 → PFC pipeline in a handful of commands with AI-fabric defaults.
These are reference snippets — exact keywords are version- and platform-sensitive (NX-OS queue model, Junos platform). The full multi-vendor walk-through with per-vendor caveats and doc links is on Configure the Fabric.
See a PFC storm forming — and triage it
Real triage walk-through: PagerDuty fires "training slow", you look at PFC counters per port and find the storm root cause in 90 seconds.
Highlights: spotting one port pulsing 134 PAUSE frames/sec, tracing it to the host behind it, finding a hung GPU. Backpressure climbed all the way from the consumer to the leaf.
💡 What you should remember
| # | | Concept | Why it matters |
|---|---|---|---|
| 1 | ⏸️ | PFC = link-level PAUSE per priority | The MAC Control frame stops the sender on the indicated priority class. Other classes keep flowing. |
| 2 | 3️⃣ | RoCE v2 lives on priority 3 | CoS 3 / DSCP 26 by convention. Configure it once and apply identically across every switch in the fabric. |
| 3 | 📦 | Headroom must cover in-flight bytes | 2 × cable_prop × bandwidth. 100 m fiber @ 400G ≈ 50 KB. Get this wrong and "lossless" isn't. |
| 4 | 🚧 | PFC = head-of-line blocking | Pause is per-priority, not per-flow. Innocent flows freeze with the bad ones. ECN should make PFC almost never fire. |
| 5 | 🌀 | Storms propagate backward | ToR pauses spine, spine pauses other ToRs. Enable PFC watchdog (100 ms timeout) on every interface. |
| 6 | 🪢 | PFC is the safety net | If it's firing in steady state, ECN/DCQCN tuning is broken. Treat frequent PAUSE counters as a tuning emergency. |
| 7 | 🧭 | PFC is L2 — not TCP, not even IP | A 64-byte MAC Control frame (EtherType 0x8808, dst 01:80:C2:00:00:01). No port, no connection, can't be routed. It pauses one hop; backpressure walks the fabric hop by hop. |
Next: ECN — Explicit Congestion Notification → — the early-warning system that should mean PFC almost never fires.