PFC — Priority Flow Control

PFC (IEEE 802.1Qbb) is the link-level mechanism that makes Ethernet "lossless." When a switch's ingress buffer for a particular priority class fills past a threshold, the switch sends a PAUSE frame upstream telling the sender to stop sending on that class. The sender stops, the buffer drains, the switch sends an "unpause" (or the pause timer expires), and traffic resumes.

The whole point: no drops under congestion on the priority class carrying RoCE v2 traffic. RDMA's go-back-N retransmit model doesn't tolerate drops gracefully (one lost packet forces retransmission of everything sent after it), so we engineer the fabric not to drop RoCE packets in the first place.

Figure: PFC mechanics across three nodes (sender NIC, middle switch, receiver NIC). Traffic flows left to right. When the middle switch's buffer crosses the XOFF watermark on priority 3, it sends a PFC PAUSE frame upstream on that priority only. The sender stops on that class for the pause quanta. Bytes in flight during the round trip must fit in reserved headroom or they drop anyway.
PFC pauses *after* congestion starts. The bytes already in flight need headroom buffer to land in — that's the operator's sizing problem.

This page explains how PFC actually works and how to configure it without creating worse problems.

After this page, you'll be able to
  1. Decode a PFC PAUSE frame — EtherType, opcode, the 8-bit priority vector, and the quanta math (1 quantum ≈ 1.28 ns at 400G).
  2. Size the headroom buffer: 2 × propagation × bandwidth; know why 100 m fiber needs ~50 KB at 400G and why DAC needs ~1.5 KB.
  3. Explain head-of-line blocking and PFC storms — and why "PFC fires rarely" is the entire goal of your ECN tuning.
  4. Configure PFC end-to-end — DSCP-to-priority mapping, per-interface enable, watchdog with 100 ms timeout.

The PAUSE frame

A PFC PAUSE frame is a fixed 64-byte Ethernet frame — the minimum legal frame size. Here it is, byte by byte:

The 64-byte PFC PAUSE frame laid out field by field: Destination MAC 6 B (01:80:C2:00:00:01, a reserved multicast), Source MAC 6 B, EtherType 2 B (0x8808), Opcode 2 B (0x0101), priority-enable vector 2 B, time[0..7] 16 B, Pad 26 B, FCS 4 B. A zoom on the priority-enable vector shows the low 8 bits with bit 3 set, meaning act on priority 3 (RoCE), and its time[3] carries the pause duration in 512-bit-time quanta with time=0 meaning resume.
All Layer 2 — no IP, no port, no connection. The reserved destination MAC is consumed by the link partner and never forwarded, so a PAUSE pauses exactly one hop.
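
The layout above can be sketched in Python with the standard `struct` module. This is a minimal illustration, not a production encoder: the names `build_pfc_frame` and `parse_pfc_frame` are made up for this example, and the 4-byte FCS is left out on the assumption that the MAC hardware appends it, so the result is 60 bytes rather than 64.

```python
import struct

PFC_DST_MAC = bytes.fromhex("0180c2000001")  # reserved MAC Control multicast
ETHERTYPE_MAC_CONTROL = 0x8808
OPCODE_PFC = 0x0101


def build_pfc_frame(src_mac: bytes, quanta_by_priority: dict) -> bytes:
    """Build the 60 bytes that precede the FCS (FCS assumed added by hardware)."""
    enable_vector = 0
    times = [0] * 8
    for priority, quanta in quanta_by_priority.items():
        enable_vector |= 1 << priority      # mark time[priority] as valid
        times[priority] = quanta            # 0 means resume (XON)
    header = PFC_DST_MAC + src_mac + struct.pack(
        "!HH", ETHERTYPE_MAC_CONTROL, OPCODE_PFC
    )
    body = struct.pack("!H8H", enable_vector, *times)
    return header + body + bytes(26)        # 26 bytes of zero padding


def parse_pfc_frame(frame: bytes) -> dict:
    """Return {priority: quanta} for every priority whose enable bit is set."""
    ethertype, opcode, enable_vector = struct.unpack_from("!HHH", frame, 12)
    if ethertype != ETHERTYPE_MAC_CONTROL or opcode != OPCODE_PFC:
        raise ValueError("not a PFC frame")
    times = struct.unpack_from("!8H", frame, 18)
    return {p: times[p] for p in range(8) if enable_vector & (1 << p)}


# Pause priority 3 for the maximum duration, then decode it again.
frame = build_pfc_frame(bytes(6), {3: 0xFFFF})
print(len(frame), parse_pfc_frame(frame))
```

Calling `build_pfc_frame(src, {3: 0})` produces the matching un-pause frame: the enable bit for priority 3 stays set and the time field is zero.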

Two fields carry all the meaning:

  • Priority-enable vector: a bitmask. Bit i set means "the time[i] value below is valid, act on priority i." With RoCE v2 on priority 3, the pausing port sets bit 3.
  • time[i] quanta: how long to pause priority i. One quantum = 512 bit-times. At 400G that's ≈ 1.28 ns, so the max value (65535) stops a sender for ≈ 84 μs. time[i] = 0 is the un-pause (XON) signal: resume immediately. While congestion persists the switch sends refresh PAUSE frames before the timer expires; the moment the queue drains it sends time = 0.

The receiving port's MAC acts on this directly — it stops dequeuing the named priority for the named duration. No software runs. There is no ACK, no sequence number, no retransmit, no connection state. It's the closest thing Ethernet has to a hardware reflex.
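
The quanta arithmetic is easy to check by hand or in a few lines of Python. The helper names below are illustrative only; the single fact they rely on is the one stated above, that one quantum is 512 bit-times at the link's line rate.

```python
def quantum_seconds(link_bps: float) -> float:
    """Duration of one pause quantum (512 bit-times) at the given line rate."""
    return 512 / link_bps


def pause_duration_seconds(quanta: int, link_bps: float) -> float:
    """How long a time[i] value pauses the sender; quanta is a 16-bit field."""
    if not 0 <= quanta <= 0xFFFF:
        raise ValueError("quanta must fit in 16 bits")
    return quanta * quantum_seconds(link_bps)


for gbps in (100, 400):
    bps = gbps * 1e9
    print(
        f"{gbps}G: quantum = {quantum_seconds(bps) * 1e9:.2f} ns, "
        f"max pause = {pause_duration_seconds(0xFFFF, bps) * 1e6:.1f} us"
    )
```

Because the quantum shrinks as the link gets faster, the same 16-bit value buys less wall-clock pause time at 400G than at 100G, which is one reason refresh PAUSE frames matter under sustained congestion.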


What layer is PFC? (not the one you reach for)

Network engineers reflexively file "flow control" under TCP. PFC is nowhere near TCP — it isn't even IP.

PFC is a Layer 2 mechanism: IEEE 802.1Qbb, living in the Ethernet MAC Control sublayer. Look back at that frame — there is no IP header, no UDP or TCP header, no port, no sequence number, no connection. The destination MAC 01:80:C2:00:00:01 is a reserved address that the directly-attached link partner consumes and never forwards. A PAUSE frame physically cannot be routed; it dies at the other end of the wire it was sent on.

That single-hop scope is the whole mental model:

| | TCP flow control (rwnd) | ECN / DCQCN | PFC |
|---|---|---|---|
| Layer | L4 (transport) | L3 mark + NIC reaction | L2 (MAC control) |
| Scope | end-to-end, per connection | end-to-end, per QP/flow | one hop, link-local |
| Granularity | per TCP connection | per flow / QP | per priority class (0–7) |
| Carries | window size in the ACK | CE bit, then a CNP | a pause duration in quanta |
| Who reacts | the sending TCP stack | the sender NIC's rate limiter | the upstream port's MAC |
| Reaction time | ~1 RTT | ~1 RTT | microseconds, same link |

So when RoCE congestion builds, the backpressure walks the fabric hop by hop: the congested switch PAUSEs its immediate upstream, that switch's buffer then fills and it PAUSEs its upstream, and so on. Each link makes its own local decision with zero knowledge of flows, connections, or endpoints. That hop-by-hop propagation is exactly why a single hot port can become a fabric-wide PFC storm (below). It is also why PFC is enabled only on the RoCE priority: its job there is to make sure the NIC's end-to-end go-back-N retransmit almost never has to fire.
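
A toy simulation makes the hop-by-hop walk visible. Everything here is an assumption made for illustration: a straight chain of switches, one packet moved per link per tick, a receiver that never drains, and a single XOFF threshold with no headroom or XON. Real switches are far more complex; the point is only the order in which the pauses appear.

```python
def first_pause_ticks(hops: int, xoff: int, ticks: int) -> list:
    """Chain: sender -> sw[0] -> ... -> sw[hops-1] -> blocked receiver.

    Returns, per switch, the tick at which it first crosses XOFF and would
    start pausing its upstream neighbour (None if it never does).
    """
    occupancy = [0] * hops
    first_pause = [None] * hops
    for tick in range(ticks):
        # Each hop only knows whether its direct downstream neighbour is paused.
        paused = [occ >= xoff for occ in occupancy]
        # Forward one packet per link, downstream links first. The last
        # switch never forwards because the receiver is blocked.
        for i in range(hops - 2, -1, -1):
            if occupancy[i] > 0 and not paused[i + 1]:
                occupancy[i] -= 1
                occupancy[i + 1] += 1
        # The sender injects one packet unless sw[0] has paused it.
        if not paused[0]:
            occupancy[0] += 1
        for i in range(hops):
            if first_pause[i] is None and occupancy[i] >= xoff:
                first_pause[i] = tick
    return first_pause


print(first_pause_ticks(hops=3, xoff=2, ticks=10))  # -> [5, 4, 3]
```

The switch nearest the blocked receiver pauses first, and each upstream hop follows one tick later. No hop ever learns why it was paused, only that its neighbour asked.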


Per-priority — why this matters

Old 802.3x pause was global: one PAUSE frame stopped all traffic on the link. That is useless in modern DCs, because management, storage, and control traffic would freeze whenever bulk data congested the link.

PFC is per-priority — you assign each traffic class (CoS / DSCP) to a separate priority queue at the egress, and PFC pauses only the priority that's full. RoCE v2 traffic typically lives on priority 3 by convention (sometimes 4 — depends on the operator).

Practical implication: every switch needs a QoS classification that maps RoCE v2 traffic into the lossless priority, and PFC must be enabled for that priority.

ingress: classify RoCE v2 → priority 3
queue: priority 3 → lossless queue (PFC enabled)
egress: priority 3 → lossless queue (PFC monitored)
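
The ingress classification step can be sketched as a lookup on the DSCP value, which occupies the upper six bits of the IP ToS / Traffic Class byte (the lower two bits are ECN). The map below assumes the DSCP 26 to priority 3 convention used on this page; the names and the default priority of 0 are illustrative, not any vendor's data model.

```python
# Assumption: RoCE v2 is marked DSCP 26 and priority 3 is the only lossless class.
DSCP_TO_PRIORITY = {26: 3}
LOSSLESS_PRIORITIES = {3}
DEFAULT_PRIORITY = 0  # hypothetical default for unmapped traffic


def dscp_from_tos(tos: int) -> int:
    """Extract the 6-bit DSCP from the ToS / Traffic Class byte."""
    return (tos >> 2) & 0x3F


def classify(tos: int) -> tuple:
    """Return (priority, is_lossless) for a packet's ToS byte."""
    priority = DSCP_TO_PRIORITY.get(dscp_from_tos(tos), DEFAULT_PRIORITY)
    return priority, priority in LOSSLESS_PRIORITIES


print(classify(0x68))  # DSCP 26, ECN bits 00
print(classify(0x6A))  # DSCP 26, ECN bits 10: same class, ECN is ignored here
print(classify(0x00))  # best effort
```

Masking out the ECN bits matters: a RoCE packet must land in the same lossless queue whether or not a switch along the path has marked it.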

Buffers and headroom

PFC works by stopping the sender before the buffer overflows. That requires the switch to send the PAUSE frame early enough that even in-flight bytes can fit when they arrive.

  • xoff threshold — buffer occupancy at which the switch starts sending PAUSEs.
  • headroom — buffer reserved above xoff, sized to absorb the in-flight bytes that arrive after the PAUSE is sent but before the sender stops.

Headroom sizing is 2 × cable_propagation_delay × link_bandwidth:

| Cable type | Length | One-way prop delay | Headroom @ 400G |
|---|---|---|---|
| DAC / passive copper | 3 m | ~15 ns | ~1.5 KB |
| AOC / fiber | 30 m | ~150 ns | ~15 KB |
| Fiber across rack | 100 m | ~500 ns | ~50 KB |

Get headroom wrong → drops happen anyway → "lossless" isn't. Most modern switches calculate headroom dynamically based on cable length detection. Older switches require manual configuration.
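
The table rows follow directly from the formula. The sketch below reproduces them, assuming roughly 5 ns of propagation delay per metre (the figure implied by the table). It covers only the cable round trip; any extra margin a platform adds for frames already being serialized or for pause response time is not modelled here.

```python
PROP_DELAY_NS_PER_M = 5.0  # assumption: ~5 ns/m, consistent with the table above


def headroom_bytes(length_m: float, link_bps: float,
                   ns_per_m: float = PROP_DELAY_NS_PER_M) -> float:
    """2 x one-way propagation delay x link bandwidth, expressed in bytes."""
    one_way_s = length_m * ns_per_m * 1e-9
    return 2 * one_way_s * link_bps / 8


for label, metres in (("DAC", 3), ("AOC", 30), ("Fiber", 100)):
    kb = headroom_bytes(metres, 400e9) / 1000
    print(f"{label:5s} {metres:4d} m -> {kb:.1f} KB")
```

Headroom scales linearly with both length and speed, so the same 100 m run needs a quarter of the buffer at 100G that it needs at 400G.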

Why InfiniBand needs none of this

This headroom math exists only because Ethernet has no credit mechanism. InfiniBand's credit-based flow control never lets the sender oversubscribe the receiver in the first place — so there's no overflow to catch and no headroom to size. PFC is Ethernet reacting to congestion after it starts; credits are IB preventing it before a single byte is sent. Same goal, opposite philosophy.


Head-of-line blocking — PFC's biggest sin

PFC works on priority, not on destination. When PFC pauses priority 3 on an ingress port, all priority-3 traffic from that ingress is paused — even if it's destined for a port that's not congested.

So if Server A's NIC is sending PFC PAUSEs to its ToR (because Server A's buffer is full under incast), the ToR's queue toward A backs up until the ToR's own buffers cross XOFF, and the ToR then pauses priority 3 on the ports feeding it. Other servers connected to that ToR can't send their priority-3 traffic to anyone until A's pause clears. That is head-of-line blocking on the ingress.

This compounds across the fabric — the ToR PAUSEs the spine, the spine PAUSEs other ToRs, etc. PFC storms propagate backward through the network.
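
The blocking is easy to see in a deliberately simplified model of a paused sender: one queue per priority, and a pause that holds the whole queue regardless of where each packet is headed. The packet labels and the `transmit` helper are hypothetical.

```python
from collections import deque


def transmit(queues: dict, paused_priorities: set) -> list:
    """Drain every queue whose priority is not paused; return what was sent."""
    sent = []
    for priority, queue in queues.items():
        if priority in paused_priorities:
            continue  # the pause is per priority, so destination is never consulted
        while queue:
            sent.append(queue.popleft())
    return sent


queues = {
    3: deque(["roce->A (congested)", "roce->B (idle)"]),
    0: deque(["mgmt->C"]),
}
print(transmit(queues, paused_priorities={3}))
print(list(queues[3]))  # the packet for idle B is stuck behind the pause
```

Traffic on priority 0 keeps flowing, which is what PFC adds over 802.3x pause. The packet for B waits anyway, which is what PFC cannot fix.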

The mitigation:

  1. ECN does the real work — by the time PFC fires, you've already lost. DCQCN should have dialed back rates before PFC was needed. Aim for "PFC fires rarely."
  2. Per-port PFC counters — monitor pause frames sent and pause frames received per port. Sudden spikes = a hot spot somewhere upstream.
  3. PFC watchdog — most switches can detect "stuck" PAUSE conditions and forcibly clear them after a timeout (typically 100 ms). Saves the fabric from a runaway storm.
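
Counter monitoring (item 2) can be sketched as a rate calculation over two snapshots of cumulative per-port PAUSE counters. How the snapshots are collected is out of scope; the port names, the sample values, and the alert threshold below are all made up for the example, and counter wrap is ignored.

```python
def pause_rates(prev: dict, curr: dict, interval_s: float) -> dict:
    """PAUSE frames per second per port, from two cumulative counter snapshots."""
    return {
        port: (count - prev.get(port, 0)) / interval_s
        for port, count in curr.items()
    }


def hot_ports(prev: dict, curr: dict, interval_s: float,
              threshold_per_s: float) -> list:
    """Ports whose PAUSE rate exceeds the threshold, busiest first."""
    rates = pause_rates(prev, curr, interval_s)
    over = [port for port, rate in rates.items() if rate > threshold_per_s]
    return sorted(over, key=lambda port: rates[port], reverse=True)


# Hypothetical snapshots taken 10 seconds apart.
before = {"Ethernet1/1": 100, "Ethernet1/2": 50}
after = {"Ethernet1/1": 1440, "Ethernet1/2": 52}
print(pause_rates(before, after, 10.0))
print(hot_ports(before, after, 10.0, threshold_per_s=10.0))
```

Alerting on the rate rather than the raw counter is the design choice that matters: a large cumulative count may be old history, while a sudden jump in frames per second points at a hot spot forming now.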
Anti-pattern

Enabling PFC on multiple priorities "just to be safe." Every extra lossless priority is another deadlock surface and another headroom budget you have to size. Only the priority that carries RoCE v2 traffic should be PFC-enabled — that's priority 3 (CoS 3 / DSCP 26) in most fabrics. Storage, management, control: leave them lossy. PFC on all 8 priorities is a 2008-era HPC config that nobody runs in production today.


The PFC deadlock failure mode

In some topologies (especially with cyclic dependencies — uncommon but real), PFC can produce a deadlock: A pauses B, B pauses C, C pauses A. Nothing progresses. PFC watchdog is the only thing that breaks it.

In a properly engineered spine-leaf, deadlocks shouldn't happen. But it's worth knowing they exist when reading vendor docs and tuning.
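
The "A pauses B, B pauses C, C pauses A" condition is a cycle in a directed graph of who is pausing whom. A minimal sketch of detecting one with depth-first search follows; the graph representation and node names are assumptions for illustration, and this is not how a hardware watchdog works (a watchdog acts on a per-port timeout, not on global knowledge).

```python
def find_pause_cycle(pauses: dict):
    """pauses maps a node to the nodes it is currently pausing.

    Returns one cycle as a list (first node repeated at the end), or None.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, fully explored
    color = {}
    path = []

    def visit(node):
        color[node] = GREY
        path.append(node)
        for nxt in pauses.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GREY:  # back edge: nxt is already on the current path
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in list(pauses):
        if color.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None


print(find_pause_cycle({"A": ["B"], "B": ["C"], "C": ["A"]}))
print(find_pause_cycle({"leaf1": ["spine1"], "leaf2": ["spine1"], "spine1": []}))
```

The second call models pauses that only ever point one way, up the tree, which has no cycle. That is the intuition behind the claim that a properly engineered spine-leaf should not deadlock.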


How to configure (Arista / Cisco / Juniper / NVIDIA)

Per-vendor specifics differ, but the conceptual config is the same on all of them:

  1. Define a QoS map that marks DSCP 26 (or whatever you've standardized on) → priority 3 → lossless queue.
  2. Enable PFC on priority 3 on every interface participating in the fabric.
  3. Set buffer headroom explicitly if not auto-detected by the silicon.
  4. Enable PFC watchdog with a sensible timeout (100 ms typical).
  5. Verify with show priority-flow-control and counters.
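
Because the same intent has to hold on every interface, it can help to state it as data and check it mechanically. The sketch below is vendor-neutral and entirely hypothetical: `InterfaceIntent` is an invented record, not any switch API, and it only encodes steps 1, 2, and 4 of the outline with this page's DSCP 26 / priority 3 convention.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class InterfaceIntent:
    """Hypothetical summary of one interface's PFC-related settings."""
    name: str
    dscp_to_priority: dict = field(default_factory=dict)
    pfc_priorities: set = field(default_factory=set)
    watchdog_ms: Optional[int] = None


def audit(intent: InterfaceIntent, dscp: int = 26, priority: int = 3) -> list:
    """Return a list of human-readable problems; empty means the intent holds."""
    problems = []
    if intent.dscp_to_priority.get(dscp) != priority:
        problems.append(f"{intent.name}: DSCP {dscp} is not mapped to priority {priority}")
    if intent.pfc_priorities != {priority}:
        problems.append(f"{intent.name}: PFC should be enabled on priority {priority} only")
    if intent.watchdog_ms is None:
        problems.append(f"{intent.name}: PFC watchdog is not enabled")
    return problems


good = InterfaceIntent("Ethernet1/1", {26: 3}, {3}, 100)
bad = InterfaceIntent("Ethernet1/2", {26: 3}, {3, 4}, None)
print(audit(good))
print(audit(bad))
```

The second interface trips two checks: an extra lossless priority and a missing watchdog, the two mistakes this page warns about most.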

Here's that recipe as a reference snippet. The keywords change from vendor to vendor; the priority-3 no-drop intent does not:

qos map dscp 26 to traffic-class 3
interface Ethernet1/1
priority-flow-control on
priority-flow-control priority 3 no-drop
priority-flow-control mode auto
priority-flow-control watchdog action drop timer 100

These are reference snippets — exact keywords are version- and platform-sensitive (NX-OS queue model, Junos platform). The full multi-vendor walk-through with per-vendor caveats and doc links is on Configure the Fabric.

See a PFC storm forming — and triage it

Real triage walk-through: PagerDuty fires "training slow", you look at PFC counters per port and find the storm root cause in 90 seconds.

MODULE switch-qos · LAB 1: Watch the recording for every command, every counter, every output.

Highlights: spotting one port pulsing 134 PAUSE frames/sec, tracing it to the host behind it, finding a hung GPU. Backpressure climbed all the way from the consumer to the leaf.


💡 What you should remember

| # | Concept | Why it matters |
|---|---|---|
| 1 | ⏸️ PFC = link-level PAUSE per priority | The MAC Control frame stops the sender on the indicated priority class. Other classes keep flowing. |
| 2 | 3️⃣ RoCE v2 lives on priority 3 | CoS 3 / DSCP 26 by convention. Configure it once and apply identically across every switch in the fabric. |
| 3 | 📦 Headroom must cover in-flight bytes | 2 × cable_prop × bandwidth. 100 m fiber @ 400G ≈ 50 KB. Get this wrong and "lossless" isn't. |
| 4 | 🚧 PFC = head-of-line blocking | Pause is per-priority, not per-flow. Innocent flows freeze with the bad ones. ECN should make PFC almost never fire. |
| 5 | 🌀 Storms propagate backward | ToR pauses spine, spine pauses other ToRs. Enable PFC watchdog (100 ms timeout) on every interface. |
| 6 | 🪢 PFC is the safety net | If it's firing in steady state, ECN/DCQCN tuning is broken. Treat frequent PAUSE counters as a tuning emergency. |
| 7 | 🧭 PFC is L2: not TCP, not even IP | A 64-byte MAC Control frame (EtherType 0x8808, dst 01:80:C2:00:00:01). No port, no connection, can't be routed. It pauses one hop; backpressure walks the fabric hop by hop. |

Next: ECN (Explicit Congestion Notification) → the early-warning system that should mean PFC almost never fires.