DCQCN, Buffer Profiles, and Tuning at Scale

You now have the three primitives: PFC stops the sender, ECN marks early, DCQCN turns ECN signals into rate adjustments. This page is how they fit together in a production fabric — buffer sizing, where to tune first, and how to know what's wrong.

The three-layer control loop

  Switch buffer fills
        │
        ├── below min_th        →  no signal
        ├── min_th < q < max_th →  ECN-mark some packets (DCQCN signal)
        ├── q ≥ max_th          →  ECN-mark all packets (DCQCN hard signal)
        └── headroom exhausted  →  PFC PAUSE (safety net)
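
The ECN part of the diagram can be sketched as a marking-probability function. This is a minimal illustration, assuming the usual WRED behavior of a linear ramp from 0 to `max_p` between the two thresholds; the function name is hypothetical, and the thresholds are the 400G starting values listed later on this page.

```python
def ecn_mark_probability(queue_bytes, min_th, max_th, max_p):
    """Probability that an arriving packet is ECN-marked at this queue depth."""
    if queue_bytes < min_th:
        return 0.0                      # below min_th: no signal
    if queue_bytes >= max_th:
        return 1.0                      # at or above max_th: mark every packet
    # Between the thresholds: assumed linear ramp from 0 up to max_p.
    return max_p * (queue_bytes - min_th) / (max_th - min_th)


KB = 1000
# Illustrative thresholds (400G starting values from this page).
MIN_TH, MAX_TH, MAX_P = 100 * KB, 1500 * KB, 0.10

for depth in (50 * KB, 800 * KB, 1500 * KB):
    p = ecn_mark_probability(depth, MIN_TH, MAX_TH, MAX_P)
    print(f"{depth // KB:>5} KB -> mark probability {p:.3f}")
```

Note the jump at `max_th`: just below it the probability is close to `max_p`, and at the threshold it goes straight to 1. That step is the "hard signal" in the diagram.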

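The control loop has three layers:

ECN marking is the early signal: the switch marks packets once the queue crosses min_th, well before the buffer is at risk.
DCQCN does the steady work: the receiving NIC turns ECN marks into CNPs, and the sending NIC gently dials its rate down before queues build.
PFC is the safety net: if DCQCN can't keep up (a sudden burst, a misconfigured flow), PFC stops the bleeding by pausing the upstream sender.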

A well-tuned fabric has frequent ECN marks (DCQCN working) and rare PFC pauses (safety net rarely needed). A misconfigured fabric has rare ECN and constant PFC — that's an emergency.

Buffer profiles — the foundation

Modern AI switches (NVIDIA Spectrum-4, Broadcom Tomahawk 4/5, Arista 7060X/7800) have shared buffers with multiple pools and per-priority allocations. The buffer profile decides:

Which traffic classes share which pool
How much guaranteed buffer each class gets
How much headroom is reserved for lossless classes (PFC)
How dynamic the shared pool is (alpha values)

A typical AI-fabric profile splits buffer into:

Pool	Purpose	Typical share of total
Lossless (RoCE v2)	Headroom + queue for PFC priority	40–60%
Lossy (management, control)	Best-effort burst absorption	10–20%
Storage	Storage NIC traffic (sometimes also lossless)	15–25%
Reserved	Per-port minimum, telemetry	10–15%

Real values depend on the switch silicon. Vendors publish reference profiles (NVIDIA Spectrum-X "AI" profile, Arista R3 profile) — start from these, then tune for your traffic mix.
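
The ranges in the table cannot all sit at their maximum (that would total 120%) or their minimum (75%), so a concrete profile has to pick one point per pool that sums to 100%. A small sketch of that arithmetic, assuming a hypothetical 128 MiB shared buffer and hypothetical pool names; neither comes from a vendor profile.

```python
MIB = 1024 * 1024


def split_buffer(total_bytes, shares_pct):
    """Turn integer percentage shares into per-pool byte budgets."""
    total_pct = sum(shares_pct.values())
    if total_pct != 100:
        raise ValueError(f"shares sum to {total_pct}%, expected 100%")
    # Floor division: a few bytes may be left unallocated, never over-allocated.
    return {pool: total_bytes * pct // 100 for pool, pct in shares_pct.items()}


# Hypothetical values: a 128 MiB buffer and one point from each table range.
TOTAL_BUFFER = 128 * MIB
SHARES = {"lossless": 50, "lossy": 15, "storage": 20, "reserved": 15}

pools = split_buffer(TOTAL_BUFFER, SHARES)
for pool, size in pools.items():
    print(f"{pool:<9} {size / MIB:6.1f} MiB")
```

Real silicon adds constraints this sketch ignores (cell sizes, per-port minimums, dynamic alpha sharing), so treat the output as a planning number, not a config.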

Field-tested starting values

These are starting points for a 400G RoCE v2 fabric in a 256-GPU pod. Adjust based on your specific switch and workload — don't blindly copy.

Switch WRED (per RoCE priority queue)

Parameter	400G fabric	800G fabric
`min_th`	100 KB	200 KB
`max_th`	1.5 MB	3 MB
`max_p` (max marking prob)	0.10	0.10

PFC headroom (per port)

Cable length	Headroom @ 400G	Headroom @ 800G
3 m DAC	2 KB	4 KB
30 m AOC	20 KB	40 KB
100 m fiber	65 KB	130 KB

Most modern switches auto-detect cable length and size headroom accordingly. Override only if you know what you're doing.
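
The cable-length dependence in the table comes mostly from data that is still in flight when a PAUSE frame is sent. A rough sketch of that component, assuming about 5 ns of propagation delay per metre of cable (an approximation, not a figure from this page); the function name is hypothetical.

```python
PROPAGATION_NS_PER_M = 5.0   # assumption: roughly 5 ns per metre of cable


def cable_inflight_bytes(length_m, rate_gbps):
    """Bytes on the wire during one cable round trip at line rate."""
    round_trip_ns = 2 * length_m * PROPAGATION_NS_PER_M
    # Gbit/s multiplied by ns gives bits; divide by 8 for bytes.
    return rate_gbps * round_trip_ns / 8


for length_m in (3, 30, 100):
    at_400 = cable_inflight_bytes(length_m, 400)
    at_800 = cable_inflight_bytes(length_m, 800)
    print(f"{length_m:>4} m: {at_400 / 1000:6.1f} KB @ 400G, "
          f"{at_800 / 1000:6.1f} KB @ 800G")
```

The results come out somewhat below the table values. That is expected: this covers only the cable round trip, while real headroom also has to absorb the sender's reaction time and any packet already being serialized when the PAUSE arrives.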

DCQCN (on the NIC)

Knob	Typical	When to change
`Kmin / Kmax`	match switch WRED	Always match
`α update period`	100–200 μs	Decrease if rate cuts are too slow
`Rate AI`	5 Mbps	Increase if utilization stays below 80%
`Rate HAI`	50 Mbps	Increase if recovery from pause is slow
`Fast recovery threshold`	5 cycles	Decrease for very bursty workloads

NVIDIA's reference mlnx_qos script on ConnectX-7 / ConnectX-8 includes defaults that work for most fabrics.
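
To show how the knobs in the table interact, here is a simplified sketch of sender-side DCQCN rate logic, loosely following the published algorithm description. Several things are assumptions rather than content from this page: the gain `g = 1/256` and initial `alpha = 1`, the class and method names, and the hyper-increase step, which is reduced here to a fixed `Rate HAI` increment. Real NIC firmware differs in detail.

```python
from dataclasses import dataclass


@dataclass
class DcqcnSender:
    line_rate: float                    # Mbps
    rate_ai: float = 5.0                # "Rate AI", Mbps
    rate_hai: float = 50.0              # "Rate HAI", Mbps
    fast_recovery_threshold: int = 5    # "Fast recovery threshold"
    g: float = 1 / 256                  # alpha gain (assumed value)
    alpha: float = 1.0                  # congestion estimate, starts at 1
    current: float = 0.0                # current sending rate
    target: float = 0.0                 # rate to recover toward
    timer_count: int = 0
    byte_count: int = 0

    def __post_init__(self):
        self.current = self.target = self.line_rate

    def on_cnp(self):
        """CNP received: remember the old rate, cut, and raise alpha."""
        self.target = self.current
        self.current *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g
        self.timer_count = self.byte_count = 0

    def on_alpha_timer(self):
        """Alpha update period elapsed without a CNP: decay alpha."""
        self.alpha *= 1 - self.g

    def on_increase_event(self, from_timer=True):
        """The rate-increase timer or the byte counter expired."""
        if from_timer:
            self.timer_count += 1
        else:
            self.byte_count += 1
        f = self.fast_recovery_threshold
        if max(self.timer_count, self.byte_count) < f:
            pass                                  # fast recovery: target unchanged
        elif min(self.timer_count, self.byte_count) > f:
            self.target += self.rate_hai          # hyper increase (simplified)
        else:
            self.target += self.rate_ai           # additive increase
        self.target = min(self.target, self.line_rate)
        # Every stage moves the current rate halfway to the target.
        self.current = min((self.target + self.current) / 2, self.line_rate)


sender = DcqcnSender(line_rate=400_000.0)         # 400G expressed in Mbps
sender.on_cnp()
print(f"after CNP: {sender.current:.0f} Mbps")
for step in range(1, 7):
    sender.on_increase_event()
    print(f"increase {step}: {sender.current:.0f} Mbps")
```

The sketch makes one tuning intuition visible: a shorter `α update period` means `on_alpha_timer` runs more often between CNPs, so alpha decays faster and the next rate cut is smaller.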

The tuning order

When you stand up a new fabric or change a parameter, tune in this order:

1. Get traffic classified correctly

Verify with packet captures or vendor counters that RoCE v2 traffic is hitting the correct priority queue. Most production debugging issues turn out to be misclassification.
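
When working from a packet capture, the fields to check are the UDP destination port (4791 for RoCE v2), the DSCP value, and the ECN bits. A minimal sketch, assuming you have already extracted the IPv4 ToS byte and UDP port from each packet; the function name is hypothetical, and the expected DSCP of 26 is only a placeholder for whatever your fabric maps to the RoCE priority.

```python
ROCEV2_UDP_PORT = 4791      # RoCE v2 UDP destination port
EXPECTED_DSCP = 26          # placeholder: substitute your fabric's RoCE DSCP


def check_classification(tos_byte, udp_dport, expected_dscp=EXPECTED_DSCP):
    """Return a list of problems found in one packet's header fields."""
    if udp_dport != ROCEV2_UDP_PORT:
        return ["not RoCE v2 (UDP destination port is not 4791)"]
    dscp = tos_byte >> 2            # upper six bits of the ToS byte
    ecn = tos_byte & 0b11           # lower two bits of the ToS byte
    problems = []
    if dscp != expected_dscp:
        problems.append(f"DSCP {dscp}, expected {expected_dscp}")
    if ecn == 0b00:
        problems.append("ECN field is Not-ECT, so the switch cannot ECN-mark it")
    return problems


good = (26 << 2) | 0b10             # DSCP 26 with ECT(0)
print(check_classification(good, 4791))
print(check_classification(0b10, 4791))      # DSCP 0: wrong class
print(check_classification(26 << 2, 4791))   # right DSCP, ECN not enabled
```

This only validates what the sender put on the wire. Whether the switch maps that DSCP to the intended queue still has to be confirmed from the switch's per-queue counters.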

2. Verify PFC works at all

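Run an incast microburst test (for example, ib_write_bw with multiple QPs from many senders to one receiver). When the receiver-facing egress congests, the switch should send PFC PAUSE frames upstream from the ingress ports that feed it; look for them in the PFC counters on those ports and on the sending NICs. If none appear even though the egress is clearly congested, PFC is misconfigured.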

3. Tune ECN so DCQCN does the work

Start with the reference WRED values. Run a sustained training-like workload. Monitor:

Switch: ECN marks/sec per port
NIC: CNPs sent/received
Application: AllReduce time

If PFC is firing during steady state, lower min_th. If utilization is too low, raise max_th or lower max_p.
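
The two rules above can be written as a small decision helper to run against each monitoring sample. This is only a sketch: the function name, the PFC tolerance of one pause per second, and the 80% utilization target are hypothetical placeholders to replace with your own baselines.

```python
def suggest_wred_change(pfc_pauses_per_sec, utilization,
                        pfc_tolerance=1.0, target_utilization=0.80):
    """Map steady-state counters to the next WRED adjustment to try."""
    # PFC firing in steady state means ECN is signalling too late.
    if pfc_pauses_per_sec > pfc_tolerance:
        return "lower min_th"
    # No PFC pressure but the links are underused: marking is too aggressive.
    if utilization < target_utilization:
        return "raise max_th or lower max_p"
    return "no change"


samples = [
    {"pfc_pauses_per_sec": 40.0, "utilization": 0.91},
    {"pfc_pauses_per_sec": 0.0, "utilization": 0.55},
    {"pfc_pauses_per_sec": 0.2, "utilization": 0.88},
]
for sample in samples:
    print(sample, "->", suggest_wred_change(**sample))
```

PFC is checked first on purpose: a fabric that is pausing in steady state should be fixed before chasing utilization. Change one threshold at a time and re-run the workload between changes.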

4. Verify under failure

Drain one rail, kill one server, induce a microburst. Confirm the fabric stays lossless (no RDMA timeouts) and recovers cleanly when the failure clears.

Symptoms → knobs

A debugging cheat sheet:

Symptom	Likely knob
AllReduce time is unstable (varies 2× step to step)	ECN tuning — `min_th` too high, DCQCN reacting too late
PFC frames frequent on a specific port	Hot spot upstream — check ECMP balance, hash polarization on that port
PFC frames everywhere	Buffer profile too small, or DCQCN disabled on the senders
RDMA timeouts	Actual drops happening — headroom underprovisioned, or PFC priority misconfigured
Utilization low (<60%) but no errors	DCQCN too conservative — raise `Rate AI` or lower `α update period`
PFC deadlock detected	Topology has a cycle, or watchdog timeout too long. Verify spine-leaf has no loops.

Vendor differences worth knowing

The same conceptual config translates differently across vendors:

Function	NVIDIA Spectrum-X	Arista 7060X / 7800	Broadcom Tomahawk (SONiC / EOS)
Buffer profile	"AI" profile (vendor-supplied)	`qos service-policy` with R3 profile	OpenConfig YANG or SONiC config
Headroom	Auto from cable detection	Auto + override	Configure per port
WRED	`qos profile`	`qos profile lossless`	Standard SAI / OpenConfig
Telemetry	NVIDIA Air / DOCA telemetry	LANZ, sFlow	gNMI streaming
Adaptive routing	Built-in (Spectrum-X selling point)	Custom (depends on model)	Limited

Spectrum-X has the most aggressive AI-fabric features (adaptive routing in silicon, NIC + switch co-tuning). Arista's R3 has solid PFC + ECN + telemetry. Broadcom-based white-box is the most open but requires more manual tuning.

What you should remember

DCQCN does the steady work; PFC is the safety net. If PFC fires often, ECN tuning is wrong.
Match switch WRED thresholds (Kmin/Kmax) with NIC DCQCN expectations. They must agree.
Start from vendor reference profiles (Spectrum-X AI profile, Arista R3) — don't tune from scratch.
Tune in order: classify → verify PFC → tune ECN → test under failure.
The most common bug is misclassification. Always verify traffic is hitting the right priority queue.
Cable length affects headroom. Modern switches auto-detect; older ones don't.

Next: Host Networking drills into the host side: SR-IOV mechanics, Multus pod attachments, and GPU/Network Operator configuration. Or revisit AI Fabric Architecture for the topology context that informs QoS tuning.
