DCQCN, Buffer Profiles, and Tuning at Scale
You now have the three primitives: PFC stops the sender, ECN marks early, and DCQCN turns ECN signals into rate adjustments. This page covers how they fit together in a production fabric — buffer sizing, where to tune first, and how to know what's wrong.
The three-layer control loop
Switch buffer fills
│
├── below min_th → no signal
├── min_th < q < max_th → ECN-mark some packets (DCQCN signal)
├── q ≥ max_th → ECN-mark all packets (DCQCN hard signal)
└── headroom exhausted → PFC PAUSE (safety net)
The control loop is layered:
- DCQCN does the steady work — receiving CNPs, gently dialing the NIC rate down before queues build.
- PFC is the safety net — if DCQCN can't keep up (a sudden burst, a misconfigured flow), PFC stops the bleeding.
A well-tuned fabric has frequent ECN marks (DCQCN working) and rare PFC pauses (safety net rarely needed). A misconfigured fabric has rare ECN and constant PFC — that's an emergency.
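To make the marking side of that loop concrete, here is a minimal Python sketch of WRED-style ECN marking under the usual linear-ramp model, using the 400G starting thresholds from the table further down this page. It is illustrative only: the function names are not any vendor's API, and real ASICs make this decision per queue, per packet, in hardware.

```python
import random

def ecn_mark_probability(queue_bytes: int, min_th: int, max_th: int, max_p: float) -> float:
    """WRED-style marking probability: zero below min_th, a linear ramp up to
    max_p at max_th, and mark-everything above max_th."""
    if queue_bytes < min_th:
        return 0.0
    if queue_bytes >= max_th:
        return 1.0
    return max_p * (queue_bytes - min_th) / (max_th - min_th)

def should_mark(queue_bytes: int, min_th: int = 100_000,
                max_th: int = 1_500_000, max_p: float = 0.10) -> bool:
    """Per-packet decision to set the ECN CE bit (defaults are the 400G
    starting values from the WRED table below)."""
    return random.random() < ecn_mark_probability(queue_bytes, min_th, max_th, max_p)

# A queue sitting at 800 KB gets roughly 5% of packets marked: enough CNPs for
# DCQCN to trim sender rates long before the queue reaches PFC headroom.
print(ecn_mark_probability(800_000, 100_000, 1_500_000, 0.10))  # ~0.05
```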
Buffer profiles — the foundation
Modern AI switches (NVIDIA Spectrum-4, Broadcom Tomahawk 4/5, Arista 7060X/7800) have shared buffers with multiple pools and per-priority allocations. The buffer profile decides:
- Which traffic classes share which pool
- How much guaranteed buffer each class gets
- How much headroom is reserved for lossless classes (PFC)
- How dynamic the shared pool is (alpha values)
A typical AI-fabric profile splits buffer into:
| Pool | Purpose | Typical share of total |
|---|---|---|
| Lossless (RoCE v2) | Headroom + queue for PFC priority | 40–60% |
| Lossy (management, control) | Best-effort burst absorption | 10–20% |
| Storage | Storage NIC traffic (sometimes also lossless) | 15–25% |
| Reserved | Per-port minimum, telemetry | 10–15% |
Real values depend on the switch silicon. Vendors publish reference profiles (NVIDIA Spectrum-X "AI" profile, Arista R3 profile) — start from these, then tune for your traffic mix.
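The "alpha values" bullet above refers to dynamic thresholding in the shared pool: a queue is allowed to grow to alpha times the currently free pool space, so per-queue limits tighten automatically as the pool fills. Below is a minimal sketch of that rule; the 160 MB buffer, the 50% lossless share, and alpha = 0.5 are illustrative numbers, not any vendor's defaults.

```python
def dynamic_queue_limit(alpha: float, pool_bytes: int, pool_used_bytes: int) -> int:
    """Dynamic threshold: a single queue may occupy up to alpha * (free pool).
    Larger alpha lets one port absorb bigger bursts; smaller alpha keeps the
    pool fairer across ports under sustained congestion."""
    free_bytes = max(pool_bytes - pool_used_bytes, 0)
    return int(alpha * free_bytes)

# Illustrative numbers: 160 MB shared buffer, 50% lossless (RoCE v2) pool,
# alpha = 0.5 for the RoCE priority queue.
lossless_pool = int(0.50 * 160e6)
for used_fraction in (0.0, 0.5, 0.9):
    used = int(used_fraction * lossless_pool)
    limit = dynamic_queue_limit(0.5, lossless_pool, used)
    print(f"pool {used_fraction:.0%} used -> per-queue limit {limit / 1e6:.1f} MB")
```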
Field-tested starting values
These are starting points for a 400G RoCE v2 fabric in a 256-GPU pod. Adjust based on your specific switch and workload — don't blindly copy.
Switch WRED (per RoCE priority queue)
| Parameter | 400G fabric | 800G fabric |
|---|---|---|
| min_th | 100 KB | 200 KB |
| max_th | 1.5 MB | 3 MB |
| max_p (max marking prob) | 0.10 | 0.10 |
PFC headroom (per port)
| Cable length | Headroom @ 400G | Headroom @ 800G |
|---|---|---|
| 3 m DAC | 2 KB | 4 KB |
| 30 m AOC | 20 KB | 40 KB |
| 100 m fiber | 65 KB | 130 KB |
Most modern switches auto-detect cable length and size headroom automatically. Override only if you know what you're doing.
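A back-of-envelope check on where the cable-length scaling comes from: headroom must absorb at least the bytes in flight during the PAUSE round trip over the cable. The sketch below computes only that cable component, assuming roughly 5 ns/m propagation; vendor formulas also add a maximum-size frame in each direction plus switch and NIC response time, which is why the table values above come out somewhat larger.

```python
def cable_rtt_bytes(cable_m: float, link_gbps: float, ns_per_m: float = 5.0) -> float:
    """Bytes in flight on the cable during one PAUSE round trip.

    ns_per_m ~ 5 is the usual propagation figure for fiber and DAC. This is
    only the cable component of PFC headroom; a maximum-size frame in each
    direction plus switch/NIC response time add a further fixed margin.
    """
    rtt_seconds = 2 * cable_m * ns_per_m * 1e-9
    return rtt_seconds * link_gbps * 1e9 / 8

for cable_m in (3, 30, 100):
    at_400g = cable_rtt_bytes(cable_m, 400)
    at_800g = cable_rtt_bytes(cable_m, 800)
    print(f"{cable_m:>4} m: {at_400g / 1e3:5.1f} KB @400G, {at_800g / 1e3:6.1f} KB @800G")
# 3 m -> 1.5/3 KB, 30 m -> 15/30 KB, 100 m -> 50/100 KB: roughly the table's
# values once per-frame and response-time margins are added on top.
```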
DCQCN (on the NIC)
| Knob | Typical | When to change |
|---|---|---|
| Kmin / Kmax | match switch WRED | Always match |
| α update period | 100–200 μs | Decrease if rate cuts too slow |
| Rate AI | 5 Mbps | Increase if utilization stays below 80% |
| Rate HAI | 50 Mbps | Increase if recovery from pause is slow |
| Fast recovery threshold | 5 cycles | Decrease for very bursty workloads |
NVIDIA's reference mlnx_qos script on ConnectX-7 / ConnectX-8 includes defaults that work for most fabrics.
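How those knobs interact is easiest to see in the DCQCN reaction-point algorithm itself (Zhu et al., SIGCOMM 2015), which the NIC runs per QP in firmware/hardware. The sketch below is a simplified model, not ConnectX code: the staging of fast recovery, additive increase, and hyper increase is collapsed onto a single counter, and every constant is illustrative.

```python
from dataclasses import dataclass

@dataclass
class DcqcnSender:
    """Simplified DCQCN reaction point (the sender NIC). All values are
    illustrative; real NICs implement this per-QP in firmware/hardware."""
    line_rate: float = 400e9            # bit/s
    rate_ai: float = 5e6                # additive increase ("Rate AI" knob)
    rate_hai: float = 50e6              # hyper increase ("Rate HAI" knob)
    g: float = 1 / 256                  # alpha gain
    fast_recovery_threshold: int = 5    # cycles before additive increase kicks in
    alpha: float = 1.0
    rate_current: float = 400e9
    rate_target: float = 400e9
    increase_count: int = 0

    def on_cnp(self) -> None:
        """CNP received: remember the current rate as the target, cut by
        alpha/2, then raise alpha so repeated CNPs cut harder."""
        self.rate_target = self.rate_current
        self.rate_current *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g
        self.increase_count = 0

    def on_alpha_timer(self) -> None:
        """One alpha update period passed with no CNP: decay alpha."""
        self.alpha = (1 - self.g) * self.alpha

    def on_increase_timer(self) -> None:
        """Rate-increase cycle: fast recovery toward the old target first,
        then additive increase, then hyper increase if congestion stays away."""
        self.increase_count += 1
        if self.increase_count > 2 * self.fast_recovery_threshold:
            self.rate_target += self.rate_hai
        elif self.increase_count > self.fast_recovery_threshold:
            self.rate_target += self.rate_ai
        self.rate_target = min(self.rate_target, self.line_rate)
        self.rate_current = (self.rate_current + self.rate_target) / 2
```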
The tuning order
When you stand up a new fabric or change a parameter, tune in this order:
1. Get traffic classified correctly
Verify with packet captures or vendor counters that RoCE v2 traffic is hitting the correct priority queue. Most production debugging issues turn out to be misclassification.
2. Verify PFC works at all
Run a microburst test (use ib_write_bw with multiple QPs from many senders to one receiver). PFC PAUSE frames should appear upstream of the congested egress queue, because the congested switch pauses its ingress neighbors. If they don't, PFC is misconfigured.
3. Tune ECN so DCQCN does the work
Start with the reference WRED values. Run a sustained training-like workload. Monitor:
- Switch: ECN marks/sec per port
- NIC: CNPs sent/received
- Application: AllReduce time
If PFC is firing during steady state, lower min_th. If utilization is too low, raise max_th or lower max_p (see the sketch after this list).
4. Verify under failure
Drain one rail, kill one server, induce a microburst. Confirm the fabric stays lossless (no RDMA timeouts) and recovers cleanly when the failure clears.
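The monitoring step above reduces to a small decision rule. The sketch below assumes you can already poll ECN-mark, PFC-pause, and link-utilization figures from your telemetry pipeline; the parameter names and thresholds are illustrative starting points, not vendor counters or universal constants.

```python
def suggest_next_knob(ecn_marks_per_s: float, pfc_pause_per_s: float,
                      utilization: float) -> str:
    """Map the step-3 counters onto a next action (illustrative thresholds)."""
    if pfc_pause_per_s > 1 and ecn_marks_per_s < 100:
        return "PFC firing with almost no ECN: lower min_th so DCQCN reacts first"
    if pfc_pause_per_s > 1:
        return "PFC still firing in steady state: lower min_th (mark earlier)"
    if utilization < 0.80:
        return "Utilization low: raise max_th, lower max_p, or raise Rate AI on the NIC"
    return "Healthy: frequent-but-moderate ECN, rare PFC, utilization above 80%"

print(suggest_next_knob(ecn_marks_per_s=20, pfc_pause_per_s=5, utilization=0.72))
```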
Symptoms → knobs
A debugging cheat sheet:
| Symptom | Likely knob |
|---|---|
| AllReduce time is unstable (varies 2× step to step) | ECN tuning — min_th too high, DCQCN reacting too late |
| PFC frames frequent on a specific port | Hot spot upstream — check ECMP balance, hash polarization on that port |
| PFC frames everywhere | Buffer profile too small, or DCQCN disabled on the senders |
| RDMA timeouts | Actual drops happening — headroom underprovisioned, or PFC priority misconfigured |
| Utilization low (<60%) but no errors | DCQCN too conservative — raise Rate AI or lower α update period |
| PFC deadlock detected | Topology has a cycle, or watchdog timeout too long. Verify spine-leaf has no loops. |
Vendor differences worth knowing
The same conceptual config translates differently across vendors:
| Function | NVIDIA Spectrum-X | Arista 7060X / 7800 | Broadcom Tomahawk (Sonic / EOS) |
|---|---|---|---|
| Buffer profile | "AI" profile (vendor-supplied) | qos service-policy with R3 profile | OpenConfig YANG or Sonic config |
| Headroom | Auto from cable detection | Auto + override | Configure per port |
| WRED | qos profile | qos profile lossless | Standard SAI / OpenConfig |
| Telemetry | NVIDIA Air / DOCA telemetry | LANZ, sFlow | gNMI streaming |
| Adaptive routing | Built-in (Spectrum-X selling point) | Custom (depends on model) | Limited |
Spectrum-X has the most aggressive AI-fabric features (adaptive routing in silicon, NIC + switch co-tuning). Arista's R3 has solid PFC + ECN + telemetry. Broadcom-based white-box is the most open but requires more manual tuning.
What you should remember
- DCQCN does the steady work; PFC is the safety net. If PFC fires often, ECN tuning is wrong.
- Match switch WRED thresholds (Kmin/Kmax) with NIC DCQCN expectations. They must agree.
- Start from vendor reference profiles (Spectrum-X AI profile, Arista R3) — don't tune from scratch.
- Tune in order: classify → verify PFC → tune ECN → test under failure.
- The most common bug is misclassification. Always verify traffic is hitting the right priority queue.
- Cable length affects headroom. Modern switches auto-detect; older ones don't.
Next: Host Networking → — drilling into the host side: SR-IOV mechanics, Multus pod attachments, GPU/Network Operator configuration. Or revisit AI Fabric Architecture for the topology context that informs QoS tuning.