
DCQCN, Buffer Profiles, and Tuning at Scale

You now have the three primitives: PFC stops the sender, ECN marks early, DCQCN turns ECN signals into rate adjustments. This page covers how they fit together in a production fabric — buffer sizing, where to tune first, and how to know what's wrong.


The three-layer control loop

Switch buffer fills

├── q < min_th → no signal
├── min_th < q < max_th → ECN-mark some packets (DCQCN signal)
├── q ≥ max_th → ECN-mark all packets (DCQCN hard signal)
└── headroom exhausted → PFC PAUSE (safety net)

The control loop is layered:

  1. DCQCN does the steady work — receiving CNPs, gently dialing the NIC rate down before queues build.
  2. PFC is the safety net — if DCQCN can't keep up (a sudden burst, a misconfigured flow), PFC stops the bleeding.

A well-tuned fabric has frequent ECN marks (DCQCN working) and rare PFC pauses (safety net rarely needed). A misconfigured fabric has rare ECN and constant PFC — that's an emergency.
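
To make the ramp between min_th and max_th concrete, here is a minimal sketch of the WRED-style ECN marking decision from the diagram above. It is an illustrative model (real switches compute this per packet in hardware, usually on an averaged queue length), and the example thresholds are the 400G starting values from the tables further down.

```python
import random

def ecn_mark_probability(queue_bytes: int, min_th: int, max_th: int, max_p: float) -> float:
    """WRED-style ECN marking probability for a given queue depth.
    Illustrative model: real switches do this per packet in hardware,
    usually on an averaged queue length."""
    if queue_bytes < min_th:
        return 0.0                      # below min_th: no signal
    if queue_bytes >= max_th:
        return 1.0                      # at/above max_th: mark everything
    # Linear ramp between min_th and max_th, capped at max_p
    return max_p * (queue_bytes - min_th) / (max_th - min_th)

def should_mark(queue_bytes: int, min_th: int, max_th: int, max_p: float) -> bool:
    return random.random() < ecn_mark_probability(queue_bytes, min_th, max_th, max_p)

# Example with the 400G starting values from the table further down:
# min_th = 100 KB, max_th = 1.5 MB, max_p = 0.10.
print(ecn_mark_probability(800_000, 100_000, 1_500_000, 0.10))   # ~0.05
```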


Buffer profiles — the foundation

Modern AI switches (NVIDIA Spectrum-4, Broadcom Tomahawk 4/5, Arista 7060X/7800) have shared buffers with multiple pools and per-priority allocations. The buffer profile decides:

  • Which traffic classes share which pool
  • How much guaranteed buffer each class gets
  • How much headroom is reserved for lossless classes (PFC)
  • How dynamic the shared pool is (alpha values)

A typical AI-fabric profile splits buffer into:

Pool                         | Purpose                                       | Typical share of total
Lossless (RoCE v2)           | Headroom + queue for PFC priority             | 40–60%
Lossy (management, control)  | Best-effort burst absorption                  | 10–20%
Storage                      | Storage NIC traffic (sometimes also lossless) | 15–25%
Reserved                     | Per-port minimum, telemetry                   | 10–15%

Real values depend on the switch silicon. Vendors publish reference profiles (NVIDIA Spectrum-X "AI" profile, Arista R3 profile) — start from these, then tune for your traffic mix.
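
The "alpha" knob from the list above is worth one concrete example. On shared-buffer ASICs, a queue is typically allowed to grow until it reaches alpha times the currently free shared buffer, so a larger alpha lets a single port or priority grab more of the pool during a burst. A minimal sketch of that relationship; the numbers and function name are illustrative, not any vendor's API:

```python
def dynamic_queue_limit(alpha: float, free_shared_bytes: int) -> int:
    """Dynamic-threshold model used by shared-buffer switches: a queue may
    grow until it reaches alpha * (free shared buffer).  Illustrative only."""
    return int(alpha * free_shared_bytes)

# With 48 MB of shared buffer free and alpha = 1/8, a single RoCE queue can
# burst to roughly 6 MB before the shared pool starts pushing back.
print(dynamic_queue_limit(alpha=1 / 8, free_shared_bytes=48 * 1024 * 1024) // 1024, "KB")
```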


Field-tested starting values

These are starting points for a 400G RoCE v2 fabric in a 256-GPU pod. Adjust based on your specific switch and workload — don't blindly copy.

Switch WRED (per RoCE priority queue)

Parameter                | 400G fabric | 800G fabric
min_th                   | 100 KB      | 200 KB
max_th                   | 1.5 MB      | 3 MB
max_p (max marking prob) | 0.10        | 0.10

PFC headroom (per port)

Cable length | Headroom @ 400G | Headroom @ 800G
3 m DAC      | 2 KB            | 4 KB
30 m AOC     | 20 KB           | 40 KB
100 m fiber  | 65 KB           | 130 KB

Most modern switches auto-detect cable length and size headroom automatically. Override only if you know what you're doing.
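
If you do need to size headroom by hand, the arithmetic behind the table is a round-trip calculation: after the switch signals PAUSE, the sender keeps transmitting for roughly one cable round trip, so the buffer must absorb that many bytes at line rate, plus packets already in flight and the PAUSE response time, which vendors add on top. A simplified sketch, assuming ~5 ns/m propagation delay and deliberately computing only the cable-RTT term:

```python
def cable_rtt_headroom_bytes(cable_m: float, link_gbps: float,
                             prop_ns_per_m: float = 5.0) -> int:
    """Lower bound on PFC headroom: bytes that keep arriving during one cable
    round trip after xoff is signalled.  Real sizing also adds MTU-in-flight
    and PAUSE generation/response time, which is why vendor numbers are higher."""
    rtt_ns = 2 * cable_m * prop_ns_per_m
    bytes_per_ns = link_gbps / 8            # 400 Gbps ≈ 50 bytes per nanosecond
    return int(rtt_ns * bytes_per_ns)

# 100 m of fiber at 400G: ~50 KB from propagation alone, which is why the
# table above reserves ~65 KB once the extra terms are added.
print(cable_rtt_headroom_bytes(100, 400))   # 50000
```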

DCQCN (on the NIC)

Knob                    | Typical           | When to change
Kmin / Kmax             | match switch WRED | Always match
α update period         | 100–200 μs        | Decrease if rate cuts come too slowly
Rate AI                 | 5 Mbps            | Increase if utilization stays below 80%
Rate HAI                | 50 Mbps           | Increase if recovery from pause is slow
Fast recovery threshold | 5 cycles          | Decrease for very bursty workloads

On ConnectX-7 / ConnectX-8, the DCQCN defaults that ship with NVIDIA's firmware and mlnx_qos tooling work for most fabrics.
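
To see how Rate AI, Rate HAI, and the α update period interact, here is a minimal model of the DCQCN sender-side (reaction point) logic from the original paper (Zhu et al., SIGCOMM 2015): each CNP cuts the current rate in proportion to α, and when CNPs stop, the rate climbs back through fast recovery, additive increase (Rate AI), and hyper increase (Rate HAI). Variable names and the stage boundaries are simplified for illustration; real NICs implement this in firmware.

```python
class DcqcnSender:
    """Minimal DCQCN reaction-point model (after Zhu et al., SIGCOMM 2015).
    Rates in Mbps; parameters mirror the knobs in the table above."""

    def __init__(self, line_rate=400_000, rate_ai=5, rate_hai=50,
                 g=1 / 256, fast_recovery_cycles=5):
        self.rc = float(line_rate)      # current sending rate
        self.rt = float(line_rate)      # target rate to recover toward
        self.alpha = 1.0                # congestion estimate
        self.g = g
        self.rate_ai = rate_ai
        self.rate_hai = rate_hai
        self.f = fast_recovery_cycles
        self.quiet_cycles = 0           # update periods since the last CNP

    def on_cnp(self):
        """ECN feedback (CNP) arrived: cut rate in proportion to alpha."""
        self.rt = self.rc
        self.rc = self.rc * (1 - self.alpha / 2)
        self.alpha = (1 - self.g) * self.alpha + self.g
        self.quiet_cycles = 0

    def on_update_period_without_cnp(self):
        """One alpha-update period elapsed with no CNP: decay alpha, raise rate."""
        self.alpha = (1 - self.g) * self.alpha
        self.quiet_cycles += 1
        if self.quiet_cycles <= self.f:
            pass                            # fast recovery: climb back toward rt
        elif self.quiet_cycles <= 2 * self.f:
            self.rt += self.rate_ai         # additive increase
        else:
            self.rt += self.rate_hai        # hyper increase
        self.rc = (self.rt + self.rc) / 2

s = DcqcnSender()
s.on_cnp()
print(round(s.rc))   # 200000: the first CNP halves the rate (alpha starts at 1.0)
```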


The tuning order

When you stand up a new fabric or change a parameter, tune in this order:

1. Get traffic classified correctly

Verify with packet captures or vendor counters that RoCE v2 traffic is hitting the correct priority queue. Most production debugging issues turn out to be misclassification.
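
A quick sanity check is to confirm that the DSCP value your NICs stamp on RoCE v2 packets maps to the lossless traffic class on every switch in the path. The mapping below (DSCP 26 → priority 3) is a common convention, not a universal default; substitute the values from your own configuration:

```python
# Hypothetical dscp -> traffic-class map pulled from a switch config.
# DSCP 26 -> TC 3 is a common RoCE v2 convention, not a guaranteed default.
DSCP_TO_TC = {26: 3, 48: 6, 0: 0}
LOSSLESS_TCS = {3}          # classes with PFC enabled
ROCE_DSCP = 26              # what the NICs are configured to stamp

def roce_is_lossless(dscp_to_tc: dict, roce_dscp: int, lossless_tcs: set) -> bool:
    """True if RoCE v2 traffic lands in a traffic class with PFC enabled."""
    return dscp_to_tc.get(roce_dscp, 0) in lossless_tcs

assert roce_is_lossless(DSCP_TO_TC, ROCE_DSCP, LOSSLESS_TCS), \
    "RoCE v2 DSCP does not map to a lossless class -- fix classification first"
```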

2. Verify PFC works at all

Run an incast microburst test (for example, ib_write_bw with multiple QPs from many senders to one receiver). PFC PAUSE frames should be sent back out of the switch ports facing the senders (the switch pauses upstream of the congested egress). If no PAUSE frames appear, PFC is misconfigured.
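
One way to confirm the PAUSE frames are real and reaching the senders is to snapshot per-priority pause counters on a sender NIC before and after the burst. The counter name below follows mlx5-style ethtool -S naming for priority 3; names vary by driver and firmware, so treat it as a placeholder:

```python
import re
import subprocess

# mlx5-style per-priority pause counter; adjust the name for your driver
# and for whichever priority carries RoCE traffic.
PAUSE_COUNTER = re.compile(r"rx_prio3_pause:\s+(\d+)")

def read_rx_pause(iface: str) -> int:
    """Read the received-PAUSE counter for the RoCE priority from ethtool -S."""
    out = subprocess.run(["ethtool", "-S", iface],
                         capture_output=True, text=True).stdout
    match = PAUSE_COUNTER.search(out)
    return int(match.group(1)) if match else 0

before = read_rx_pause("eth2")      # hypothetical sender-side interface name
# ... run the incast / microburst test here ...
after = read_rx_pause("eth2")
print(f"PAUSE frames received during burst: {after - before}")
```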

3. Tune ECN so DCQCN does the work

Start with the reference WRED values. Run a sustained training-like workload. Monitor:

  • Switch: ECN marks/sec per port
  • NIC: CNPs sent/received
  • Application: AllReduce time

If PFC is firing during steady state, lower min_th. If utilization is too low, raise max_th or lower max_p.
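
Those decision rules can be captured as a small triage helper fed by the same three counters. The thresholds below are judgment calls, not standards; this is a sketch, not vendor guidance:

```python
def wred_tuning_hint(pfc_pause_per_s: float, ecn_marks_per_s: float,
                     utilization: float) -> str:
    """Suggest the next WRED/DCQCN adjustment from steady-state counters.
    Thresholds are illustrative starting points."""
    if pfc_pause_per_s > 0 and ecn_marks_per_s < 100:
        return "PFC firing before ECN: lower min_th so DCQCN reacts earlier"
    if pfc_pause_per_s > 0:
        return "ECN marking but still pausing: lower min_th and/or raise max_p"
    if utilization < 0.80:
        return "Lossless but under-utilized: raise max_th or lower max_p"
    return "Healthy: frequent ECN, no PFC, utilization above 80%"

print(wred_tuning_hint(pfc_pause_per_s=0, ecn_marks_per_s=5_000, utilization=0.72))
```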

4. Verify under failure

Drain one rail, kill one server, induce a microburst. Confirm the fabric stays lossless (no RDMA timeouts) and recovers cleanly when the failure clears.


Symptoms → knobs

A debugging cheat sheet:

Symptom                                             | Likely knob
AllReduce time is unstable (varies 2× step to step) | ECN tuning — min_th too high, DCQCN reacting too late
PFC frames frequent on a specific port              | Hot spot upstream — check ECMP balance, hash polarization on that port
PFC frames everywhere                               | Buffer profile too small, or DCQCN disabled on the senders
RDMA timeouts                                       | Actual drops happening — headroom underprovisioned, or PFC priority misconfigured
Utilization low (<60%) but no errors                | DCQCN too conservative — raise Rate AI or lower α update period
PFC deadlock detected                               | Topology has a cycle, or watchdog timeout too long. Verify spine-leaf has no loops.

Vendor differences worth knowing

The same conceptual config translates differently across vendors:

Function         | NVIDIA Spectrum-X                   | Arista 7060X / 7800                | Broadcom Tomahawk (SONiC / EOS)
Buffer profile   | "AI" profile (vendor-supplied)      | qos service-policy with R3 profile | OpenConfig YANG or SONiC config
Headroom         | Auto from cable detection           | Auto + override                    | Configure per port
WRED             | qos profile                         | qos profile lossless               | Standard SAI / OpenConfig
Telemetry        | NVIDIA Air / DOCA telemetry         | LANZ, sFlow                        | gNMI streaming
Adaptive routing | Built-in (Spectrum-X selling point) | Custom (depends on model)          | Limited
Spectrum-X has the most aggressive AI-fabric features (adaptive routing in silicon, NIC + switch co-tuning). Arista's R3 has solid PFC + ECN + telemetry. Broadcom-based white-box is the most open but requires more manual tuning.


What you should remember

  • DCQCN does the steady work; PFC is the safety net. If PFC fires often, ECN tuning is wrong.
  • Match switch WRED thresholds (Kmin/Kmax) with NIC DCQCN expectations. They must agree.
  • Start from vendor reference profiles (Spectrum-X AI profile, Arista R3) — don't tune from scratch.
  • Tune in order: classify → verify PFC → tune ECN → test under failure.
  • The most common bug is misclassification. Always verify traffic is hitting the right priority queue.
  • Cable length affects headroom. Modern switches auto-detect; older ones don't.

Next: Host Networking — drilling into the host side: SR-IOV mechanics, Multus pod attachments, GPU/Network Operator configuration. Or revisit AI Fabric Architecture for the topology context that informs QoS tuning.