Hyperscaler RoCE Stacks

The previous pages laid out the design space — transports and congestion control algorithms. This page is the field map: what every major hyperscaler actually ships in production, and why.

If you sat down with a network engineer from Meta or AWS tomorrow, half their vocabulary would be different. They'd talk about VOQ, cells, Falcon, EFA, SRD, INT, Swift, DSF, UEC. They aren't doing magic. They're just at a different point in the same design space, optimizing for a different scale, a different cost curve, a different customer.

After this page, you'll be able to

List the four escape routes from PFC — smarter end-host CC, cell-switched fabrics, custom transports, adaptive routing on RoCEv2.
Map each hyperscaler to a route: Meta → DSF cells, Google → Falcon, AWS → SRD, NVIDIA → Spectrum-X adaptive routing, Alibaba → HPCC, xAI → Spectrum-X on single-socket boxes, and Microsoft, which stays on PFC with DCQCN tuned for 400G.
Recognize where PFC + DCQCN still wins — 4K–16K GPUs, single-vendor stack, no time to write your own transport.
Read the UEC roadmap — UEC 1.0 (2025), 1.1, and which NICs (CX-9, Thor3, Pollara 400) ship UEC support.

Why PFC is going away

For 10 years, the canonical recipe for "how do I run RoCE at scale?" was:

Mark with DSCP
Reserve a lossless queue with PFC
Tune ECN + DCQCN to keep PFC almost-never-firing
Hope nothing deadlocks

This still works at 4K–16K GPU scale. It is also fundamentally fragile, and every hyperscaler with a research budget is trying to escape it.

Failure mode	Cause	Real-world impact
Deadlock	Two switches pause each other in a cycle	Whole fabric region stops forwarding lossless traffic. Requires watchdog drops to recover.
Head-of-line blocking	One slow flow pauses an entire priority class	Innocent flows on the same priority freeze for tens of microseconds at a time.
PFC storms	Cascading PAUSE upstream	One bad receiver can backpressure 100 senders. Tail latency explodes.
All-or-nothing	PAUSE = 100% stop, RESUME = 100% go	No graceful rate control. Throughput oscillates badly.
L2 scope	PAUSE doesn't cross routed boundaries	Designs that want L3-routable lossless can't rely on PFC alone.
Headroom math	Buffer reservation grows with RTT × link speed	At 400G with 1 km cables, headroom alone consumes meaningful switch SRAM. At 800G, worse.
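The headroom row is easy to put numbers on. The sketch below is a back-of-the-envelope estimate, not a vendor formula: it assumes roughly 5 ns/m of propagation delay in fiber and that headroom has to absorb one round trip of in-flight data plus one MTU in each direction. The function name and the default MTU are illustrative; real switch calculators add terms for PFC response time and cell rounding.

```python
def pfc_headroom_bytes(link_gbps: float, cable_m: float, mtu_bytes: int = 4096,
                       ns_per_m: float = 5.0) -> float:
    """Rough per-port, per-priority headroom needed after a PAUSE is sent.

    Simplified: ignores PFC response time and buffer cell rounding.
    """
    # PAUSE travels upstream while data already on the wire keeps arriving.
    rtt_s = 2 * cable_m * ns_per_m * 1e-9
    bytes_per_s = link_gbps * 1e9 / 8
    in_flight = rtt_s * bytes_per_s
    # One MTU mid-transmission at the sender, one mid-reception at the receiver.
    return in_flight + 2 * mtu_bytes


for gbps in (100, 400, 800):
    kb = pfc_headroom_bytes(gbps, cable_m=1000) / 1e3
    print(f"{gbps}G over 1 km: ~{kb:.0f} KB per port per lossless priority")
```

Under these assumptions a single 400G port on a 1 km link needs about half a megabyte of headroom per lossless priority, and an 800G port about twice that. Multiply by port count and the reservation starts competing with the shared buffer the switch needs for everything else.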

PFC is a 2008-era solution. It was good enough when "scale" meant a 100-node converged-Ethernet cluster. At 100,000 GPUs on 400G/800G fabrics, every one of these failure modes becomes a real production incident.

Anti-pattern

Assuming "if it works at 16K it'll work at 100K." PFC + DCQCN at 16K GPUs is a tunable, well-understood stack. At 100K GPUs the same stack is a deadlock farm — headroom math blows up, PFC storms cascade across pods, and one slow receiver backpressures hundreds of senders. Pick your escape route before you sign the procurement order, not after the first PFC storm.

The field map at a glance

This is the at-a-glance map — read it first, then the sections below detail each stack.

Hyperscaler	Escape route from PFC	Transport	Congestion control	Switch / NIC	Scale	Status
Meta	Cell-switched fabric (VOQ + credits)	DSF cells, proprietary	Credit-based, no overshoot	Broadcom + Cisco / multi-vendor (Thor, Marvell, CX-7)	~90K GPU L3 region	In production (Oct 2025)
Google	Custom transport + delay-based CC	Falcon (HW RDMA) / RoCEv2 over Swift	Swift (RTT-based) + Falcon HW	Broadcom + Aquila / NVIDIA + Google custom	100K+ at peak	In production
NVIDIA	Adaptive routing on RoCEv2	RoCEv2 + Spectrum-X "RoCE+"	DCQCN, tightly NIC/switch-coupled	Spectrum-4 or Broadcom / CX-7, CX-8	DGX SuperPOD reference	Shipping reference
Microsoft	Stays on PFC, tunes it hard	RoCEv2 (standard)	DCQCN, aggressive 400G tuning	Broadcom (SONiC) / NVIDIA CX-7	100K+ across Azure	In production
AWS	Custom transport, no PFC	SRD (not RoCEv2), via EFA	End-host pacing, in-NIC re-tx	AWS Annapurna / Nitro NIC	Largest in cloud	In production since 2018
Alibaba	In-network telemetry CC	RoCEv2 + INT headers	HPCC (INT-based), near-zero PFC	Broadcom INT-capable / NVIDIA CX-6/7	~100K	In production
xAI	Adaptive routing, single-socket boxes	RoCEv2 + Spectrum-X	DCQCN + Spectrum-X	NVIDIA Spectrum-4 / CX-7	~200K at Memphis Colossus	In production
UEC (standard)	Open post-PFC standard	UET (replaces RoCEv2)	End-host pacing, multi-signal	Any open silicon / CX-9, Thor3, Pollara 400	Target: 1M GPUs	UEC 1.0 out, 1.1 mid-2026

The four escape architectures

Every hyperscaler now ships some variant of one of these:

Escape 1 — Smarter end-host CC, no PFC needed (delay-based or telemetry-based control)

Google Swift (RTT-based)
Alibaba HPCC (INT telemetry stamped by switches)
UEC (industry standard, multi-signal)

Escape 2 — Replace IP/Ethernet with a cell-switched fabric (VOQ + per-flow scheduling + credit-based, like InfiniBand)

Meta DSF (Disaggregated Scheduled Fabric)
Cisco SiliconOne (some deployments)
Broadcom Jericho3-AI

Escape 3 — Replace RoCEv2 with a custom transport (run RDMA on your own reliable protocol, not on PFC)

AWS EFA / SRD (Scalable Reliable Datagram)
Google Falcon (hardware RDMA transport)
UEC's UET (Ultra Ethernet Transport)

Escape 4 — Keep RoCEv2, add adaptive routing + tighter NIC/switch coupling

NVIDIA Spectrum-X (adaptive routing, packet spraying, per-flow congestion isolation)

The rest of this page walks each hyperscaler's stack and points out the surprising choices.

Swift (Google, SIGCOMM 2020)

Idea: End-to-end delay is the congestion signal. No switch help, no PFC, no ECN required.

   Sender                    Network                    Receiver
   ───────                   ────────                   ─────────
   t0: send packet  ──────── data ───────────────────→  t1: receive
   t3: ACK arrives  ←─────── ACK ────────────────────   t2: ack sent

   Sender computes:
       rtt            = t3 - t0
       endpoint_delay = t2 - t1   (echoed in the ACK)
       fabric_delay   = rtt - endpoint_delay
       if fabric_delay > target_delay → cut rate
       if fabric_delay < target_delay → grow rate

Two-loop control. Fabric delay (network queue depth) is one signal; endpoint delay (NIC + host stack) is a separate signal. Sender reacts independently to each, which lets it distinguish "switch queue filling" from "receiver overloaded."

Why it works: modern NICs timestamp packets in hardware to roughly 10 ns precision. You can measure RTT precisely enough to see queueing buildup before the queue overflows. By the time DCQCN's CNP would have fired, Swift's sender has already backed off.
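The two-loop idea fits in a few lines. This is a minimal sketch assuming the target-delay AIMD form (additive increase below target, multiplicative decrease proportional to the overshoot above it); it is not Google's implementation. The class name, the gain values, and the two target delays are illustrative assumptions, and the once-per-RTT limit on decreases is left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class DelayLoop:
    """One delay-based control loop over a congestion window (in packets)."""
    target_us: float
    cwnd: float = 10.0
    ai: float = 1.0        # additive increase per RTT (assumed value)
    beta: float = 0.8      # multiplicative-decrease gain (assumed value)
    max_mdf: float = 0.5   # never cut more than 50% in one step (assumed value)

    def on_ack(self, delay_us: float) -> float:
        if delay_us < self.target_us:
            # Spread +ai over roughly one window's worth of ACKs.
            self.cwnd += self.ai / self.cwnd
        else:
            # Cut in proportion to how far the delay overshoots the target.
            overshoot = (delay_us - self.target_us) / delay_us
            self.cwnd *= max(1.0 - self.beta * overshoot, 1.0 - self.max_mdf)
        self.cwnd = max(self.cwnd, 0.01)
        return self.cwnd


fabric = DelayLoop(target_us=25.0)      # reacts to switch queueing
endpoint = DelayLoop(target_us=100.0)   # reacts to NIC / host-stack delay


def swift_on_ack(rtt_us: float, endpoint_delay_us: float) -> float:
    """Update both loops from one ACK; the tighter window wins."""
    fabric_delay_us = rtt_us - endpoint_delay_us
    return min(fabric.on_ack(fabric_delay_us), endpoint.on_ack(endpoint_delay_us))
```

Keeping two independent windows and sending at the smaller one is what lets the sender tell "switch queue filling" apart from "receiver overloaded": only the loop whose delay crossed its target backs off.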

Result: 50 µs p99 latency at 100 Gbps under heavy load. No PFC events. Runs over commodity Ethernet.

Catch: requires hardware NIC timestamping (Mellanox CX-5+, Intel E810, Google's own silicon). And it's an end-host change — every sender in the fabric has to speak Swift.

Falcon (Google, announced at OCP Global Summit 2023)

Idea: don't run RoCEv2 at all. Build a hardware RDMA transport purpose-designed for the workload.

Falcon sits where RoCEv2 sits in your stack, but:

Reliable delivery is in hardware (custom silicon, like CX-7 but Google's design — also shipping in the Intel E2100 IPU)
Congestion control is Swift-style (delay-based), built into the NIC
No PFC required — uses end-host packet pacing instead
Packet ordering is relaxed — out-of-order delivery handled by the transport, not by the network
Supports both RDMA semantics (verbs) and message-passing semantics on the same transport

Falcon is what runs in Google's GPU fabric. The TPU pods use a different proprietary fabric (Aquila + ICI).

The Ultra Ethernet Consortium spec borrows heavily from Falcon's ideas — out-of-order delivery, end-host pacing, no PFC. UET is, in many ways, the open-standards version of Falcon.

HPCC (Alibaba, SIGCOMM 2019)

Idea: switches embed per-hop queue depth + utilization directly into packet headers (INT — In-band Network Telemetry). Sender reads this and computes the perfect rate.

   Sender ────packet [no INT data]────→ Switch1
                                          │
                                          │ stamp queue depth + util
                                          ↓
                                        Switch2
                                          │
                                          │ stamp queue depth + util
                                          ↓
                                       Receiver
   Sender ←────ACK with INT vector (echoed by Receiver)────

   Sender sees: max queue depth across path = 30%
                max link util across path = 65%
   → compute optimal rate in <1 RTT, no overshoot

Where DCQCN takes many CNPs to converge to the right rate (it probes with AIMD-style cuts and increases), HPCC sets the right rate in one round-trip because it has precise telemetry.
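To see why one round trip is enough, here is a simplified sketch of the window calculation in the spirit of the HPCC paper: estimate each hop's normalized utilization from its queue depth and egress rate, take the worst hop, and scale the window so that hop lands at a target utilization η. The class and function names, the default η of 0.95, and the omission of the paper's smoothing and per-ACK versus per-RTT bookkeeping are all assumptions made for brevity.

```python
from dataclasses import dataclass


@dataclass
class HopINT:
    qlen_bytes: float    # queue depth stamped by the switch
    tx_rate_bps: float   # measured egress throughput on that hop
    link_bps: float      # link capacity


def hpcc_window(hops, ref_window_bytes, base_rtt_s, eta=0.95, w_ai_bytes=0.0):
    """New window = reference window scaled by (eta / worst-hop utilization)."""
    def utilization(h):
        # Queue term: standing queue expressed as a fraction of one BDP.
        queue_term = h.qlen_bytes * 8 / (h.link_bps * base_rtt_s)
        return queue_term + h.tx_rate_bps / h.link_bps

    u_max = max(utilization(h) for h in hops)
    return ref_window_bytes / (u_max / eta) + w_ai_bytes


# Two hops on a 400G path with a 10 us base RTT (BDP = 500 KB).
hops = [
    HopINT(qlen_bytes=0, tx_rate_bps=0.50 * 400e9, link_bps=400e9),
    HopINT(qlen_bytes=150_000, tx_rate_bps=0.65 * 400e9, link_bps=400e9),
]
print(hpcc_window(hops, ref_window_bytes=500_000, base_rtt_s=10e-6))
```

The sender never has to probe: a hop above η shrinks the window in proportion, a path below η grows it in proportion, and both happen in a single update.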

Result: sub-millisecond convergence. Near-zero PFC events. Better tail latency than DCQCN.

Catch: requires INT-capable switches. Tomahawk-3 and later support this. Older fabrics can't speak it.

Meta DSF: Disaggregated Scheduled Fabric (announced Oct 2025)

Idea: throw out the Ethernet packet-switching model entirely for the AI fabric. Replace with a cell-switched, virtual-output-queued, credit-based fabric — basically InfiniBand wearing Ethernet clothes.

   Traditional Ethernet (per-packet, ECMP-routed)
   ──────────────────────────────────────────────
   Sender → switch decides hop-by-hop where to send
   Each hop independently picks next link via hash
   Queue buildup → PFC → potential deadlock

   Meta DSF (cell-switched, end-to-end scheduled)
   ──────────────────────────────────────────────
   Sender's NIC chops the packet into fixed-size CELLS
   Each cell is independently scheduled across the fabric
   Receiver reassembles cells in order
   No buffer buildup → no PFC needed
   Cells get distributed across ALL paths simultaneously (perfect ECMP)

The pieces:

Cells — fixed-size chunks (typically 256–512 bytes), not variable packets
VOQ (Virtual Output Queue) — sender holds cells in per-destination queues; receiver pulls when ready
Credit-based flow control — like InfiniBand. Receiver grants credits, sender only sends what it has credit for. Zero overshoot, ever
OCP-SAI + FBOSS — Meta's open switch software stack drives the fabric
Scale claim — 18K GPUs in a single L2 zone, ~90K GPUs in an L3 region
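The first three pieces (cells, VOQ, credits) can be sketched together. This is a toy model under stated assumptions, not Meta's or Broadcom's implementation: the 256-byte cell size is taken from the low end of the range above, and the class name, the round-robin link choice, and the `grant` call standing in for the egress scheduler are all hypothetical.

```python
import math
from collections import deque


def to_cells(packet_len: int, cell_size: int = 256) -> list[int]:
    """Split a packet into fixed-size cells; the last cell may be partly filled."""
    if packet_len <= 0:
        raise ValueError("packet_len must be positive")
    n = math.ceil(packet_len / cell_size)
    return [cell_size] * (n - 1) + [packet_len - cell_size * (n - 1)]


class CreditedVOQ:
    """Per-destination queue that may only transmit cells it holds credits for."""

    def __init__(self):
        self.queue = deque()
        self.credits = 0

    def enqueue(self, cells):
        self.queue.extend(cells)

    def grant(self, n: int):
        # Issued by the egress side only when it has room for n more cells.
        self.credits += n

    def drain(self, num_links: int):
        """Send as many cells as credits allow, round-robin across fabric links."""
        sent = [[] for _ in range(num_links)]
        i = 0
        while self.queue and self.credits > 0:
            sent[i % num_links].append(self.queue.popleft())
            self.credits -= 1
            i += 1
        return sent


voq = CreditedVOQ()
voq.enqueue(to_cells(1500))   # one 1500-byte packet becomes 6 cells
voq.grant(4)                  # egress only has room for 4 cells right now
print(voq.drain(num_links=4), "cells still queued:", len(voq.queue))
```

Nothing leaves the sender without a credit, so the fabric never holds more than the egress asked for. That is where "zero overshoot" comes from, and it is why no PAUSE frame is ever needed.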

The kicker: this is very close to how Cray's Slingshot and NVIDIA's NVLink scale-up fabrics work. Meta is essentially saying "Ethernet at AI scale needs to become InfiniBand-shaped while keeping the open ecosystem."

Ultra Ethernet Consortium (UEC)

UEC is the Linux Foundation-hosted effort to standardize the post-PFC era of Ethernet for AI. Members include AMD, Broadcom, Cisco, HPE, Intel, Meta, Microsoft, Oracle, and NVIDIA (a late joiner).

UEC 1.0 (published 2025):

UET (Ultra Ethernet Transport) — replaces RoCEv2 as the wire protocol
End-host packet pacing — mandatory; no PFC reliance
Out-of-order delivery — like Falcon; receiver reorders in hardware
Multi-path transmission — a single flow uses multiple paths simultaneously (packet spraying)
Backward-compatible substrate — UET runs on standard Ethernet switches; it's an endpoint change

UEC is the shared bet of nearly everyone who has not built a fully custom transport of their own. NVIDIA's ConnectX-9 and Spectrum-5 will speak UEC. Broadcom Thor3 + Tomahawk-6 already do. Microsoft is on the standards committee. UEC 1.1, expected mid-2026, finalizes the on-the-wire spec.

Per-vendor deep-dives

Meta — DSF, FBOSS, multi-vendor NICs

Public stance: "The AI fabric needs a redesign. Standard L3 Clos + RoCE doesn't scale to our footprint."

Layer	What Meta runs
Switch OS	FBOSS (open, in-house, forked from earlier work in 2015)
Switch ASIC	Broadcom Tomahawk family + Cisco SiliconOne
Fabric architecture	DSF: cells + VOQ + credit-based CC
Lossless mechanism	Credit-based (no PFC at the fabric)
AI NIC vendors	Multi-vendor: Broadcom Thor, Marvell Octeon, NVIDIA CX-7 in some zones
Topology scale	18K GPUs L2 zone, ~90K L3 region
Older designs	Standard RoCE on rail-Clos (still in production for non-AI)

Why multi-vendor NICs: Meta's volume justifies dual-sourcing for cost negotiation. It also de-risks against a single-vendor outage or supply shock.

Why FBOSS over SONiC: Meta built its own first (FBOSS, 2015). SONiC came later (Microsoft, 2016). FBOSS is more deeply integrated with Meta's network management plane, and switching costs are now prohibitive.

The surprise: DSF is a generation ahead of standard rail-Clos + PFC. Meta is publicly betting that 100K+ GPU fabrics can't be done well on packet-switched Ethernet, and is putting cell-switched VOQ into production to prove it.

Google — Swift, Falcon, Aquila, custom everything

Public stance: "Why use standard RoCEv2 when we can build our own transport that's strictly better?"

Layer	What Google runs
General DC RDMA CC	Swift (delay-based, no PFC)
AI fabric transport	Falcon (custom HW RDMA transport, also shipping on Intel E2100 IPU)
ECMP collision avoidance	PLB (Protective Load Balancing): host-side; when a connection sees sustained congestion signals, the sender changes its flow label so ECMP rehashes it onto a different path
Switch OS	Custom (Stratum + their own management)
Switch ASIC	Mostly Broadcom + Google custom (Aquila for TPU fabrics)
GPU fabric	NVLink for scale-up (DGX), Falcon for scale-out
TPU fabric	Custom 3D-torus optical via Aquila + ICI

PLB is worth a deeper look. Standard ECMP just hashes the 5-tuple and lives with whatever collisions result. PLB watches congestion signals at the sending host (ECN marks, retransmission timeouts) and, when a connection stays congested, re-paths it by changing the flow label that switches feed into their ECMP hash. It plays a role similar to switch-side dynamic load balancing such as Arista's DLB, but it needs no switch changes.
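The collision problem that static ECMP "lives with" is easy to reproduce. The sketch below hashes a handful of long-lived flows onto equal-cost uplinks and counts how many land on each. CRC32 stands in for the switch's hash function, which is an assumption (real ASICs use their own hash and seed), and the addresses and ports are made up; 4791 is the RoCEv2 UDP destination port.

```python
import zlib
from collections import Counter


def ecmp_path(five_tuple: tuple, num_paths: int) -> int:
    """Static ECMP: hash the 5-tuple, pick a path, never reconsider."""
    key = "|".join(map(str, five_tuple)).encode()
    return zlib.crc32(key) % num_paths


def path_load(flows, num_paths):
    """Count how many flows the hash placed on each path."""
    return Counter(ecmp_path(f, num_paths) for f in flows)


# 8 elephant flows, (src_ip, dst_ip, src_port, dst_port, proto), over 8 uplinks.
flows = [("10.0.0.%d" % i, "10.0.1.%d" % i, 49152 + i, 4791, "UDP") for i in range(8)]
load = path_load(flows, num_paths=8)
print("flows per uplink:", dict(load))
print("idle uplinks:", 8 - len(load), "| busiest uplink carries", max(load.values()), "flows")
```

With only a few large flows there is little statistical averaging to hide behind, so any uplink that receives two flows while another sits idle translates directly into lost collective bandwidth. Re-pathing schemes exist to break exactly those collisions.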

The Aquila chip is Google's custom AI fabric ASIC for TPU pods. It speaks a torus-style routing protocol, not standard Ethernet at all. Out of scope for RoCE-land, but worth knowing it exists.

The surprise: Google built Falcon because they own the entire stack down to the application code (Pathways, JAX). They can afford to break NIC-side ecosystem compatibility because their training framework is in-house. Almost no one else can.

NVIDIA reference — DGX SuperPOD + Spectrum-X + ConnectX

Public stance: "Here's the reference design that works. Everyone else is some variant of this."

Layer	NVIDIA reference (DGX SuperPOD)
GPU	H100 / B100 / B200 / B300
Scale-up fabric	NVLink (NV18 NVSwitch, 900 GB/s per GPU)
Scale-out fabric NIC	ConnectX-7 (400G) / ConnectX-8 (800G)
Switch	Spectrum-4 (NVIDIA) or merchant (Arista/Cisco/Dell on Broadcom)
Switch OS	Cumulus Linux (NVIDIA) or SONiC
Transport	RoCEv2 with DCQCN + PFC
Collective lib	NCCL + DOCA-OFED
Recommended topology	Rail-optimized Clos (3-tier for >2K GPUs)

Spectrum-X is NVIDIA's competitor to Falcon and DSF, and its answer to "everyone else is fleeing PFC." Spectrum-X adds:

Adaptive routing (packet spraying across all paths)
Per-flow congestion isolation
Tighter DCQCN coupling between NIC and switch
"RoCE+", a slight variant on standard RoCEv2

The surprise: NVIDIA isn't fleeing PFC — they're doubling down on it but making the NIC+switch tightly coupled so PFC almost never has to fire. The bet is that the operational simplicity of "still looks like RoCEv2" beats Falcon/DSF's clean-room redesigns for the typical buyer.

xAI's Colossus reportedly uses Spectrum-X end-to-end.

xAI — single-socket boxes, brutally aggressive scale

Public stance (from public talks and papers by Igor Babuschkin, Greg Yang, and others): "Optimize for scale, latency, and operational simplicity. Cut anything that gets in the way."

Choice	xAI approach
Server form factor	Single-socket boxes (avoid NUMA crossings)
GPUs	H100 / H200 / B200, scaled to ~200K GPUs at Memphis Colossus
Fabric	Reportedly Spectrum-X at Memphis
NIC	ConnectX-7 / Spectrum-X coupled
Cooling	Liquid-cooled rear-door heat exchangers
Network ops team	Tiny — rumored ~10 people for the entire Colossus network
Time-to-deploy	Memphis built in months, not quarters

xAI runs single-socket because they refuse to pay the NUMA tax. Dual-socket boxes are cheaper per GPU but introduce NUMA-pinning complications — every GPU/NIC has to be mapped to the right CPU socket, every collective has to respect the topology, every misconfiguration costs throughput. Single-socket eliminates the problem entirely. Every GPU and every NIC is on one CPU. Software complexity drops.
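On a dual-socket box, the first thing to check is whether each GPU and the NIC it is paired with hang off the same socket. Linux exposes this per PCI device in sysfs as `numa_node`. The sketch below reads that attribute and flags mismatched pairs; the function names and the PCI addresses in the usage note are hypothetical, and the GPU-to-NIC pairing is something you would supply from your own topology.

```python
from pathlib import Path


def numa_node(pci_addr: str, sysfs_root: str = "/sys") -> int:
    """Read the NUMA node Linux reports for a PCI device (-1 means no affinity)."""
    path = Path(sysfs_root) / "bus" / "pci" / "devices" / pci_addr / "numa_node"
    return int(path.read_text().strip())


def cross_socket_pairs(gpu_to_nic: dict, sysfs_root: str = "/sys") -> list:
    """Return (gpu, nic) pairs whose PCI devices sit on different NUMA nodes."""
    bad = []
    for gpu, nic in gpu_to_nic.items():
        if numa_node(gpu, sysfs_root) != numa_node(nic, sysfs_root):
            bad.append((gpu, nic))
    return bad
```

Called as `cross_socket_pairs({"0000:17:00.0": "0000:18:00.0"})` on a real host, an empty list means that GPU's RDMA traffic never has to cross the inter-socket link. On a single-socket box the list is empty by construction, which is the whole point of the design.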

The trade: single-socket boxes have less aggregate CPU and RAM per GPU. If your training workload is GPU-bound (most are), the CPU and RAM you give up don't matter. If you're doing heavy data-loader work or CPU-side preprocessing, single-socket starts to bottleneck.

The surprise: xAI moved fast. They built ~100K GPUs in Memphis in under a year by ruthlessly cutting anything that wasn't on the throughput-per-week critical path. The single-socket bet is a great example of choosing operational simplicity over per-rack cost optimization.

Microsoft Azure — the canonical DCQCN-at-400G shop

Public stance: "Standard RoCEv2 + DCQCN works, but you have to tune it precisely at every generation."

Layer	What Azure runs
Switch ASIC	Mostly Broadcom (Tomahawk-3/4)
Switch OS	SONiC (Microsoft authored it!)
GPU instance type	ND H100 v5, ND H200 v5, etc.
SR-IOV	Yes — every VM gets a passthrough VF
NIC	ConnectX-7 (CX-6 in older SKUs)
Transport	RoCEv2 + DCQCN + PFC
Tuning	Published, aggressive 400G tuning

The Microsoft DCQCN papers are the canonical references for tuning DCQCN at 400G. The headline insights:

ECN Kmin/Kmax need to scale with link bandwidth — at 400G, target Kmin ≈ 1 MB, Kmax ≈ 5 MB
DCQCN's Rai (additive increase rate) is too slow at 400G defaults — needs to be larger
T_active and T_hai (the "how long without congestion before speeding up" timers) need to be shorter at 400G — congestion signals arrive faster
CNP rate-limiting on receivers is critical to prevent CNP floods during PFC events
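The first insight, Kmin/Kmax scaling with bandwidth, is easiest to see from the marking curve itself. The sketch below assumes the usual RED-style ECN profile (no marking below Kmin, a linear ramp up to Pmax at Kmax, mark everything above Kmax) and uses the 1 MB / 5 MB figures quoted above as defaults. The Pmax value and both function names are illustrative assumptions, not a published Azure setting.

```python
def ecn_mark_probability(queue_bytes: float, kmin: float = 1e6, kmax: float = 5e6,
                         pmax: float = 0.1) -> float:
    """RED-style ECN marking: 0 below Kmin, linear ramp to Pmax, 1 at or above Kmax."""
    if queue_bytes <= kmin:
        return 0.0
    if queue_bytes >= kmax:
        return 1.0
    return pmax * (queue_bytes - kmin) / (kmax - kmin)


def scale_thresholds(kmin_ref: float, kmax_ref: float, ref_gbps: float, new_gbps: float):
    """Scale Kmin/Kmax linearly with link speed so they represent the same queueing delay."""
    factor = new_gbps / ref_gbps
    return kmin_ref * factor, kmax_ref * factor


for q in (0.5e6, 2e6, 4e6, 6e6):
    print(f"queue {q / 1e6:.1f} MB -> mark probability {ecn_mark_probability(q):.3f}")
```

Thresholds are really delays in disguise: a 1 MB queue drains four times faster at 400G than at 100G. Carrying a 100G-era Kmin forward unchanged means marking starts at a quarter of the queueing delay you tuned for, which is why the numbers have to be revisited every generation.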

The surprise: Microsoft owns SONiC and the canonical DCQCN tuning playbook. At 100K+ GPUs across Azure, they have proven that standard RoCEv2 can be made to work — provided you treat DCQCN tuning as a per-generation engineering exercise, not a one-time config.

AWS — EFA, SRD, custom Annapurna silicon

Public stance: "RoCEv2 is fine for everyone else. We have custom silicon, so we're going to build something better."

Layer	What AWS runs
NIC	Annapurna Nitro NIC (custom AWS silicon)
RDMA library	EFA (Elastic Fabric Adapter) — exposes libfabric, not verbs
Transport	SRD (Scalable Reliable Datagram) — NOT RoCEv2
Multi-path	SRD sprays packets across multiple paths natively
Ordering	SRD allows out-of-order delivery to the app
Lossless	No PFC required — SRD handles re-transmits in-NIC at sub-µs
Per-VM passthrough	Yes: EFA-enabled EC2 instance types expose dedicated EFA devices to the instance
NCCL plugin	aws-ofi-nccl translates NCCL → libfabric → EFA

SRD's design is worth understanding because it's the clearest preview of where the industry is going.

   Traditional RoCEv2 + ECMP
   ─────────────────────────
   Flow F  →  hash(5-tuple)  →  ALWAYS path A
   Path A congested → flow F stalls
   Other paths (B, C, D) sit idle
   Per-flow ordering preserved (good)
   Per-flow throughput limited to one path's capacity

   SRD (AWS)
   ─────────
   Flow F  →  packets sprayed across paths A, B, C, D
   Each packet takes the least-congested path
   Receiver reassembles out-of-order packets
   Per-flow throughput = SUM of all paths' available capacity
   App may see out-of-order delivery (must tolerate)

This is the same idea as UEC's multi-path mode and Meta DSF's cell spraying. SRD shipped years before UEC standardized it.
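The price of spraying is that something has to put packets back in order. Here is a minimal sketch of that reassembly step using sequence numbers. It is not AWS's implementation: SRD itself can hand packets up out of order and leave ordering to the layer above, and the class and method names here are hypothetical.

```python
class ReorderBuffer:
    """Hold out-of-order packets and release them in sequence-number order."""

    def __init__(self):
        self.next_seq = 0
        self.pending = {}

    def receive(self, seq: int, payload: bytes) -> list:
        """Accept one packet; return every payload that is now deliverable in order."""
        if seq < self.next_seq or seq in self.pending:
            return []  # duplicate (for example a retransmit), drop it
        self.pending[seq] = payload
        ready = []
        while self.next_seq in self.pending:
            ready.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return ready


rb = ReorderBuffer()
for seq in (1, 0, 3, 2):  # arrival order after spraying over several paths
    print(seq, "->", rb.receive(seq, f"pkt{seq}".encode()))
```

The buffer only grows while a gap is outstanding, so its size is bounded by how far the fastest path can run ahead of the slowest one. That bound is what makes doing the reassembly in NIC hardware practical.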

The catch: EFA isn't compatible with standard libibverbs. NCCL works via the aws-ofi-nccl plugin which translates NCCL → libfabric → EFA. Most code that "just works" on RoCEv2 verbs needs porting to libfabric, not just a recompile, to use EFA. AWS bears that ecosystem cost willingly because they control the platform.

The surprise: AWS happily broke verbs API compatibility to ship SRD in 2018. They were the first hyperscaler to fully exit RoCEv2, and SRD had been in production at the largest cloud scale for years before UEC existed.

Alibaba — HPCC + in-network telemetry

Public stance: "Switches know more about congestion than endpoints. Let them help."

Layer	What Alibaba runs
Transport	RoCEv2 (largely)
CC	HPCC (INT-based)
Lossless	Near-zero PFC events claimed in production
Switch ASIC	Tomahawk 3/4 + their own Hanguang silicon for inference fabric
Switch OS	Mixed; some SONiC, some proprietary
Cloud RDMA	eRDMA for tenant RDMA in VPC

HPCC details (from the SIGCOMM 2019 paper):

Switches stamp per-hop queue depth, link utilization, and timestamp into INT-capable headers
Receiver echoes the INT vector back in the ACK
Sender's algorithm uses INT data to compute "optimal window size" directly from the most-congested hop on path
Convergence: 1 RTT to the optimal rate (vs DCQCN's 5–10 RTTs)
Production at Alibaba: PFC pause counters drop by orders of magnitude vs DCQCN baseline

The surprise: HPCC is the proof point that INT-based CC works in production at scale. It pre-dates Swift by a year and Falcon by several. UEC's multi-signal CC is in many ways an industrial generalization of HPCC's design philosophy.

Anthropic — pragmatic, less-documented, multi-cloud + Trainium

Public stance (mostly inferred — Anthropic publishes less than peers): "Use the best available silicon, optimize the application stack, treat the network as infrastructure not differentiation."

Layer	What Anthropic uses
GPU types	NVIDIA H100, B200 + AWS Trainium2/3
Where the compute lives	Heavily on AWS (EFA + Trainium) + GCP (NCCL + RoCE)
Network stack on AWS workloads	EFA + SRD via aws-ofi-nccl
Network stack on GCP workloads	Standard NCCL + RoCEv2 (via GCP A3 ultra instances)
Training framework	Internal (Claude training stack; JAX for some, PyTorch for others)

What we can infer:

Anthropic's network engineers spend more time on storage I/O, dataloader pipelines, and checkpoint replication than on RoCE tuning. The network largely "just works" because they ride cloud providers' battle-tested fabrics.
They have strong opinions on interpretability of failures — they favor systems where every failure mode can be diagnosed end-to-end. That biases them toward NCCL + RoCEv2 (well-known failure modes) over EFA (newer, harder to debug from the application side) — except where AWS economics force EFA.
Their public emphasis on safety and reliability of training translates to networking decisions favoring conservatism over peak throughput. Train Claude correctly at 95% of theoretical bandwidth; never train Claude unreliably at 110%.

The surprise: Anthropic publishes less about networking than Google or Meta because much of their differentiation is in how they use the network, not in custom network hardware. They're a vendor-fabric-characterization shop, not a custom-fabric-design shop.

The big comparison matrix

Side-by-side, the entire design space:

Org	Transport	CC algo	Lossless mechanism	Multi-path	Switch silicon	NIC vendor	AI scale
Meta DSF (Oct 2025)	Cell-switched proprietary	Credit-based	NO PFC (credits)	Per-cell spray (perfect ECMP)	Broadcom + Cisco (FBOSS)	Multi-vendor: Broadcom Thor, Marvell, NVIDIA	~90K L3 region
Google Swift + Falcon	RoCEv2 over Swift OR custom Falcon	Delay-based + Falcon HW	NO PFC (host pacing)	PLB (ECMP+)	Broadcom + Google Aquila	NVIDIA + Google custom (Falcon HW)	100K+ at peak
NVIDIA reference	RoCEv2 + Spectrum-X RoCE+	DCQCN (tunable)	PFC + ECN	Spectrum-X adaptive routing	Spectrum-4 or Broadcom	CX-7 / CX-8	DGX SuperPOD reference
xAI Colossus	RoCEv2 + Spectrum-X	DCQCN + Spectrum-X	PFC + ECN	Spectrum-X adaptive routing	NVIDIA Spectrum-4	NVIDIA CX-7	~200K GPUs
Microsoft Azure	RoCEv2 (standard)	DCQCN, 400G-tuned	PFC + ECN	5-tuple ECMP	Broadcom (SONiC)	NVIDIA CX-7	100K+ across Azure
AWS EFA	SRD (custom)	End-host pacing	NO PFC (host re-tx)	Per-packet spray	AWS custom Annapurna	AWS Nitro / Annapurna	Largest in cloud
Alibaba HPCC	RoCEv2 + INT headers	HPCC (INT-based)	Near-zero PFC	5-tuple ECMP	Broadcom (INT-capable)	NVIDIA CX-6/7	~100K
Anthropic	Mix: EFA (AWS) + RoCEv2 (GCP/own iron) + Trainium fabric	Inherits from cloud	Inherits from cloud	Inherits from cloud	Inherits from cloud	Mix	Renting at hyperscale
UEC future (industry)	UET (replaces RoCEv2)	End-host pacing + multi-signal	NO PFC (pacing)	Per-packet spray (default)	Any (open standard)	Any (open standard)	Target: 1M GPUs

Four patterns jump out:

PFC is on the way out. Every "future" design (DSF, Falcon, EFA, UEC) is PFC-free. The DCQCN+PFC stack is the conservative, well-understood option that buys time until UEC arrives.
Multi-path is the new default. Everyone except the pure-DCQCN shops (Azure, Alibaba's RoCE side) is moving toward per-packet or per-cell spraying. Single-path 5-tuple ECMP is the relic.
NVIDIA NIC dominance is real but cracking. AWS uses Annapurna. Meta runs multi-vendor (Broadcom Thor, Marvell, NVIDIA). Google has its own silicon (Falcon HW, Aquila). Microsoft Azure and xAI are still pure NVIDIA-NIC shops.
Custom transport is the differentiator. Google (Falcon), AWS (SRD), Meta (DSF), and eventually UEC all replace the wire protocol. RoCEv2 is the laggard — well-loved, but technically the oldest design in the matrix.

The UEC / NIXL / Pensando / Thor3 roadmap

Near-term shifts on a 6–24 month horizon, grouped by standard and vendor.

Ultra Ethernet Consortium (UEC) — 12–24 month horizon

UEC 1.0 published. UEC 1.1, expected mid-2026, finalizes the on-the-wire spec.

NIC silicon supporting UEC:

NVIDIA ConnectX-9 (rumored 2026) — first NVIDIA UEC NIC
Broadcom Thor3 (shipping 2025) — already UEC-aware
AMD Pensando Pollara 400 (2024–2025) — UEC-first design

Switch silicon supporting UEC:

Broadcom Tomahawk-6 — first UEC-class switching ASIC
NVIDIA Spectrum-5 — likely UEC-compatible alongside RoCE+

What changes when UEC arrives:

DCQCN goes away. Replaced by UET's end-host pacing.
PFC dependency eliminated. Switches won't need lossless queue config in the same way.
ECMP collision becomes a non-issue (UET sprays packets natively).
Per-host source routing becomes mostly irrelevant — UET handles per-flow source routing implicitly.

This is a host-stack rewrite. The way you reason about a RoCEv2 fabric does not survive UEC adoption.

NVIDIA NIXL — 6–12 month horizon

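NIXL = NVIDIA Inference Xfer Library. It is a point-to-point transfer library, not a collective library: it targets inference frameworks such as NVIDIA Dynamo and abstracts over memory and storage types (GPU memory, host memory, storage) behind pluggable backends.

The idea: disaggregated inference moves KV-cache blocks between prefill and decode workers, and that traffic is point-to-point rather than all-reduce-shaped. NIXL gives the framework one asynchronous transfer API over whichever backend is available underneath.

Becomes relevant as inference serving moves to disaggregated prefill/decode. For the fabric, it means large point-to-point RDMA flows sharing links with training collectives.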

AMD Pensando + AMD Pollara

AMD bought Pensando in 2022. The Pollara 400 NIC is their first AI-targeting SmartNIC. It's UEC-native, supports both RDMA and SmartNIC functions (security offload, encryption, telemetry).

The relevance: AMD as a second-source NIC is increasingly viable. Even for shops that don't switch from NVIDIA, having a credible alternative in the procurement conversation is worth real money.

Intel IPU + Intel Mount Evans

Intel's IPU line (Mount Evans, sold as the IPU E2000) is currently weaker than NVIDIA/AMD in AI fabric features. Unlikely to be a real contender for B300-generation deployments; watch the next generation. Note that Intel's E2100 IPU is silicon that Google's Falcon transport ships on, which is a meaningful Falcon-ecosystem signal.

Broadcom Thor3 / Thor4

Already shipping in production at Meta and others. Native UEC, multi-path, INT support. The real choice for any operator that wants to escape single-vendor NIC dependency without going to AMD.

💡 What you should remember

#		Concept	Why it matters
1	🚪	Four escape routes from PFC	Smarter end-host CC (Swift, HPCC), cell-switched fabrics (Meta DSF), custom transports (Falcon, SRD, UET), adaptive routing on RoCEv2 (Spectrum-X).
2	📏	PFC + DCQCN works at 4K–16K GPUs	Cracks at 32K+, breaks at 100K+. Every 100K+ deployment has escaped, built around, or paired it with adaptive routing.
3	🛠️	Microsoft proves DCQCN can scale	But only as a per-generation engineering exercise. 400G tuning is not a one-time config — it's a quarterly project.
4	🌊	AWS SRD is the preview of where it's going	Packet spraying + out-of-order delivery + no PFC + hardware reassembly. UEC is the open-standards version of the same idea.
5	🎯	xAI's single-socket bet is underrated	Eliminate NUMA. Worth more than the CPU/RAM you give up — if the workload is GPU-bound (most AI training is).
6	🔄	UEC is a host-stack rewrite	Not a config change. When CX-9 / Thor3 / Pollara ship in volume, DCQCN tuning becomes partly obsolete and UET pacing takes over.
7	🗺️	The matrix in one line	NVIDIA evolves RoCEv2 (Spectrum-X) · Google + AWS replaced it (Falcon, SRD) · Meta replaced the fabric (DSF) · Alibaba + Microsoft squeeze the last out · UEC is where everyone meets in 2027.

Next: Switch QoS → the switch-side configuration that makes any of these stacks actually deliver: DSCP-to-TC mapping, ECN watermarks, PFC headroom, buffer carving. These are the knobs you turn whether you're running plain RoCEv2 + DCQCN or pairing it with adaptive routing.
