
Hyperscaler RoCE Stacks

The previous pages laid out the design space — transports, congestion control algorithms, the curriculum's pick. This page is the field map: what every major hyperscaler actually ships in production, and why.

If you sat down with a network engineer from Meta or AWS tomorrow, half their vocabulary would be different. They'd talk about VOQ, cells, Falcon, EFA, SRD, INT, Swift, DSF, UEC. They aren't doing magic. They're just at a different point on the same design space, optimizing for a different scale, a different cost curve, a different customer.

By the end of this page you should know what each hyperscaler ships, why they ship it, and where the industry is heading.


Why PFC is going away

For 10 years, the canonical recipe for "how do I run RoCE at scale?" was:

  1. Mark with DSCP
  2. Reserve a lossless queue with PFC
  3. Tune ECN + DCQCN to keep PFC almost-never-firing
  4. Hope nothing deadlocks
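Step 1 of the recipe can be made concrete with a small sketch of how a DSCP value sits inside the IP TOS byte. This is only an illustration: on a real RoCE deployment the NIC marks packets in hardware, and the DSCP value 26 and the helper names here are assumptions, not something the recipe above specifies.

```python
import socket
from typing import Tuple

# Assumption: 26 is used purely as an example DSCP value for the lossless class.
ROCE_DSCP = 26


def dscp_to_tos(dscp: int) -> int:
    """DSCP is the upper 6 bits of the TOS byte; the lower 2 bits carry ECN."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be in 0..63")
    return dscp << 2


def tos_to_dscp_ecn(tos: int) -> Tuple[int, int]:
    """Split a TOS byte back into (DSCP, ECN) fields."""
    return tos >> 2, tos & 0b11


def make_marked_udp_socket(dscp: int = ROCE_DSCP) -> socket.socket:
    """Open a UDP socket whose outgoing packets carry the given DSCP."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp_to_tos(dscp))
    return sock
```

The switch then maps that DSCP to a traffic class, and the same byte's ECN bits are what step 3 relies on.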

This still works at 4K–16K GPU scale. It is also fundamentally fragile, and every hyperscaler with a research budget is trying to escape it.

| Failure mode | Cause | Real-world impact |
| --- | --- | --- |
| Deadlock | Two switches pause each other in a cycle | Whole fabric region stops forwarding lossless traffic. Requires watchdog drops to recover. |
| Head-of-line blocking | One slow flow pauses an entire priority class | Innocent flows on the same priority freeze for tens of microseconds at a time. |
| PFC storms | Cascading PAUSE upstream | One bad receiver can backpressure 100 senders. Tail latency explodes. |
| All-or-nothing | PAUSE = 100% stop, RESUME = 100% go | No graceful rate control. Throughput oscillates badly. |
| L2 scope | PAUSE doesn't cross routed boundaries | Designs that want L3-routable lossless can't rely on PFC alone. |
| Headroom math | Buffer reservation grows with RTT × link speed | At 400G with 1 km cables, headroom alone consumes meaningful switch SRAM. At 800G, worse. |
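The headroom row is easy to check with arithmetic. The sketch below is a simplified estimate, assuming roughly 5 ns/m propagation delay in fiber and one MTU of slack in each direction. Real vendor formulas add PFC response time and interface delays, so treat the function name, defaults, and result as illustrative only.

```python
def pfc_headroom_bytes(link_gbps: float, cable_m: float,
                       mtu_bytes: int = 4096, ns_per_m: float = 5.0) -> int:
    """Rough per-port, per-priority headroom: bytes in flight during one
    cable round trip, plus one MTU already being sent in each direction."""
    rtt_ns = 2 * cable_m * ns_per_m           # out and back over the cable
    in_flight_bytes = link_gbps * rtt_ns / 8  # Gbit/s x ns = bits
    return round(in_flight_bytes) + 2 * mtu_bytes


if __name__ == "__main__":
    for gbps in (100, 400, 800):
        print(f"{gbps}G, 1 km: {pfc_headroom_bytes(gbps, 1000)} bytes")
```

Under these assumptions a single 400G port with a 1 km cable needs about half a megabyte per lossless priority, and the figure doubles at 800G.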

PFC is a 2008-era solution. It was good enough when "scale" meant a 100-node FCoE or early RoCE cluster. At 100,000 GPUs on 400G/800G fabrics, every one of these failure modes becomes a real production incident.


The four escape architectures

Every hyperscaler now ships some variant of one of these:

Escape 1 — Smarter end-host CC, no PFC needed (delay-based or telemetry-based control)

  • Google Swift (RTT-based)
  • Alibaba HPCC (INT telemetry stamped by switches)
  • UEC (industry standard, multi-signal)

Escape 2 — Replace IP/Ethernet with a cell-switched fabric (VOQ + per-flow scheduling + credit-based, like InfiniBand)

  • Meta DSF (Disaggregated Scheduled Fabric)
  • Cisco SiliconOne (some deployments)
  • Broadcom Jericho3-AI

Escape 3 — Replace RoCEv2 with a custom transport (run RDMA on your own reliable protocol, not on PFC)

  • AWS EFA / SRD (Scalable Reliable Datagram)
  • Google Falcon (hardware RDMA transport)
  • UEC's UET (Ultra Ethernet Transport)

Escape 4 — Keep RoCEv2, add adaptive routing + tighter NIC/switch coupling

  • NVIDIA Spectrum-X (adaptive routing, packet spraying, per-flow congestion isolation)

The rest of this page walks each hyperscaler's stack and points out the surprising choices.


Swift (Google, SIGCOMM 2020)

Idea: End-to-end delay is the congestion signal. No switch help, no PFC, no ECN required.

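Sender                      Network                      Receiver
──────                      ───────                      ────────
t0: send packet ───────────────────────────────────→ t1: receive
t3: ACK arrives ←────────────── ACK ──────────────── t2: ACK sent

Sender computes:
  rtt            = t3 - t0
  endpoint_delay = t2 - t1   (reported by the receiver in the ACK)
  fabric_delay   = rtt - endpoint_delay
  if fabric_delay > target_delay → cut rate
  if fabric_delay < target_delay → grow rate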
Two-loop control. Fabric delay (network queue depth) is one signal; endpoint delay (NIC + host stack) is a separate signal. Sender reacts independently to each, which lets it distinguish "switch queue filling" from "receiver overloaded."
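A minimal sketch of the two-loop idea follows, assuming a simple additive-increase / multiplicative-decrease rule per loop. The class name, targets, and gains are hypothetical. The published Swift algorithm has more to it (for example, scaling the target with path length and limiting how often the window is cut), so read this as a toy model rather than Swift itself.

```python
from dataclasses import dataclass


def _adjust(cwnd: float, delay_us: float, target_us: float,
            ai: float = 1.0, beta: float = 0.8, max_mdf: float = 0.5) -> float:
    """One AIMD step: grow below target, shrink in proportion to the excess."""
    if delay_us < target_us:
        return cwnd + ai / max(cwnd, 1.0)        # about +ai per RTT
    excess = (delay_us - target_us) / delay_us   # 0..1, how far over target
    return cwnd * max(1.0 - beta * excess, 1.0 - max_mdf)


@dataclass
class TwoLoopSender:
    fabric_cwnd: float = 16.0
    endpoint_cwnd: float = 16.0
    fabric_target_us: float = 25.0     # assumed target for switch queueing
    endpoint_target_us: float = 100.0  # assumed target for NIC/host delay

    def on_ack(self, rtt_us: float, endpoint_delay_us: float) -> float:
        """Update both loops from one ACK and return the effective window."""
        fabric_delay_us = rtt_us - endpoint_delay_us
        self.fabric_cwnd = _adjust(
            self.fabric_cwnd, fabric_delay_us, self.fabric_target_us)
        self.endpoint_cwnd = _adjust(
            self.endpoint_cwnd, endpoint_delay_us, self.endpoint_target_us)
        # The tighter of the two windows limits what the sender may have in flight.
        return min(self.fabric_cwnd, self.endpoint_cwnd)
```

Feeding it an ACK with high RTT but low endpoint delay shrinks only the fabric window, and an ACK with high endpoint delay shrinks only the endpoint window. That is the "switch queue filling" versus "receiver overloaded" distinction.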

Why it works: modern NICs timestamp packets in hardware to roughly 10 ns precision. You can measure RTT precisely enough to see queueing buildup before the queue overflows. By the time DCQCN's CNP would have fired, Swift's sender has already backed off.

Result: 50 µs p99 latency at 100 Gbps under heavy load. No PFC events. Runs over commodity Ethernet.

Catch: requires hardware NIC timestamping (Mellanox CX-5+, Intel E810, Google's own silicon). And it's an end-host change — every sender in the fabric has to speak Swift.

Falcon (Google, opened to the ecosystem via OCP in 2023)

Idea: don't run RoCEv2 at all. Build a hardware RDMA transport purpose-designed for the workload.

Falcon sits where RoCEv2 sits in your stack, but:

  • Reliable delivery is in hardware (custom silicon, like CX-7 but Google's design — also shipping in the Intel E2100 IPU)
  • Congestion control is Swift-style (delay-based), built into the NIC
  • No PFC required — uses end-host packet pacing instead
  • Packet ordering is relaxed — out-of-order delivery handled by the transport, not by the network
  • Supports both RDMA semantics (verbs) and message-passing semantics on the same transport

Falcon is what runs in Google's GPU fabric. The TPU pods use a different proprietary interconnect (ICI).

The Ultra Ethernet Consortium spec borrows heavily from Falcon's ideas — out-of-order delivery, end-host pacing, no PFC. UET is, in many ways, the open-standards version of Falcon.

HPCC (Alibaba, SIGCOMM 2019)

Idea: switches embed per-hop queue depth + utilization directly into packet headers (INT, In-band Network Telemetry). The sender reads this and computes its rate directly instead of probing for it.

Sender ──packet [no INT data]──→ Switch1
                                  │ stamps queue depth + util
                                  ↓
         packet [INT: hop 1] ──→ Switch2
                                  │ stamps queue depth + util
                                  ↓
         packet [INT: hops 1-2] → Receiver
Sender ←──── ACK with INT vector ──── Receiver

Sender sees: max queue depth across path = 30%
             max link util across path   = 65%
→ compute optimal rate in <1 RTT, no overshoot

Where DCQCN takes many CNPs to converge to the right rate (it probes with repeated increase and decrease steps), HPCC sets the right rate in about one round-trip because it has precise telemetry.
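The one-round-trip claim is easier to see with the window update written out. The sketch below follows the general shape of the HPCC paper's rule (normalize the most-loaded hop's queue and throughput, then scale the window toward a target utilization), but it is simplified. The type name, the 95% target, and the additive term are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HopTelemetry:
    queue_bytes: float   # queue depth stamped by the switch
    tx_rate_bps: float   # measured output rate on the egress link
    link_bps: float      # capacity of the egress link


def path_utilization(hops: List[HopTelemetry], base_rtt_s: float) -> float:
    """Utilization of the most congested hop: queue term plus throughput term."""
    return max(
        h.queue_bytes * 8 / (h.link_bps * base_rtt_s) + h.tx_rate_bps / h.link_bps
        for h in hops
    )


def next_window(ref_window_bytes: float, hops: List[HopTelemetry],
                base_rtt_s: float, eta: float = 0.95,
                w_ai_bytes: float = 1000.0) -> float:
    """Scale the window so the bottleneck lands near the target utilization eta."""
    u = max(path_utilization(hops, base_rtt_s), 1e-6)  # avoid divide-by-zero
    return ref_window_bytes / (u / eta) + w_ai_bytes
```

Because the sender knows how far over or under target the bottleneck is, one update moves the window most of the way there instead of stepping toward it.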

Result: sub-millisecond convergence. Near-zero PFC events. Better tail latency than DCQCN.

Catch: requires INT-capable switches. Tomahawk-3 and later support this. Older fabrics can't speak it.

Meta DSF (Disaggregated Scheduled Fabric; first presented at OCP Summit 2024, scale figures below from the Oct 2025 update)

Idea: throw out the Ethernet packet-switching model entirely for the AI fabric. Replace with a cell-switched, virtual-output-queued, credit-based fabric — basically InfiniBand wearing Ethernet clothes.

Traditional Ethernet (per-packet, ECMP-routed)
──────────────────────────────────────────────
Sender → switch decides hop-by-hop where to send
Each hop independently picks next link via hash
Queue buildup → PFC → potential deadlock

Meta DSF (cell-switched, end-to-end scheduled)
──────────────────────────────────────────────
Sender's NIC chops the packet into fixed-size CELLS
Each cell is independently scheduled across the fabric
Receiver reassembles cells in order
No buffer buildup → no PFC needed
Cells get distributed across ALL paths simultaneously (perfect ECMP)
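The cell path in the diagram can be sketched in a few lines: segment, spray round-robin, reassemble by sequence number. The 256-byte cell size, the tuple layout, and the function names are assumptions. A real fabric does this in silicon with its own cell header format.

```python
from typing import List, Tuple

Cell = Tuple[int, int, bytes]  # (packet id, sequence number, payload chunk)


def segment(packet_id: int, payload: bytes, cell_bytes: int = 256) -> List[Cell]:
    """Chop one packet into fixed-size cells (the last cell may be shorter)."""
    return [
        (packet_id, seq, payload[off:off + cell_bytes])
        for seq, off in enumerate(range(0, len(payload), cell_bytes))
    ]


def spray(cells: List[Cell], n_links: int) -> List[List[Cell]]:
    """Distribute cells round-robin so every fabric link carries an equal share."""
    links: List[List[Cell]] = [[] for _ in range(n_links)]
    for i, cell in enumerate(cells):
        links[i % n_links].append(cell)
    return links


def reassemble(cells: List[Cell]) -> bytes:
    """Rebuild the packet from cells that may have arrived in any order."""
    return b"".join(chunk for _, _, chunk in sorted(cells, key=lambda c: c[1]))
```

Because cells are small and evenly spread, no single link sees a large flow, which is why the diagram can call this "perfect ECMP".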

The pieces:

  • Cells — fixed-size chunks (typically 256–512 bytes), not variable packets
  • VOQ (Virtual Output Queue) — sender holds cells in per-destination queues; receiver pulls when ready
  • Credit-based flow control — like InfiniBand. Receiver grants credits, sender only sends what it has credit for. Zero overshoot, ever
  • OCP-SAI + FBOSS — Meta's open switch software stack drives the fabric
  • Scale claim — 18K GPUs in a single L2 zone, ~90K GPUs in an L3 region
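The VOQ and credit bullets fit together as one small state machine. Below is a toy model, assuming credits are counted in cells and granted per destination. The class and method names are hypothetical and do not come from any DSF specification.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List


class CreditedVoqSender:
    """Per-destination queues that only drain when the destination grants credit."""

    def __init__(self) -> None:
        self.voq: Dict[str, Deque[bytes]] = defaultdict(deque)
        self.credits: Dict[str, int] = defaultdict(int)

    def enqueue(self, dest: str, cell: bytes) -> None:
        self.voq[dest].append(cell)

    def grant(self, dest: str, n_cells: int) -> None:
        """Called when the egress side signals room for n_cells more cells."""
        self.credits[dest] += n_cells

    def transmit(self, dest: str) -> List[bytes]:
        """Send as many queued cells as current credit allows, never more."""
        sent: List[bytes] = []
        queue = self.voq[dest]
        while queue and self.credits[dest] > 0:
            sent.append(queue.popleft())
            self.credits[dest] -= 1
        return sent
```

Two properties fall out directly: a destination with no credit cannot be overrun, and a stalled destination only holds back its own queue, so traffic to other destinations keeps moving.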

The kicker: this is very close to how Cray's Slingshot and NVIDIA's NVLink scale-up fabrics work. Meta is essentially saying "Ethernet at AI scale needs to become InfiniBand-shaped while keeping the open ecosystem."

Ultra Ethernet Consortium (UEC)

UEC is the Linux Foundation-hosted effort to standardize the post-PFC era of Ethernet for AI. Founding members are AMD, Arista, Broadcom, Cisco, Eviden, HPE, Intel, Meta, and Microsoft; NVIDIA and Oracle joined later.

UEC 1.0 (published June 2025):

  • UET (Ultra Ethernet Transport) — replaces RoCEv2 as the wire protocol
  • End-host packet pacing — mandatory; no PFC reliance
  • Out-of-order delivery — like Falcon; receiver reorders in hardware
  • Multi-path transmission — a single flow uses multiple paths simultaneously (packet spraying)
  • Backward-compatible substrate — UET runs on standard Ethernet switches; it's an endpoint change
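Packet spraying only works if the receiver can cope with packets arriving out of order. One way to do that is a reorder buffer keyed by sequence number, sketched below. The class is hypothetical and shows only the in-order release logic. A hardware transport may instead place data directly into memory without buffering.

```python
from typing import Dict, List


class ReorderBuffer:
    """Hold out-of-order packets and release them in sequence order."""

    def __init__(self) -> None:
        self.next_seq = 0
        self.pending: Dict[int, bytes] = {}

    def receive(self, seq: int, payload: bytes) -> List[bytes]:
        """Accept one packet; return every payload that is now deliverable."""
        if seq < self.next_seq or seq in self.pending:
            return []  # duplicate, e.g. from a retransmission
        self.pending[seq] = payload
        ready: List[bytes] = []
        while self.next_seq in self.pending:
            ready.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return ready
```

The network no longer has to preserve ordering, which is what frees it to use every path at once.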

UEC is the shared bet of everyone who isn't Google or Meta. NVIDIA's ConnectX-9 and Spectrum-5 will speak UEC. Broadcom Thor3 + Tomahawk-6 already do. Microsoft is on the standards committee. UEC 1.1, expected mid-2026, finalizes the on-the-wire spec.


Per-vendor deep-dives

Meta — DSF, FBOSS, multi-vendor NICs

Public stance: "The AI fabric needs a redesign. Standard L3 Clos + RoCE doesn't scale to our footprint."

| Layer | What Meta runs |
| --- | --- |
| Switch OS | FBOSS (open, in-house, forked from earlier work in 2015) |
| Switch ASIC | Broadcom Tomahawk family + Cisco SiliconOne |
| Fabric architecture | DSF: cells + VOQ + credit-based CC |
| Lossless mechanism | Credit-based (no PFC at the fabric) |
| AI NIC vendors | Multi-vendor: Broadcom Thor, Marvell Octeon, NVIDIA CX-7 in some zones |
| Topology scale | 18K GPUs L2 zone, ~90K L3 region |
| Older designs | Standard RoCE on rail-Clos (still in production for non-AI) |

Why multi-vendor NICs: Meta's volume justifies dual-sourcing for cost negotiation. It also de-risks against a single-vendor outage or supply shock.

Why FBOSS over SONiC: FBOSS came first (Meta open-sourced it in 2015). SONiC came later (Microsoft, 2016). FBOSS is more deeply integrated with Meta's network management plane, and switching costs are now prohibitive.

The surprise: DSF is a generation ahead of standard rail-Clos + PFC. Meta is publicly betting that 100K+ GPU fabrics can't be done well on packet-switched Ethernet, and is putting cell-switched VOQ into production to prove it.

Google — Swift, Falcon, Aquila, custom everything

Public stance: "Why use standard RoCEv2 when we can build our own transport that's strictly better?"

| Layer | What Google runs |
| --- | --- |
| General DC RDMA CC | Swift (delay-based, no PFC) |
| AI fabric transport | Falcon (custom HW RDMA transport, also shipping on Intel E2100 IPU) |
| ECMP collision avoidance | PLB (Protective Load Balancing): host-side, re-paths a connection by changing its flow label when the transport detects congestion |
| Switch OS | Custom (Stratum + their own management) |
| Switch ASIC | Mostly Broadcom + Google custom |
| GPU fabric | NVLink for scale-up (DGX), Falcon for scale-out |
| TPU fabric | Custom 3D-torus optical interconnect (ICI) |

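PLB is worth a deeper look. Standard ECMP just hashes the 5-tuple and lives with whatever collisions result. PLB (described by Google at SIGCOMM 2022) reacts on the host instead: when the transport sees sustained congestion on a connection, it changes the packets' flow label so that switches hash that connection onto a different path. It needs no switch changes, which sets it apart from switch-side schemes such as dynamic load balancing (DLB).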
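A toy model makes the difference between static hashing and congestion-triggered re-pathing concrete. The sketch assumes the switch hash covers the 5-tuple plus a flow label and uses CRC32 as a stand-in hash. Real switches use their own hash functions, and a real implementation would pick the new label at random rather than by scanning.

```python
import zlib
from typing import Tuple

FiveTuple = Tuple[str, str, int, int, str]  # src IP, dst IP, sport, dport, proto


def ecmp_path(flow: FiveTuple, n_paths: int, flow_label: int = 0) -> int:
    """Static ECMP: the same header fields always select the same path."""
    key = "|".join(str(field) for field in flow + (flow_label,)).encode()
    return zlib.crc32(key) % n_paths


def repath(flow: FiveTuple, n_paths: int, flow_label: int) -> int:
    """On congestion, look for a flow label that hashes to a different path.
    Returns the original label if none is found within 64 attempts."""
    current = ecmp_path(flow, n_paths, flow_label)
    for candidate in range(flow_label + 1, flow_label + 65):
        if ecmp_path(flow, n_paths, candidate) != current:
            return candidate
    return flow_label
```

With plain ECMP two large flows that collide stay collided for their lifetime. With re-pathing, the colliding flow gets another draw.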

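Aquila is Google's custom cell-switched fabric ASIC, published as an experimental low-latency datacenter fabric (NSDI 2022). TPU pods use a separate interconnect, ICI, wired as a 3D torus with optical switching. Neither is standard Ethernet end to end. Out of scope for RoCE-land, but worth knowing they exist.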

The surprise: Google built Falcon because they own the entire stack down to the application code (Pathways, JAX). They can afford to break NIC-side ecosystem compatibility because their training framework is in-house. Almost no one else can.

NVIDIA reference — DGX SuperPOD + Spectrum-X + ConnectX

Public stance: "Here's the reference design that works. Everyone else is some variant of this."

| Layer | NVIDIA reference (DGX SuperPOD) |
| --- | --- |
| GPU | H100 / B100 / B200 / B300 |
| Scale-up fabric | NVLink (NV18 NVSwitch, 900 GB/s per GPU) |
| Scale-out fabric NIC | ConnectX-7 (400G) / ConnectX-8 (800G) |
| Switch | Spectrum-4 (NVIDIA) or merchant (Arista/Cisco/Dell on Broadcom) |
| Switch OS | Cumulus Linux (NVIDIA) or SONiC |
| Transport | RoCEv2 with DCQCN + PFC |
| Collective lib | NCCL + DOCA-OFED |
| Recommended topology | Rail-optimized Clos (3-tier for >2K GPUs) |

Spectrum-X is NVIDIA's competing fabric to Falcon/DSF. It's their answer to "everyone else is fleeing PFC." Spectrum-X adds:

  • Adaptive routing (packet spraying across all paths)
  • Per-flow congestion isolation
  • Tighter DCQCN coupling between NIC and switch
  • "RoCE+", a slight variant on standard RoCEv2

The surprise: NVIDIA isn't fleeing PFC — they're doubling down on it but making the NIC+switch tightly coupled so PFC almost never has to fire. The bet is that the operational simplicity of "still looks like RoCEv2" beats Falcon/DSF's clean-room redesigns for the typical buyer.

xAI's Colossus reportedly uses Spectrum-X end-to-end.

xAI — single-socket boxes, brutally aggressive scale

Public stance (from public talks and papers by Igor Babuschkin, Greg Yang, and others): "Optimize for scale, latency, and operational simplicity. Cut anything that gets in the way."

| Choice | xAI approach |
| --- | --- |
| Server form factor | Single-socket boxes (avoid NUMA crossings) |
| GPU count | H100 / H200 / B200, scaled to ~200K GPUs at Memphis Colossus |
| Fabric | Reportedly Spectrum-X at Memphis |
| NIC | ConnectX-7 / Spectrum-X coupled |
| Cooling | Liquid-cooled rear-door heat exchangers |
| Network ops team | Tiny: rumored ~10 people for the entire Colossus network |
| Time-to-deploy | Memphis built in months, not years |

xAI runs single-socket because they refuse to pay the NUMA tax. Dual-socket boxes are cheaper per GPU but introduce NUMA-pinning complications — every GPU/NIC has to be mapped to the right CPU socket, every collective has to respect the topology, every misconfiguration costs throughput. Single-socket eliminates the problem entirely. Every GPU and every NIC is on one CPU. Software complexity drops.

The trade: single-socket boxes have less aggregate CPU and RAM per GPU. If your training workload is GPU-bound (most are), the missing CPU capacity doesn't matter. If you're doing heavy data-loader work or CPU-side preprocessing, single-socket starts to bottleneck.

The surprise: xAI moved fast. They built ~100K GPUs in Memphis in under a year by ruthlessly cutting anything that wasn't on the throughput-per-week critical path. The single-socket bet is a great example of choosing operational simplicity over per-rack cost optimization.

Microsoft Azure — the canonical DCQCN-at-400G shop

Public stance: "Standard RoCEv2 + DCQCN works, but you have to tune it precisely at every generation."

| Layer | What Azure runs |
| --- | --- |
| Switch ASIC | Mostly Broadcom (Tomahawk-3/4) |
| Switch OS | SONiC (Microsoft authored it!) |
| GPU instance type | ND H100 v5, ND H200 v5, etc. |
| SR-IOV | Yes: every VM gets a passthrough VF |
| NIC | ConnectX-7 (CX-6 in older SKUs) |
| Transport | RoCEv2 + DCQCN + PFC |
| Tuning | Published, aggressive 400G tuning |

The Microsoft DCQCN papers are the canonical references for tuning DCQCN at 400G. The headline insights:

  • ECN Kmin/Kmax need to scale with link bandwidth — at 400G, target Kmin ≈ 1 MB, Kmax ≈ 5 MB
  • DCQCN's Rai (additive increase rate) is too slow at 400G defaults — needs to be larger
  • T_active and T_hai (the "how long without congestion before speeding up" timers) need to be shorter at 400G — congestion signals arrive faster
  • CNP rate-limiting on receivers is critical to prevent CNP floods during PFC events
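The Kmin/Kmax bullet refers to the usual RED-style ECN marking curve on the switch. The sketch below shows that curve plus a naive linear scaling of thresholds with link speed. Linear scaling and the 100G starting values are assumptions, chosen only so that the result reproduces the 400G figures quoted above. They are not published tuning guidance.

```python
from typing import Tuple


def ecn_mark_probability(queue_bytes: float, kmin_bytes: float,
                         kmax_bytes: float, pmax: float = 0.1) -> float:
    """RED-style ECN curve: no marking below Kmin, a linear ramp up to pmax
    between Kmin and Kmax, and every packet marked at or above Kmax."""
    if queue_bytes <= kmin_bytes:
        return 0.0
    if queue_bytes >= kmax_bytes:
        return 1.0
    return pmax * (queue_bytes - kmin_bytes) / (kmax_bytes - kmin_bytes)


def scale_thresholds(kmin_bytes: float, kmax_bytes: float,
                     from_gbps: float, to_gbps: float) -> Tuple[float, float]:
    """Scale both thresholds in proportion to link bandwidth (an assumption)."""
    factor = to_gbps / from_gbps
    return kmin_bytes * factor, kmax_bytes * factor


if __name__ == "__main__":
    # Hypothetical 100G baseline, scaled up to 400G.
    kmin, kmax = scale_thresholds(250_000, 1_250_000, 100, 400)
    print(kmin, kmax, ecn_mark_probability(3_000_000, kmin, kmax))
```

The intuition is that a queue of fixed byte depth drains four times faster at 400G than at 100G, so thresholds left at their 100G values mark far too early.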

The surprise: Microsoft owns SONiC and the canonical DCQCN tuning playbook. At 100K+ GPUs across Azure, they have proven that standard RoCEv2 can be made to work — provided you treat DCQCN tuning as a per-generation engineering exercise, not a one-time config.

AWS — EFA, SRD, custom Annapurna silicon

Public stance: "RoCEv2 is fine for everyone else. We have custom silicon, so we're going to build something better."

| Layer | What AWS runs |
| --- | --- |
| NIC | Annapurna Nitro NIC (custom AWS silicon) |
| RDMA library | EFA (Elastic Fabric Adapter); exposes libfabric, not verbs |
| Transport | SRD (Scalable Reliable Datagram), NOT RoCEv2 |
| Multi-path | SRD sprays packets across multiple paths natively |
| Ordering | SRD allows out-of-order delivery to the app |
| Lossless | No PFC required; SRD handles re-transmits in-NIC at sub-µs |
| Per-VM passthrough | Yes: every EFA-enabled EC2 instance gets a dedicated EFA device |
| NCCL plugin | aws-ofi-nccl translates NCCL → libfabric → EFA |

SRD's design is worth understanding because it's the clearest preview of where the industry is going.

Traditional RoCEv2 + ECMP
─────────────────────────
Flow F → hash(5-tuple) → ALWAYS path A
Path A congested → flow F stalls
Other paths (B, C, D) sit idle
Per-flow ordering preserved (good)
Per-flow throughput limited to one path's capacity

SRD (AWS)
─────────
Flow F → packets sprayed across paths A, B, C, D
Each packet takes the least-congested path
Receiver reassembles out-of-order packets
Per-flow throughput = SUM of all paths' available capacity
App may see out-of-order delivery (must tolerate)

This is the same idea as UEC's multi-path mode and Meta DSF's cell spraying. SRD shipped years before UEC standardized it.
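The throughput lines in the comparison reduce to a min versus a sum. The sketch below models each path by its spare capacity in Gbps. The function names and the proportional split are assumptions for illustration, since SRD's actual path selection is not public in this level of detail.

```python
from typing import List


def ecmp_throughput(available_gbps: List[float], chosen_path: int,
                    demand_gbps: float) -> float:
    """Single-path ECMP: the flow gets at most what its one hashed path can spare."""
    return min(demand_gbps, available_gbps[chosen_path])


def spray_throughput(available_gbps: List[float], demand_gbps: float) -> float:
    """Spraying: the flow can use the spare capacity of every path at once."""
    return min(demand_gbps, sum(available_gbps))


def spray_shares(available_gbps: List[float], demand_gbps: float) -> List[float]:
    """Split the flow across paths in proportion to each path's spare capacity."""
    total = sum(available_gbps)
    if total == 0:
        return [0.0 for _ in available_gbps]
    send = min(demand_gbps, total)
    return [send * spare / total for spare in available_gbps]
```

With spare capacity of 10, 100, 100, and 100 Gbps, a 200 Gbps flow hashed onto the first path gets 10 Gbps under ECMP and the full 200 Gbps when sprayed.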

The catch: EFA isn't compatible with standard libibverbs. NCCL works via the aws-ofi-nccl plugin, which translates NCCL → libfabric → EFA. Code written directly against verbs does not "just work" on EFA; it has to be ported to libfabric or run behind a plugin like this one. AWS bears that ecosystem cost willingly because they control the platform.

The surprise: AWS happily broke verbs API compatibility to ship SRD in 2018. They were the first hyperscaler to fully exit RoCEv2, and SRD has been in production at the largest cloud scale for years before UEC even existed.

Alibaba — HPCC + in-network telemetry

Public stance: "Switches know more about congestion than endpoints. Let them help."

| Layer | What Alibaba runs |
| --- | --- |
| Transport | RoCEv2 (largely) |
| CC | HPCC (INT-based) |
| Lossless | Near-zero PFC events claimed in production |
| Switch ASIC | Tomahawk 3/4 + their own Hanguang silicon for inference fabric |
| Switch OS | Mixed; some SONiC, some proprietary |
| Cloud RDMA | eRDMA for tenant RDMA in VPC |

HPCC details (from the SIGCOMM 2019 paper):

  • Switches stamp per-hop queue depth, link utilization, and timestamp into INT-capable headers
  • Receiver echoes the INT vector back in the ACK
  • Sender's algorithm uses INT data to compute "optimal window size" directly from the most-congested hop on path
  • Convergence: 1 RTT to the optimal rate (vs DCQCN's 5–10 RTTs)
  • Production at Alibaba: PFC pause counters drop by orders of magnitude vs DCQCN baseline

The surprise: HPCC is the proof point that INT-based CC works in production at scale. The paper pre-dates Swift's by a year and Falcon by several. UEC's multi-signal CC is in many ways an industrial generalization of HPCC's design philosophy.

Anthropic — pragmatic, less-documented, multi-cloud + Trainium

Public stance (mostly inferred — Anthropic publishes less than peers): "Use the best available silicon, optimize the application stack, treat the network as infrastructure not differentiation."

| Layer | What Anthropic uses |
| --- | --- |
| GPU types | NVIDIA H100, B200 + AWS Trainium2/3 |
| Where the compute lives | Heavily on AWS (EFA + Trainium) + GCP (NCCL + RoCE) |
| Network stack on AWS workloads | EFA + SRD via aws-ofi-nccl |
| Network stack on GCP workloads | Standard NCCL + RoCEv2 (via GCP A3 ultra instances) |
| Training framework | Internal (Claude training stack; JAX for some, PyTorch for others) |

What we can infer:

  • Anthropic's network engineers spend more time on storage I/O, dataloader pipelines, and checkpoint replication than on RoCE tuning. The network largely "just works" because they ride cloud providers' battle-tested fabrics.
  • They have strong opinions on interpretability of failures — they favor systems where every failure mode can be diagnosed end-to-end. That biases them toward NCCL + RoCEv2 (well-known failure modes) over EFA (newer, harder to debug from the application side) — except where AWS economics force EFA.
  • Their public emphasis on safety and reliability of training translates to networking decisions favoring conservatism over peak throughput. Train Claude correctly at 95% of theoretical bandwidth; never train Claude unreliably at 110%.

The surprise: Anthropic publishes less about networking than Google or Meta because much of their differentiation is in how they use the network, not in custom network hardware. They're a vendor-fabric-characterization shop, not a custom-fabric-design shop.


The big comparison matrix

Side-by-side, the entire design space:

| Org | Transport | CC algo | Lossless mechanism | Multi-path | Switch silicon | NIC vendor | AI scale |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Meta DSF (Oct 2025) | Cell-switched proprietary | Credit-based | NO PFC (credits) | Per-cell spray (perfect ECMP) | Broadcom + Cisco (FBOSS) | Multi-vendor: Broadcom Thor, Marvell, NVIDIA | ~90K L3 region |
| Google Swift + Falcon | RoCEv2 over Swift OR custom Falcon | Delay-based + Falcon HW | NO PFC (host pacing) | PLB (ECMP+) | Broadcom + Google Aquila | NVIDIA + Google custom (Falcon HW) | 100K+ at peak |
| NVIDIA reference | RoCEv2 + Spectrum-X RoCE+ | DCQCN (tunable) | PFC + ECN | Spectrum-X adaptive routing | Spectrum-4 or Broadcom | CX-7 / CX-8 | DGX SuperPOD reference |
| xAI Colossus | RoCEv2 + Spectrum-X | DCQCN + Spectrum-X | PFC + ECN | Spectrum-X adaptive routing | NVIDIA Spectrum-4 | NVIDIA CX-7 | ~200K GPUs |
| Microsoft Azure | RoCEv2 (standard) | DCQCN, 400G-tuned | PFC + ECN | 5-tuple ECMP | Broadcom (SONiC) | NVIDIA CX-7 | 100K+ across Azure |
| AWS EFA | SRD (custom) | End-host pacing | NO PFC (host re-tx) | Per-packet spray | AWS custom Annapurna | AWS Nitro / Annapurna | Largest in cloud |
| Alibaba HPCC | RoCEv2 + INT headers | HPCC (INT-based) | Near-zero PFC | 5-tuple ECMP | Broadcom (INT-capable) | NVIDIA CX-6/7 | ~100K |
| Anthropic | Mix: EFA (AWS) + RoCEv2 (GCP/own iron) + Trainium fabric | Inherits from cloud | Inherits from cloud | Inherits from cloud | Inherits from cloud | Mix | Renting at hyperscale |
| UEC future (industry) | UET (replaces RoCEv2) | End-host pacing + multi-signal | NO PFC (pacing) | Per-packet spray (default) | Any (open standard) | Any (open standard) | Target: 1M GPUs |

Four patterns jump out:

  1. PFC is on the way out. Every "future" design (DSF, Falcon, EFA, UEC) is PFC-free. The DCQCN+PFC stack is the conservative, well-understood option that buys time until UEC arrives.
  2. Multi-path is the new default. Everyone except the pure-DCQCN shops (Azure, Alibaba's RoCE side) is moving toward per-packet or per-cell spraying. Single-path 5-tuple ECMP is the relic.
  3. NVIDIA NIC dominance is real but cracking. AWS uses Annapurna. Meta runs multi-vendor (Broadcom Thor, Marvell, NVIDIA). Google has its own silicon (Falcon HW, Aquila). Microsoft Azure and xAI are still pure NVIDIA-NIC shops.
  4. Custom transport is the differentiator. Google (Falcon), AWS (SRD), Meta (DSF), and eventually UEC all replace the wire protocol. RoCEv2 is the laggard — well-loved, but technically the oldest design in the matrix.

The UEC / NIXL / Pensando / Thor3 roadmap

Five near-term developments on a 6–24 month horizon.

Ultra Ethernet Consortium (UEC) — 12–24 month horizon

UEC 1.0 published. UEC 1.1, expected mid-2026, finalizes the on-the-wire spec.

NIC silicon supporting UEC:

  • NVIDIA ConnectX-9 (rumored 2026) — first NVIDIA UEC NIC
  • Broadcom Thor3 (shipping 2025) — already UEC-aware
  • AMD Pensando Pollara 400 (2024–2025) — UEC-first design

Switch silicon supporting UEC:

  • Broadcom Tomahawk-6 — first UEC-class switching ASIC
  • NVIDIA Spectrum-5 — likely UEC-compatible alongside RoCE+

What changes when UEC arrives:

  • DCQCN goes away. Replaced by UET's end-host pacing.
  • PFC dependency eliminated. Switches won't need lossless queue config in the same way.
  • ECMP collision becomes a non-issue (UET sprays packets natively).
  • Manual path pinning and source routing become mostly irrelevant, because UET spreads each flow across paths on its own.

This is a host-stack rewrite. The way you reason about a RoCEv2 fabric does not survive UEC adoption.

NVIDIA NIXL — 6–12 month horizon

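NIXL = NVIDIA Inference Xfer Library. It is a point-to-point transfer library rather than a collective library: it targets data movement in distributed inference (for example KV-cache transfers between GPUs, host memory, and storage) behind a pluggable backend interface. Aimed at the B300 era.

The idea: disaggregated inference moves large tensors between nodes in patterns that do not look like the all-reduce traffic NCCL is built for, so it gets its own library that can pick the transport per transfer.

Becomes relevant as serving is split across separate prefill and decode nodes. Not urgent for training fabrics in 2026; important for inference fabrics by 2027.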

AMD Pensando + AMD Pollara

AMD bought Pensando in 2022. The Pollara 400 NIC is their first AI-targeting SmartNIC. It's UEC-native, supports both RDMA and SmartNIC functions (security offload, encryption, telemetry).

The relevance: AMD as a second-source NIC is increasingly viable. Even for shops that don't switch from NVIDIA, having a credible alternative in the procurement conversation is worth real money.

Intel IPU + Intel Mount Evans

Intel's SmartNIC line (the Mount Evans IPU family) is currently weaker than NVIDIA/AMD in AI fabric features. Unlikely to be a real contender for B300-generation deployments; watch the next-gen. Note that Intel's E2100 IPU is the silicon Google's Falcon transport runs on, which is a meaningful Falcon-ecosystem signal.

Broadcom Thor3 / Thor4

Already shipping in production at Meta and others. Native UEC, multi-path, INT support. The real choice for any operator that wants to escape single-vendor NIC dependency without going to AMD.


What you should remember

  • The hyperscaler RoCE design space has four escape routes from PFC: smarter end-host CC (Swift, HPCC), cell-switched fabrics (Meta DSF), custom transports (Falcon, SRD, UET), and adaptive routing on top of RoCEv2 (Spectrum-X).
  • PFC + DCQCN works fine at 4K–16K GPUs. It starts to crack at 32K+ and breaks at 100K+. Every hyperscaler operating at 100K+ has either left RoCEv2 (AWS, Google), built around it (Meta DSF), or paired it with adaptive routing (NVIDIA Spectrum-X, xAI).
  • Microsoft is the proof point that DCQCN can scale to 100K+ — but only if you treat 400G tuning as a per-generation engineering exercise, not a one-time config.
  • AWS SRD is the cleanest preview of where the industry is going: packet spraying, out-of-order delivery, no PFC, hardware reassembly. UEC is essentially the open-standards version of the same idea.
  • xAI's single-socket bet is the most underrated decision in the matrix. Eliminating NUMA is worth more than the CPU/RAM you give up, if your workload is GPU-bound (most AI training is).
  • UEC adoption is a host-stack rewrite, not a config change. When CX-9 / Thor3 / Pollara ship in volume, the DCQCN-tuning skill set becomes partially obsolete and the UET-pacing skill set takes over.
  • The matrix in one line: NVIDIA wants to evolve RoCEv2 (Spectrum-X); Google and AWS have already replaced it (Falcon, SRD); Meta has replaced the fabric (DSF); Alibaba and Microsoft are squeezing the last out of RoCEv2 + DCQCN; UEC is where everyone meets in 2027.

Next: Switch QoS → — the switch-side configuration that makes any of these stacks actually deliver. DSCP-to-TC mapping, ECN watermarks, PFC headroom, buffer carving — the knobs you turn whether you're running plain RoCEv2 + DCQCN or paired with adaptive routing.