Hyperscaler RoCE Stacks
The previous pages laid out the design space — transports, congestion control algorithms, the curriculum's pick. This page is the field map: what every major hyperscaler actually ships in production, and why.
If you sat down with a network engineer from Meta or AWS tomorrow, half their vocabulary would be different. They'd talk about VOQ, cells, Falcon, EFA, SRD, INT, Swift, DSF, UEC. They aren't doing magic. They're just at a different point on the same design space, optimizing for a different scale, a different cost curve, a different customer.
By the end of this page you should know what each hyperscaler ships, why they ship it, and where the industry is heading.
Why PFC is going away
For 10 years, the canonical recipe for "how do I run RoCE at scale?" was:
- Mark with DSCP
- Reserve a lossless queue with PFC
- Tune ECN + DCQCN to keep PFC almost-never-firing
- Hope nothing deadlocks
This still works at 4K–16K GPU scale. It is also fundamentally fragile, and every hyperscaler with a research budget is trying to escape it.
| Failure mode | Cause | Real-world impact |
|---|---|---|
| Deadlock | Two switches pause each other in a cycle | Whole fabric region stops forwarding lossless traffic. Requires watchdog drops to recover. |
| Head-of-line blocking | One slow flow pauses an entire priority class | Innocent flows on the same priority freeze for tens of microseconds at a time. |
| PFC storms | Cascading PAUSE upstream | One bad receiver can backpressure 100 senders. Tail latency explodes. |
| All-or-nothing | PAUSE = 100% stop, RESUME = 100% go | No graceful rate control. Throughput oscillates badly. |
| L2 scope | PAUSE doesn't cross routed boundaries | Designs that want L3-routable lossless can't rely on PFC alone. |
| Headroom math | Buffer reservation grows with RTT × link speed | At 400G with 1 km cables, headroom alone consumes meaningful switch SRAM. At 800G, worse. |
PFC is a 2008-era solution. It was good enough when "scale" meant a 100-node storage or HPC cluster inside a single L2 domain (iWARP, which rides on TCP, never depended on it). At 100,000 GPUs on 400G/800G fabrics, every one of these failure modes becomes a real production incident.
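The "Headroom math" row can be made concrete with a back-of-the-envelope estimate. The sketch below is a simplified model, not a vendor formula: it assumes about 5 ns/m propagation in fiber, a hypothetical 1 µs PAUSE response time, and one MTU in flight in each direction. The function name and defaults are illustrative.

```python
def pfc_headroom_bytes(link_gbps: float, cable_m: float,
                       mtu_bytes: int = 4200,
                       response_ns: float = 1000.0) -> float:
    """Rough per-port, per-priority PFC headroom (assumed simplified model)."""
    fiber_ns_per_m = 5.0              # assumption: ~5 ns per metre of fiber
    bytes_per_ns = link_gbps / 8.0    # 1 Gbps is 0.125 bytes per nanosecond
    round_trip_ns = 2 * cable_m * fiber_ns_per_m
    # Bytes that keep arriving between "decide to PAUSE" and "sender stops".
    in_flight = bytes_per_ns * (round_trip_ns + response_ns)
    # One MTU may already be serializing in each direction.
    return in_flight + 2 * mtu_bytes


for gbps in (100, 400, 800):
    kib = pfc_headroom_bytes(gbps, cable_m=1000) / 1024
    print(f"{gbps}G over 1 km: about {kib:.0f} KiB per port per priority")
```

Under these assumptions the reservation grows linearly with link speed, and it has to be multiplied by every port and every lossless priority on the switch.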
The four escape architectures
Every hyperscaler now ships some variant of one of these:
Escape 1 — Smarter end-host CC, no PFC needed (delay-based or telemetry-based control)
- Google Swift (RTT-based)
- Alibaba HPCC (INT telemetry stamped by switches)
- UEC (industry standard, multi-signal)
Escape 2 — Replace IP/Ethernet with a cell-switched fabric (VOQ + per-flow scheduling + credit-based, like InfiniBand)
- Meta DSF (Disaggregated Scheduled Fabric)
- Cisco SiliconOne (some deployments)
- Broadcom Jericho3-AI
Escape 3 — Replace RoCEv2 with a custom transport (run RDMA on your own reliable protocol, not on PFC)
- AWS EFA / SRD (Scalable Reliable Datagram)
- Google Falcon (hardware RDMA transport)
- UEC's UET (Ultra Ethernet Transport)
Escape 4 — Keep RoCEv2, add adaptive routing + tighter NIC/switch coupling
- NVIDIA Spectrum-X (adaptive routing, packet spraying, per-flow congestion isolation)
The rest of this page walks each hyperscaler's stack and points out the surprising choices.
Swift (Google, SIGCOMM 2020)
Idea: End-to-end delay is the congestion signal. No switch help, no PFC, no ECN required.
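Sender                  Network                 Receiver
──────                  ───────                 ────────
t0: send packet  ─────────────────────────────→ t1: receive
t3: ACK arrives  ←──────────── ACK ──────────── t2: ACK sent
Sender computes:
  rtt            = t3 - t0
  endpoint_delay = t2 - t1   (receiver-side time, echoed in the ACK)
  fabric_delay   = rtt - endpoint_delay
  if fabric_delay > target_delay → cut the window (multiplicative decrease)
  if fabric_delay < target_delay → grow the window (additive increase)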
Two-loop control. Fabric delay (network queue depth) is one signal; endpoint delay (NIC + host stack) is a separate signal. Sender reacts independently to each, which lets it distinguish "switch queue filling" from "receiver overloaded."
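A minimal sketch of the two-loop idea follows. It is not Google's implementation: the target delays, gain constants, and class name are hypothetical, and real Swift also paces packets and limits how often it decreases. The point is only that the fabric loop and the endpoint loop each run the same target-delay rule, and the sender obeys whichever window is smaller.

```python
from dataclasses import dataclass


def swift_like_update(cwnd: float, delay_us: float, target_us: float,
                      ai: float = 1.0, beta: float = 0.8,
                      max_mdf: float = 0.5, min_cwnd: float = 0.1) -> float:
    """Target-delay AIMD step (simplified; constants are assumptions)."""
    if delay_us < target_us:
        # Below target: additive increase, spread across a window of ACKs.
        return cwnd + ai / max(cwnd, 1.0)
    # Above target: decrease in proportion to how far past the target we are.
    excess = (delay_us - target_us) / delay_us
    return max(cwnd * (1.0 - min(beta * excess, max_mdf)), min_cwnd)


@dataclass
class TwoLoopSender:
    fabric_cwnd: float = 10.0
    endpoint_cwnd: float = 10.0
    fabric_target_us: float = 25.0     # hypothetical target
    endpoint_target_us: float = 100.0  # hypothetical target

    def on_ack(self, rtt_us: float, endpoint_delay_us: float) -> float:
        fabric_delay = rtt_us - endpoint_delay_us
        self.fabric_cwnd = swift_like_update(
            self.fabric_cwnd, fabric_delay, self.fabric_target_us)
        self.endpoint_cwnd = swift_like_update(
            self.endpoint_cwnd, endpoint_delay_us, self.endpoint_target_us)
        # The tighter of the two loops governs what the sender may send.
        return min(self.fabric_cwnd, self.endpoint_cwnd)


sender = TwoLoopSender()
print(sender.on_ack(rtt_us=20, endpoint_delay_us=5))     # both loops under target
print(sender.on_ack(rtt_us=300, endpoint_delay_us=200))  # both loops over target
```

Keeping two windows is what lets the sender tell a congested switch from a slow receiver: only the loop whose delay exceeded its target shrinks.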
Why it works: modern NICs timestamp packets in hardware to roughly 10 ns precision. You can measure RTT precisely enough to see queueing buildup before the queue overflows. By the time DCQCN's CNP would have fired, Swift's sender has already backed off.
Result (as reported in the paper): tail latency under 50 µs for short RPCs while sustaining roughly 100 Gbps per server, with near-zero packet drops. No PFC events. Runs over commodity Ethernet.
Catch: requires hardware NIC timestamping (Mellanox CX-5+, Intel E810, Google's own silicon). And it's an end-host change — every sender in the fabric has to speak Swift.
Falcon (Google, announced at OCP Global Summit 2023)
Idea: don't run RoCEv2 at all. Build a hardware RDMA transport purpose-designed for the workload.
Falcon sits where RoCEv2 sits in your stack, but:
- Reliable delivery is in hardware (custom silicon, like CX-7 but Google's design — also shipping in the Intel E2100 IPU)
- Congestion control is Swift-style (delay-based), built into the NIC
- No PFC required — uses end-host packet pacing instead
- Packet ordering is relaxed — out-of-order delivery handled by the transport, not by the network
- Supports both RDMA semantics (verbs) and message-passing semantics on the same transport
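Falcon is Google's hardware transport for Ethernet scale-out traffic. The TPU pods use a different proprietary fabric: ICI (inter-chip interconnect) links arranged in a 3D torus and reconfigured through optical circuit switches.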
The Ultra Ethernet Consortium spec borrows heavily from Falcon's ideas — out-of-order delivery, end-host pacing, no PFC. UET is, in many ways, the open-standards version of Falcon.
HPCC (Alibaba, SIGCOMM 2019)
Idea: switches embed per-hop queue depth + utilization directly into packet headers (INT — In-band Network Telemetry). Sender reads this and computes the perfect rate.
Sender ────packet [no INT data]────→ Switch1
                                        │ stamp queue depth + util
                                        ↓
                                     Switch2
                                        │ stamp queue depth + util
                                        ↓
                                     Receiver
Sender ←────ACK with INT vector───── Receiver
Sender sees: max queue depth across path = 30%
             max link util across path = 65%
→ compute optimal rate in <1 RTT, no overshoot
Where DCQCN takes many CNPs to converge to the right rate (a series of rate cuts followed by timer-driven recovery), HPCC can move to the right rate in about one round-trip because it has precise telemetry.
Result: sub-millisecond convergence. Near-zero PFC events. Better tail latency than DCQCN.
Catch: requires INT-capable switches. Tomahawk-3 and later support this. Older fabrics can't speak it.
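The core of the HPCC calculation can be sketched in a few lines. This is a simplified single-step version of the idea in the paper, not Alibaba's production code: it omits the per-ACK versus per-RTT reference window and the additive-increase staging. The `HopINT` type, the field names, and the `eta` and `w_ai_bytes` defaults are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HopINT:
    qlen_bytes: float    # queue depth stamped by the switch
    tx_rate_bps: float   # measured egress rate on that hop
    link_bps: float      # link capacity


def hpcc_like_window(w_bytes: float, hops: List[HopINT], base_rtt_s: float,
                     eta: float = 0.95, w_ai_bytes: float = 80.0) -> float:
    """One simplified HPCC-style window update from an echoed INT vector."""
    def utilization(h: HopINT) -> float:
        bdp_bytes = h.link_bps / 8.0 * base_rtt_s
        # Queued bytes count as "extra" utilization on top of the link rate.
        return h.qlen_bytes / bdp_bytes + h.tx_rate_bps / h.link_bps

    # The most congested hop on the path dictates the window.
    u_max = max(max(utilization(h) for h in hops), 1e-6)
    # Scale the window so the bottleneck lands at the target utilization eta.
    return w_bytes / (u_max / eta) + w_ai_bytes


path = [HopINT(0, 50e9, 100e9), HopINT(62_500, 100e9, 100e9)]
print(hpcc_like_window(125_000, path, base_rtt_s=10e-6))
```

Because the update divides by the measured utilization, an overloaded hop shrinks the window in a single step and an underused path grows it in a single step, which is where the fast convergence comes from.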
Meta DSF (Disaggregated Scheduled Fabric, announced Oct 2025)
Idea: throw out the Ethernet packet-switching model entirely for the AI fabric. Replace with a cell-switched, virtual-output-queued, credit-based fabric — basically InfiniBand wearing Ethernet clothes.
Traditional Ethernet (per-packet, ECMP-routed)
──────────────────────────────────────────────
Sender → switch decides hop-by-hop where to send
Each hop independently picks next link via hash
Queue buildup → PFC → potential deadlock
Meta DSF (cell-switched, end-to-end scheduled)
──────────────────────────────────────────────
Ingress switch chops each packet into fixed-size CELLS
Each cell is independently scheduled across the fabric
Receiver reassembles cells in order
No buffer buildup → no PFC needed
Cells get distributed across ALL paths simultaneously (perfect ECMP)
The pieces:
- Cells — fixed-size chunks (typically 256–512 bytes), not variable packets
- VOQ (Virtual Output Queue): the ingress side holds traffic in per-destination queues and releases it only when the egress side is ready
- Credit-based flow control, like InfiniBand: the egress scheduler grants credits and the ingress side sends only what it has credit for, so the fabric is never asked to absorb an overshoot
- OCP-SAI + FBOSS — Meta's open switch software stack drives the fabric
- Scale claim — 18K GPUs in a single L2 zone, ~90K GPUs in an L3 region
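The interaction between cells, a VOQ, and credits can be shown with a toy model. This is a sketch under stated assumptions, not Meta's or Broadcom's design: the 256-byte cell size, the `CreditedVOQ` class, and round-robin spraying are all illustrative choices.

```python
from collections import deque

CELL_BYTES = 256  # assumed cell size for illustration


def to_cells(packet_id: int, payload: bytes, cell_bytes: int = CELL_BYTES):
    """Split one packet into (packet_id, sequence, chunk) cells."""
    return [(packet_id, seq, payload[off:off + cell_bytes])
            for seq, off in enumerate(range(0, len(payload), cell_bytes))]


class CreditedVOQ:
    """One per-destination queue that only drains against granted credits."""

    def __init__(self):
        self.queue = deque()
        self.credits = 0

    def enqueue(self, cells):
        self.queue.extend(cells)

    def grant(self, n: int):
        self.credits += n  # the egress side says "I can take n more cells"

    def drain(self, num_links: int):
        """Spray as many cells as credits allow, round-robin across links."""
        sent = [[] for _ in range(num_links)]
        i = 0
        while self.queue and self.credits > 0:
            sent[i % num_links].append(self.queue.popleft())
            self.credits -= 1
            i += 1
        return sent


def reassemble(cells) -> bytes:
    """Egress side: put cells back in sequence order."""
    return b"".join(chunk for _, _, chunk in sorted(cells, key=lambda c: c[1]))


payload = bytes(1000)
voq = CreditedVOQ()
voq.enqueue(to_cells(packet_id=1, payload=payload))
voq.grant(3)
first = voq.drain(num_links=2)
print([len(link) for link in first], "cells sent;", len(voq.queue), "still queued")
```

With only three credits granted, the fourth cell waits at the ingress side instead of piling up inside the fabric, which is the property that removes the need for PFC.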
The kicker: this is very close to how Cray's Slingshot and NVIDIA's NVLink scale-up fabrics work. Meta is essentially saying "Ethernet at AI scale needs to become InfiniBand-shaped while keeping the open ecosystem."
Ultra Ethernet Consortium (UEC)
UEC is the Linux Foundation-hosted effort to standardize the post-PFC era of Ethernet for AI. Founding members include AMD, Arista, Broadcom, Cisco, Eviden, HPE, Intel, Meta, and Microsoft; Oracle and NVIDIA joined later.
UEC 1.0 (published June 2025):
- UET (Ultra Ethernet Transport) — replaces RoCEv2 as the wire protocol
- End-host packet pacing — mandatory; no PFC reliance
- Out-of-order delivery — like Falcon; receiver reorders in hardware
- Multi-path transmission — a single flow uses multiple paths simultaneously (packet spraying)
- Backward-compatible substrate — UET runs on standard Ethernet switches; it's an endpoint change
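Once a flow is sprayed across paths, something at the receiver has to cope with packets arriving out of order. The sketch below shows the simplest form of that bookkeeping, a sequence-numbered reorder buffer. It is an illustration only: UET hardware is not specified here, and the `ReorderBuffer` class is a hypothetical name.

```python
class ReorderBuffer:
    """Deliver payloads in sequence order no matter how they arrive."""

    def __init__(self):
        self.next_seq = 0   # next sequence number owed to the application
        self.pending = {}   # seq -> payload, held until the gap closes

    def receive(self, seq: int, payload):
        """Accept one packet; return the payloads that are now deliverable."""
        if seq < self.next_seq or seq in self.pending:
            return []       # duplicate, for example a spurious retransmit
        self.pending[seq] = payload
        ready = []
        while self.next_seq in self.pending:
            ready.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return ready


rb = ReorderBuffer()
for seq, data in [(1, "b"), (2, "c"), (0, "a")]:
    print(seq, "->", rb.receive(seq, data))
```

The cost of multi-path shows up as the size of `pending`: the more the paths differ in latency, the more state the receiver has to hold.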
UEC is the shared bet of everyone who has not already shipped a proprietary transport of their own. NVIDIA's ConnectX-9 and Spectrum-5 are expected to speak UEC. Broadcom positions Thor3 + Tomahawk-6 as UEC-ready today. Microsoft is on the standards committee. UEC 1.1, expected mid-2026, finalizes the on-the-wire spec.
Per-vendor deep-dives
Meta — DSF, FBOSS, multi-vendor NICs
Public stance: "The AI fabric needs a redesign. Standard L3 Clos + RoCE doesn't scale to our footprint."
| Layer | What Meta runs |
|---|---|
| Switch OS | FBOSS (in-house, open-sourced in 2015) |
| Switch ASIC | Broadcom (Jericho3-AI class for DSF, Tomahawk family elsewhere) + Cisco SiliconOne |
| Fabric architecture | DSF: cells + VOQ + credit-based CC |
| Lossless mechanism | Credit-based (no PFC at the fabric) |
| AI NIC vendors | Multi-vendor: Broadcom Thor, Marvell Octeon, NVIDIA CX-7 in some zones |
| Topology scale | 18K GPUs L2 zone, ~90K L3 region |
| Older designs | Standard RoCE on rail-Clos (still in production for non-AI) |
Why multi-vendor NICs: Meta's volume justifies dual-sourcing for cost negotiation. It also de-risks against a single-vendor outage or supply shock.
Why FBOSS over SONiC: FBOSS came first (open-sourced in 2015). SONiC arrived a year later (Microsoft, 2016). FBOSS is more deeply integrated with Meta's network management plane, and switching costs are now prohibitive.
The surprise: DSF is a generation ahead of standard rail-Clos + PFC. Meta is publicly betting that 100K+ GPU fabrics can't be done well on packet-switched Ethernet, and is putting cell-switched VOQ into production to prove it.
Google — Swift, Falcon, Aquila, custom everything
Public stance: "Why use standard RoCEv2 when we can build our own transport that's strictly better?"
| Layer | What Google runs |
|---|---|
| General DC RDMA CC | Swift (delay-based, no PFC) |
| AI fabric transport | Falcon (custom HW RDMA transport, also shipping on Intel E2100 IPU) |
| ECMP collision avoidance | PLB (Protective Load Balancing): host-side; when a connection keeps seeing congestion signals, the sender changes its IPv6 flow label so ECMP hashes it onto a different path |
| Switch OS | Custom (Stratum + their own management) |
| Switch ASIC | Mostly Broadcom merchant silicon |
| GPU fabric | NVLink for scale-up (DGX), Falcon for scale-out |
| TPU fabric | Custom ICI: 3D torus stitched together by optical circuit switches |
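PLB is worth a deeper look. Standard ECMP just hashes the 5-tuple and lives with whatever collisions result. PLB leaves the switches alone: the host watches the congestion signals it already has and, when they persist, repaths the connection by changing the flow label that switches feed into the ECMP hash. Switch-side dynamic load balancing (DLB) attacks the same collision problem from inside the ASIC.
Aquila is a separate Google research fabric (NSDI 2022): a cell-switched, low-latency datacenter network that does not speak standard Ethernet internally. It is not the TPU interconnect. Out of scope for RoCE-land, but worth knowing it exists.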
The surprise: Google built Falcon because they own the entire stack down to the application code (Pathways, JAX). They can afford to break NIC-side ecosystem compatibility because their training framework is in-house. Almost no one else can.
NVIDIA reference — DGX SuperPOD + Spectrum-X + ConnectX
Public stance: "Here's the reference design that works. Everyone else is some variant of this."
| Layer | NVIDIA reference (DGX SuperPOD) |
|---|---|
| GPU | H100 / B100 / B200 / B300 |
| Scale-up fabric | NVLink + NVSwitch (H100 generation: 18 links, 900 GB/s per GPU) |
| Scale-out fabric NIC | ConnectX-7 (400G) / ConnectX-8 (800G) |
| Switch | Spectrum-4 (NVIDIA) or merchant (Arista/Cisco/Dell on Broadcom) |
| Switch OS | Cumulus Linux (NVIDIA) or SONiC |
| Transport | RoCEv2 with DCQCN + PFC |
| Collective lib | NCCL + DOCA-OFED |
| Recommended topology | Rail-optimized Clos (3-tier for >2K GPUs) |
Spectrum-X is NVIDIA's competing fabric to Falcon/DSF. It's their answer to "everyone else is fleeing PFC." Spectrum-X adds:
- Adaptive routing (packet spraying across all paths)
- Per-flow congestion isolation
- Tighter DCQCN coupling between NIC and switch
- "RoCE+", a slight variant on standard RoCEv2
The surprise: NVIDIA isn't fleeing PFC — they're doubling down on it but making the NIC+switch tightly coupled so PFC almost never has to fire. The bet is that the operational simplicity of "still looks like RoCEv2" beats Falcon/DSF's clean-room redesigns for the typical buyer.
xAI's Colossus reportedly uses Spectrum-X end-to-end.
xAI — single-socket boxes, brutally aggressive scale
Public stance (from public talks and papers by Igor Babuschkin, Greg Yang, and others): "Optimize for scale, latency, and operational simplicity. Cut anything that gets in the way."
| Choice | xAI approach |
|---|---|
| Server form factor | Single-socket boxes (avoid NUMA crossings) |
| GPU count | H100 / H200 / B200, scaled to ~200K GPUs at Memphis Colossus |
| Fabric | Reportedly Spectrum-X at Memphis |
| NIC | ConnectX-7 / Spectrum-X coupled |
| Cooling | Liquid-cooled rear-door heat exchangers |
| Network ops team | Tiny — rumored ~10 people for the entire Colossus network |
| Time-to-deploy | Memphis built in months, not years |
xAI runs single-socket because they refuse to pay the NUMA tax. Dual-socket boxes are cheaper per GPU but introduce NUMA-pinning complications — every GPU/NIC has to be mapped to the right CPU socket, every collective has to respect the topology, every misconfiguration costs throughput. Single-socket eliminates the problem entirely. Every GPU and every NIC is on one CPU. Software complexity drops.
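The NUMA-pinning check described above can be reduced to a small audit. This sketch assumes you already have each device's NUMA node (on Linux these can be read from `/sys/bus/pci/devices/<bdf>/numa_node`). The device names, the topology dictionaries, and the function name are hypothetical.

```python
def cross_socket_pairs(gpu_numa: dict, nic_numa: dict, gpu_to_nic: dict):
    """Return (gpu, nic) pairs whose two devices sit on different NUMA nodes."""
    return sorted((gpu, nic) for gpu, nic in gpu_to_nic.items()
                  if gpu_numa[gpu] != nic_numa[nic])


# Hypothetical dual-socket box with two GPUs wired to the "wrong" NIC.
GPU_NUMA = {"gpu0": 0, "gpu1": 0, "gpu2": 1, "gpu3": 1}
NIC_NUMA = {"mlx5_0": 0, "mlx5_1": 1}
GPU_TO_NIC = {"gpu0": "mlx5_0", "gpu1": "mlx5_1",
              "gpu2": "mlx5_1", "gpu3": "mlx5_0"}

for gpu, nic in cross_socket_pairs(GPU_NUMA, NIC_NUMA, GPU_TO_NIC):
    print(f"{gpu} -> {nic} crosses the socket interconnect")
```

On a single-socket box every device reports the same node, so the audit returns an empty list by construction. That is the operational simplification the single-socket choice buys.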
The trade: single-socket boxes have less aggregate CPU and RAM per GPU. If your training workload is GPU-bound (most are), the lost CPU headroom doesn't matter. If you're doing heavy data-loader work or CPU-side preprocessing, single-socket starts to bottleneck.
The surprise: xAI moved fast. They built ~100K GPUs in Memphis in under a year by ruthlessly cutting anything that wasn't on the throughput-per-week critical path. The single-socket bet is a great example of choosing operational simplicity over per-rack cost optimization.
Microsoft Azure — the canonical DCQCN-at-400G shop
Public stance: "Standard RoCEv2 + DCQCN works, but you have to tune it precisely at every generation."
| Layer | What Azure runs |
|---|---|
| Switch ASIC | Mostly Broadcom (Tomahawk-3/4) |
| Switch OS | SONiC (Microsoft authored it!) |
| GPU instance type | ND H100 v5, ND H200 v5, etc. |
| SR-IOV | Yes — every VM gets a passthrough VF |
| NIC | ConnectX-7 (CX-6 in older SKUs) |
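| Transport | RoCEv2 + DCQCN + PFC on the Ethernet RDMA fabric (Azure's public VM documentation lists InfiniBand for the ND-series GPU back-end) |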
| Tuning | Published, aggressive 400G tuning |
Microsoft co-authored the original DCQCN paper (SIGCOMM 2015, written at 40G), and its published tuning work remains the canonical starting point for retuning DCQCN at 400G. The headline insights:
- ECN Kmin/Kmax need to scale with link bandwidth — at 400G, target Kmin ≈ 1 MB, Kmax ≈ 5 MB
- DCQCN's Rai (additive increase rate) is too slow at 400G defaults — needs to be larger
- T_active and T_hai (the "how long without congestion before speeding up" timers) need to be shorter at 400G — congestion signals arrive faster
- CNP rate-limiting on receivers is critical to prevent CNP floods during PFC events
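The Kmin/Kmax knobs in the first bullet drive a RED-style marking curve on the switch. The sketch below shows that curve and a naive linear rescale with link speed. It is an illustration under assumptions: the `pmax` default and the 100G starting thresholds are hypothetical, and real deployments tune these empirically rather than by straight multiplication.

```python
def ecn_mark_probability(queue_bytes: float, kmin: float, kmax: float,
                         pmax: float = 0.1) -> float:
    """RED-style ECN curve: 0 below Kmin, linear up to Pmax, then always mark."""
    if queue_bytes <= kmin:
        return 0.0
    if queue_bytes >= kmax:
        return 1.0
    return pmax * (queue_bytes - kmin) / (kmax - kmin)


def scale_thresholds(kmin: float, kmax: float,
                     from_gbps: float, to_gbps: float):
    """Naive assumption: thresholds grow in proportion to link bandwidth."""
    factor = to_gbps / from_gbps
    return kmin * factor, kmax * factor


# Hypothetical 100G thresholds, rescaled to 400G.
kmin_400, kmax_400 = scale_thresholds(250_000, 1_250_000, 100, 400)
for depth in (500_000, 3_000_000, 6_000_000):
    p = ecn_mark_probability(depth, kmin_400, kmax_400)
    print(f"queue {depth:>9} bytes -> mark probability {p:.3f}")
```

If the thresholds are left at their slower-link values, the same queue depth in bytes represents far less queueing time, so the switch marks too early and DCQCN throttles senders that were not actually congesting the link.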
The surprise: Microsoft owns SONiC and the canonical DCQCN tuning playbook. At 100K+ GPUs across Azure, they have proven that standard RoCEv2 can be made to work — provided you treat DCQCN tuning as a per-generation engineering exercise, not a one-time config.
AWS — EFA, SRD, custom Annapurna silicon
Public stance: "RoCEv2 is fine for everyone else. We have custom silicon, so we're going to build something better."
| Layer | What AWS runs |
|---|---|
| NIC | Annapurna Nitro NIC (custom AWS silicon) |
| RDMA library | EFA (Elastic Fabric Adapter) — exposes libfabric, not verbs |
| Transport | SRD (Scalable Reliable Datagram) — NOT RoCEv2 |
| Multi-path | SRD sprays packets across multiple paths natively |
| Ordering | SRD allows out-of-order delivery to the app |
| Lossless | No PFC required — SRD handles re-transmits in-NIC at sub-µs |
| Per-VM passthrough | Yes: EFA-enabled instance types expose dedicated EFA devices to the instance |
| NCCL plugin | aws-ofi-nccl translates NCCL → libfabric → EFA |
SRD's design is worth understanding because it's the clearest preview of where the industry is going.
Traditional RoCEv2 + ECMP
─────────────────────────
Flow F → hash(5-tuple) → ALWAYS path A
Path A congested → flow F stalls
Other paths (B, C, D) sit idle
Per-flow ordering preserved (good)
Per-flow throughput limited to one path's capacity
SRD (AWS)
─────────
Flow F → packets sprayed across paths A, B, C, D
Each packet takes the least-congested path
Receiver reassembles out-of-order packets
Per-flow throughput = SUM of all paths' available capacity
App may see out-of-order delivery (must tolerate)
This is the same idea as UEC's multi-path mode and Meta DSF's cell spraying. SRD shipped years before UEC standardized it.
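The difference between the two diagrams can be reproduced with a toy load model. This is a sketch, not SRD: the SHA-256 hash stands in for a switch's ECMP hash, round-robin stands in for SRD's congestion-aware path choice, and the flow tuples are made up.

```python
import hashlib


def ecmp_path(flow, num_paths: int) -> int:
    """Stand-in for a switch ECMP hash: one flow always maps to one path."""
    digest = hashlib.sha256(repr(flow).encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_paths


def ecmp_loads(flows: dict, num_paths: int):
    loads = [0] * num_paths
    for flow, packets in flows.items():
        loads[ecmp_path(flow, num_paths)] += packets
    return loads


def spray_loads(flows: dict, num_paths: int):
    """Per-packet spraying; round-robin is an assumed stand-in policy."""
    loads = [0] * num_paths
    i = 0
    for packets in flows.values():
        for _ in range(packets):
            loads[i % num_paths] += 1
            i += 1
    return loads


# Eight hypothetical elephant flows of 1000 packets each, four equal paths.
flows = {("10.0.0.%d" % n, "10.0.1.1", 4791, 49152 + n, "udp"): 1000
         for n in range(8)}
print("ECMP :", ecmp_loads(flows, 4))
print("spray:", spray_loads(flows, 4))
```

With few large flows, the hash can stack several of them on one path while another sits idle. Spraying evens the load out by construction, at the price of reordering at the receiver.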
The catch: EFA isn't compatible with standard libibverbs. NCCL works via the aws-ofi-nccl plugin, which translates NCCL → libfabric → EFA. Most code that "just works" on RoCEv2 through verbs has to be ported to libfabric before it can use EFA. AWS bears that ecosystem cost willingly because they control the platform.
The surprise: AWS happily broke verbs API compatibility to ship SRD, announcing EFA in 2018. They were the first hyperscaler to offer RDMA-class networking without RoCEv2, and SRD had been in production at cloud scale for years before UEC existed.
Alibaba — HPCC + in-network telemetry
Public stance: "Switches know more about congestion than endpoints. Let them help."
| Layer | What Alibaba runs |
|---|---|
| Transport | RoCEv2 (largely) |
| CC | HPCC (INT-based) |
| Lossless | Near-zero PFC events claimed in production |
| Switch ASIC | Tomahawk 3/4 class merchant silicon (Hanguang is Alibaba's inference accelerator, not a switch chip) |
| Switch OS | Mixed; some SONiC, some proprietary |
| Cloud RDMA | eRDMA for tenant RDMA in VPC |
HPCC details (from the SIGCOMM 2019 paper):
- Switches stamp per-hop queue depth, link utilization, and timestamp into INT-capable headers
- Receiver echoes the INT vector back in the ACK
- Sender's algorithm uses INT data to compute "optimal window size" directly from the most-congested hop on path
- Convergence: 1 RTT to the optimal rate (vs DCQCN's 5–10 RTTs)
- Production at Alibaba: PFC pause counters drop by orders of magnitude vs DCQCN baseline
The surprise: HPCC is the proof point that INT-based CC works in production at scale. It was published a year before Swift and several years before Falcon was announced. UEC's multi-signal CC is in many ways an industrial generalization of HPCC's design philosophy.
Anthropic — pragmatic, less-documented, multi-cloud + Trainium
Public stance (mostly inferred — Anthropic publishes less than peers): "Use the best available silicon, optimize the application stack, treat the network as infrastructure not differentiation."
| Layer | What Anthropic uses |
|---|---|
| GPU types | NVIDIA H100, B200 + AWS Trainium2/3 |
| Where the compute lives | Heavily on AWS (EFA + Trainium) + GCP (NCCL + RoCE) |
| Network stack on AWS workloads | EFA + SRD via aws-ofi-nccl |
| Network stack on GCP workloads | Standard NCCL + RoCEv2 (via GCP A3 ultra instances) |
| Training framework | Internal (Claude training stack; JAX for some, PyTorch for others) |
What we can infer:
- Anthropic's network engineers spend more time on storage I/O, dataloader pipelines, and checkpoint replication than on RoCE tuning. The network largely "just works" because they ride cloud providers' battle-tested fabrics.
- They have strong opinions on interpretability of failures — they favor systems where every failure mode can be diagnosed end-to-end. That biases them toward NCCL + RoCEv2 (well-known failure modes) over EFA (newer, harder to debug from the application side) — except where AWS economics force EFA.
- Their public emphasis on safety and reliability of training translates to networking decisions favoring conservatism over peak throughput: a run that completes reliably at 95% of theoretical bandwidth beats one that chases the last few percent and fails.
The surprise: Anthropic publishes less about networking than Google or Meta because much of their differentiation is in how they use the network, not in custom network hardware. They're a vendor-fabric-characterization shop, not a custom-fabric-design shop.
The big comparison matrix
Side-by-side, the entire design space:
| Org | Transport | CC algo | Lossless mechanism | Multi-path | Switch silicon | NIC vendor | AI scale |
|---|---|---|---|---|---|---|---|
| Meta DSF (Oct 2025) | Cell-switched proprietary | Credit-based | NO PFC (credits) | Per-cell spray (perfect ECMP) | Broadcom + Cisco (FBOSS) | Multi-vendor: Broadcom Thor, Marvell, NVIDIA | ~90K L3 region |
| Google Swift + Falcon | Host transports with Swift CC, or hardware Falcon | Delay-based (Swift); built into the NIC for Falcon | NO PFC (host pacing) | PLB (host-side flow-label repathing) | Mostly Broadcom | NVIDIA + Google custom (Falcon HW) | 100K+ at peak |
| NVIDIA reference | RoCEv2 + Spectrum-X RoCE+ | DCQCN (tunable) | PFC + ECN | Spectrum-X adaptive routing | Spectrum-4 or Broadcom | CX-7 / CX-8 | DGX SuperPOD reference |
| xAI Colossus | RoCEv2 + Spectrum-X | DCQCN + Spectrum-X | PFC + ECN | Spectrum-X adaptive routing | NVIDIA Spectrum-4 | NVIDIA CX-7 | ~200K GPUs |
| Microsoft Azure | RoCEv2 (standard) | DCQCN, 400G-tuned | PFC + ECN | 5-tuple ECMP | Broadcom (SONiC) | NVIDIA CX-7 | 100K+ across Azure |
| AWS EFA | SRD (custom) | End-host pacing | NO PFC (host re-tx) | Per-packet spray | AWS custom Annapurna | AWS Nitro / Annapurna | Largest in cloud |
| Alibaba HPCC | RoCEv2 + INT headers | HPCC (INT-based) | Near-zero PFC | 5-tuple ECMP | Broadcom (INT-capable) | NVIDIA CX-6/7 | ~100K |
| Anthropic | Mix: EFA (AWS) + RoCEv2 (GCP/own iron) + Trainium fabric | Inherits from cloud | Inherits from cloud | Inherits from cloud | Inherits from cloud | Mix | Renting at hyperscale |
| UEC future (industry) | UET (replaces RoCEv2) | End-host pacing + multi-signal | NO PFC (pacing) | Per-packet spray (default) | Any (open standard) | Any (open standard) | Target: 1M GPUs |
Four patterns jump out:
- PFC is on the way out. Every "future" design (DSF, Falcon, EFA, UEC) is PFC-free. The DCQCN+PFC stack is the conservative, well-understood option that buys time until UEC arrives.
- Multi-path is the new default. Everyone except the pure-DCQCN shops (Azure, Alibaba's RoCE side) is moving toward per-packet or per-cell spraying. Single-path 5-tuple ECMP is the relic.
- NVIDIA NIC dominance is real but cracking. AWS uses Annapurna. Meta runs multi-vendor (Broadcom Thor, Marvell, NVIDIA). Google has its own silicon (Falcon HW). Microsoft Azure and xAI are still pure NVIDIA-NIC shops.
- Custom transport is the differentiator. Google (Falcon), AWS (SRD), Meta (DSF), and eventually UEC all replace the wire protocol. RoCEv2 is the laggard — well-loved, but technically the oldest design in the matrix.
The UEC / NIXL / Pensando / Thor3 roadmap
Five near-term shifts on a 6–24 month horizon.
Ultra Ethernet Consortium (UEC) — 12–24 month horizon
UEC 1.0 published. UEC 1.1, expected mid-2026, finalizes the on-the-wire spec.
NIC silicon supporting UEC:
- NVIDIA ConnectX-9 (rumored 2026) — first NVIDIA UEC NIC
- Broadcom Thor3 (shipping 2025) — already UEC-aware
- AMD Pensando Pollara 400 (2024–2025) — UEC-first design
Switch silicon supporting UEC:
- Broadcom Tomahawk-6 — first UEC-class switching ASIC
- NVIDIA Spectrum-5 — likely UEC-compatible alongside RoCE+
What changes when UEC arrives:
- DCQCN goes away. Replaced by UET's end-host pacing.
- PFC dependency eliminated. Switches won't need lossless queue config in the same way.
- ECMP collision becomes a non-issue (UET sprays packets natively).
- Per-host source routing tricks become mostly irrelevant: UET varies a per-packet entropy value, so path selection happens inside the transport rather than in operator-managed routing policy.
This is a host-stack rewrite. The way you reason about a RoCEv2 fabric does not survive UEC adoption.
NVIDIA NIXL — 6–12 month horizon
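NIXL = NVIDIA Inference Xfer Library. It is a point-to-point data-movement library, open-sourced alongside NVIDIA Dynamo in 2025, and it targets inference rather than training collectives.
The idea: disaggregated serving splits prefill and decode across different GPUs, so KV-cache blocks have to move between GPU memory, host memory, and storage with low latency. NIXL puts one asynchronous transfer API over pluggable backends such as UCX and GPUDirect Storage.
Becomes relevant as inference fleets adopt disaggregated prefill/decode. It complements NCCL rather than replacing it, so it changes the inference fabric's traffic mix more than the training fabric's.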
AMD Pensando + AMD Pollara
AMD bought Pensando in 2022. The Pollara 400 NIC is their first AI-targeting SmartNIC. It's UEC-native, supports both RDMA and SmartNIC functions (security offload, encryption, telemetry).
The relevance: AMD as a second-source NIC is increasingly viable. Even for shops that don't switch from NVIDIA, having a credible alternative in the procurement conversation is worth real money.
Intel IPU + Intel Mount Evans
Intel's IPU line (codenamed Mount Evans; not to be confused with Tofino, Intel's discontinued programmable switch ASIC) is currently weaker than NVIDIA/AMD in AI fabric features. Unlikely to be a real contender for B300-generation deployments; watch the next-gen. Note that Intel's E2100 IPU is the silicon Google's Falcon transport runs on, which is a meaningful Falcon-ecosystem signal.
Broadcom Thor3 / Thor4
Already shipping in production at Meta and others. Native UEC, multi-path, INT support. The real choice for any operator that wants to escape single-vendor NIC dependency without going to AMD.
What you should remember
- The hyperscaler RoCE design space has four escape routes from PFC: smarter end-host CC (Swift, HPCC), cell-switched fabrics (Meta DSF), custom transports (Falcon, SRD, UET), and adaptive routing on top of RoCEv2 (Spectrum-X).
- PFC + DCQCN works fine at 4K–16K GPUs. It starts to crack at 32K+ and breaks at 100K+. Every hyperscaler operating at 100K+ has either left RoCEv2 (AWS, Google), built around it (Meta DSF), or paired it with adaptive routing (NVIDIA Spectrum-X, xAI).
- Microsoft is the proof point that DCQCN can scale to 100K+ — but only if you treat 400G tuning as a per-generation engineering exercise, not a one-time config.
- AWS SRD is the cleanest preview of where the industry is going: packet spraying, out-of-order delivery, no PFC, hardware reassembly. UEC is essentially the open-standards version of the same idea.
- xAI's single-socket bet is the most underrated decision in the matrix. Eliminating NUMA is worth more than the CPU/RAM you give up, if your workload is GPU-bound (most AI training is).
- UEC adoption is a host-stack rewrite, not a config change. When CX-9 / Thor3 / Pollara ship in volume, the DCQCN-tuning skill set becomes partly obsolete and the UET-pacing skill set takes over.
- The matrix in one line: NVIDIA wants to evolve RoCEv2 (Spectrum-X); Google and AWS have already replaced it (Falcon, SRD); Meta has replaced the fabric (DSF); Alibaba and Microsoft are squeezing the last out of RoCEv2 + DCQCN; UEC is where everyone meets in 2027.
Next: Switch QoS → — the switch-side configuration that makes any of these stacks actually deliver. DSCP-to-TC mapping, ECN watermarks, PFC headroom, buffer carving — the knobs you turn whether you're running plain RoCEv2 + DCQCN or paired with adaptive routing.