Switches for AI
Every GPU fabric is built out of switches — but the AI era narrowed the field to a handful of vendors and a handful of network operating systems. This page is the lay of the land: who builds the box, what OS it runs, whose silicon is inside, and what AI actually changes about a switch.
- Name the dominant AI switch vendors and the network OS each one runs.
- Tell which silicon sits under each platform — and why so many roads lead back to Broadcom.
- Explain the handful of switch features AI traffic specifically demands.
- Frame the two competing philosophies — NVIDIA's full stack versus merchant-silicon-plus-your-NOS — and why RoCE v2 keeps them interoperable.
Why an AI switch isn't a regular datacenter switch
An AI-fabric switch and a classic top-of-rack can have the same port count and the same 800G optics and still be completely different machines — because the traffic is different.
A classic datacenter carries millions of small, independent, mostly-TCP flows. Statistical multiplexing smooths the load, and the occasional dropped packet is fine — TCP just retransmits. The switch's job is cheap, fast forwarding.
AI training traffic is the opposite on every axis:
- Synchronized and bursty — every GPU finishes a compute step at nearly the same instant and fires AllReduce together: a wall of traffic, then silence, repeating every few milliseconds.
- A few enormous flows, not millions of tiny ones — one queue pair can be 400 Gbps, so ECMP's per-flow hash has almost nothing to spread.
- Incast — many GPUs send to one at the same time (the "reduce" step), piling up at a single egress port.
- Zero drop tolerance — RDMA's go-back-N retransmit is brutal, so one drop can stall a whole collective. The fabric has to be lossless.
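The ECMP point in the list above can be illustrated with a small simulation. This is a minimal sketch, not a model of any real switch: it assumes equal-sized flows and stands in for the 5-tuple hash with a uniform random choice of uplink. The function names and the 8-uplink figure are made up for illustration.

```python
import random
from collections import Counter


def busiest_link_share(num_flows: int, num_links: int, rng: random.Random) -> float:
    """Place each flow on one uplink at random (a stand-in for a per-flow
    hash) and return the fraction of all flows on the busiest uplink."""
    links = Counter(rng.randrange(num_links) for _ in range(num_flows))
    return max(links.values()) / num_flows


def mean_busiest_share(num_flows: int, num_links: int,
                       trials: int = 100, seed: int = 0) -> float:
    """Average the busiest-uplink share over several random placements."""
    rng = random.Random(seed)
    total = sum(busiest_link_share(num_flows, num_links, rng) for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    LINKS = 8  # hypothetical number of equal-cost uplinks
    print(f"ideal share per uplink: {1 / LINKS:.3f}")
    for flows in (8, 10_000):
        share = mean_busiest_share(flows, LINKS)
        print(f"{flows:>6} flows -> busiest uplink carries {share:.3f} of the traffic")
```

With thousands of small flows the busiest uplink lands close to the ideal share. With only as many flows as uplinks, collisions are the norm, so some uplinks sit idle while another carries two or more elephant flows.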
That traffic forces hardware choices a normal ToR never makes:
| Dimension | Regular DC switch | AI-fabric switch |
|---|---|---|
| Packet buffer / RAM | modest on-chip, tens of MB, tuned for mixed bursts | either a big tuned on-chip pool (Tomahawk) or deep off-chip HBM buffers, GBs (Jericho) to absorb incast |
| Lossless | none: drops are fine, TCP recovers | PFC + ECN marking in silicon (the switch side of DCQCN, whose rate control runs on the NICs): per-priority no-drop queues, reserved headroom, WRED marking |
| Radix & speed | often 100G, lower aggregate | high radix at 800G (51.2 Tb/s, 64× 800G) to build big non-blocking Clos with fewer tiers |
| Latency / jitter | "good enough" | low and consistent — one slow path stalls the whole AllReduce, so jitter is the real enemy |
| Load balancing | ECMP (fine for many small flows) | adaptive routing / DLB / packet spraying in silicon, because ECMP collapses on elephant flows |
| Telemetry | SNMP / coarse counters | streaming per-queue / in-band telemetry to catch microbursts and PFC storms before they cascade |
So — is it "just more RAM"? Partly. More buffer is one answer (the deep-buffer Jericho approach), but it's not the only one: the shallow-buffer Tomahawk approach goes the other way — keep the buffer small and fast, and engineer losslessness with PFC and ECN instead. That fork — buffer it vs schedule it — is the silicon decision on the next page. The constant across both is everything else in the table: lossless, high radix, low jitter, adaptive routing, deep telemetry. Those are what make a switch an AI switch.
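To put rough numbers on the "buffer it" side of that fork, here is a back-of-the-envelope sketch of how much queue an incast builds. It assumes every sender and the egress port run at the same line rate, all senders start at the same instant, and nothing (PFC, ECN, congestion control) slows them down. The function names and the example figures are illustrative, not taken from any vendor datasheet.

```python
def incast_peak_queue_bytes(senders: int, burst_bytes_per_sender: int) -> int:
    """Peak egress queue when `senders` ports each send one burst to a single
    egress port of the same speed. While the bursts arrive, the port drains
    one sender's worth of traffic, so the rest piles up."""
    return max(senders - 1, 0) * burst_bytes_per_sender


def drain_time_us(queue_bytes: int, egress_gbps: float) -> float:
    """Microseconds needed to drain the queue at the egress line rate."""
    bits_per_microsecond = egress_gbps * 1e3
    return queue_bytes * 8 / bits_per_microsecond


if __name__ == "__main__":
    # Hypothetical example: 32 GPUs each send a 1 MB chunk to one 800G port.
    peak = incast_peak_queue_bytes(senders=32, burst_bytes_per_sender=1_000_000)
    print(f"peak queue: {peak / 1e6:.0f} MB")
    print(f"time to drain at 800G: {drain_time_us(peak, 800):.0f} us")
```

Under these assumptions the queue grows linearly with the number of senders. That is why a switch either needs a buffer sized for the burst or needs PFC and ECN to slow the senders before the burst fully arrives.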
The players
Five camps own the AI-switch conversation. Each pairs a hardware platform with a network OS. Underneath most of them is a merchant ASIC, usually Broadcom; Cisco and Juniper also ship their own silicon alongside it, and NVIDIA runs its own Spectrum-4 instead.
| Vendor | AI switch platform | Network OS | Silicon under it | Where it fits |
|---|---|---|---|---|
| NVIDIA | Spectrum-X (SN5600: 51.2 Tb/s, 64× 800G) | Cumulus Linux / NVUE | Its own Spectrum-4 silicon | Full-stack: adaptive routing in silicon, NIC + switch co-tuning |
| Arista | 7060X (leaf/spine), 7800R3 (deep-buffer modular), 7700R4 AI (distributed Etherlink) | EOS | Broadcom Tomahawk + Jericho | Rich NOS, deep-buffer option, many large AI design wins |
| Cisco | Nexus 9000 (Cloud Scale) + Cisco 8000 (Silicon One) | NX-OS / IOS-XR | Broadcom + its own Cisco Silicon One | Enterprise reach + own silicon; Nexus HyperFabric AI |
| Juniper (HPE) | QFX5000 (leaf), PTX (deep-buffer spine) | Junos | Broadcom + its own Express | Apstra automation, AI-native push (HPE acquired Juniper in 2025) |
| White-box / open | Dell, Edgecore, Celestica, Supermicro (bare-metal boxes) | SONiC (community + Dell/Enterprise SONiC) | Broadcom | Most open, disaggregated, hyperscaler-favored |
NVIDIA is the only player selling a vertically integrated answer. Spectrum-X runs NVIDIA's own Spectrum-4 silicon, and the headline SN5600 pushes 51.2 Tb/s across 64 ports of 800G. The pitch is co-tuning: the switch and the ConnectX NIC are designed to cooperate, with adaptive routing baked into silicon.
Arista is the merchant-silicon heavyweight, running EOS on Broadcom Tomahawk and Jericho. The 7060X covers leaf and spine, the 7800R3 is the deep-buffer modular option, and the 7700R4 AI extends Etherlink into a distributed fabric. EOS plus a deep-buffer choice is why Arista shows up in so many large AI design wins.
Cisco brings enterprise reach and its own silicon. The Nexus 9000 Cloud Scale line runs NX-OS, while the Cisco 8000 runs IOS-XR on Cisco Silicon One — and Nexus HyperFabric AI is the packaged AI play.
Juniper, now part of HPE after the 2025 acquisition, runs Junos on the QFX5000 leaf and the deep-buffer PTX spine. Apstra automation and an AI-native push are the differentiators, with silicon split between Broadcom and Juniper's own Express.
White-box / open is the disaggregated camp: bare-metal boxes from Dell, Edgecore, Celestica, and Supermicro running SONiC — community or Dell/Enterprise SONiC — almost always on Broadcom. It is the most open, hyperscaler-favored path, where you buy the box and the OS separately.
What AI changes about a switch
A switch in an AI fabric carries a different burden than a switch in a classic web datacenter. The traffic is bursty, synchronized, and unforgiving of loss, so the box has to do more:
- Lossless RoCE — PFC plus ECN so RDMA traffic doesn't drop under congestion.
- Adaptive routing / DLB — spread elephant flows across paths instead of hashing them onto one and polarizing.
- Deep buffers vs shallow — absorb incast bursts, or keep latency low; the spine choice is a real tradeoff.
- High radix at 800G — pack as many 800G ports as possible per box to flatten the topology.
- Rich streaming telemetry — per-queue, per-port visibility fast enough to catch a microburst.
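The high-radix point in the list above can be made concrete with the standard sizing arithmetic for a non-blocking Clos. This is a minimal sketch that assumes identical switches, every port at the same speed, half of each leaf's ports facing down and half facing up (1:1 oversubscription), and one port per endpoint. Real designs with breakout ports or rail-optimized layouts will differ, and the function names are illustrative.

```python
def two_tier_endpoints(radix: int) -> int:
    """Max endpoints in a non-blocking leaf-spine fabric.
    Each leaf uses radix/2 ports down; a spine with `radix` ports
    can reach at most `radix` leaves."""
    if radix < 2 or radix % 2:
        raise ValueError("radix must be a positive even number")
    return radix * (radix // 2)


def three_tier_endpoints(radix: int) -> int:
    """Max endpoints in a non-blocking three-tier fat tree (k^3 / 4)."""
    if radix < 2 or radix % 2:
        raise ValueError("radix must be a positive even number")
    return radix ** 3 // 4


if __name__ == "__main__":
    for radix in (32, 64):
        print(f"radix {radix}: two tiers -> {two_tier_endpoints(radix)} endpoints, "
              f"three tiers -> {three_tier_endpoints(radix)} endpoints")
```

Doubling the radix quadruples what two tiers can reach, so a higher-radix box lets a cluster of a given size stay in fewer tiers, with fewer hops and fewer optics.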
The two philosophies
Strip away the model numbers and there are really two ways to build an AI fabric.
NVIDIA's vertically integrated full stack — GPU plus NVLink plus ConnectX plus Spectrum plus CUDA/NCCL — is one product where every layer is tuned against every other. You buy the stack and the co-tuning comes with it.
Merchant-silicon plus choose-your-NOS is the other — Arista, Cisco, Juniper, and SONiC all building largely on Broadcom, letting you mix the box, the silicon, and the operating system. You give up some co-tuning and get openness, second sources, and negotiating leverage in return.
The reason these two worlds still talk to each other is RoCE v2 — the neutral protocol both philosophies speak. Pick either camp and the wire format between NIC and switch is the same; what differs is who tuned the stack and how much of it you bought from one vendor.
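As a rough illustration of that shared wire format: RoCE v2 carries the InfiniBand transport headers inside UDP, with destination port 4791, and the first 12 bytes of the UDP payload are the Base Transport Header (BTH). The sketch below decodes a few BTH fields using the layout as it is commonly documented. The class and function names are made up for this example, the second and fifth bytes are deliberately left undecoded, and this is not a full parser.

```python
import struct
from typing import NamedTuple

ROCEV2_UDP_DST_PORT = 4791  # IANA-assigned UDP destination port for RoCE v2
BTH_LEN = 12                # the Base Transport Header is 12 bytes


class BaseTransportHeader(NamedTuple):
    opcode: int
    partition_key: int
    dest_qp: int          # 24-bit destination queue pair number
    ack_requested: bool
    psn: int              # 24-bit packet sequence number


def parse_bth(udp_payload: bytes) -> BaseTransportHeader:
    """Decode selected BTH fields from the start of a RoCE v2 UDP payload."""
    if len(udp_payload) < BTH_LEN:
        raise ValueError("payload shorter than a 12-byte BTH")
    # Network byte order: opcode, flags byte, P_Key, then two 32-bit words
    # whose low 24 bits hold the destination QP and the PSN respectively.
    opcode, _flags, pkey, qp_word, psn_word = struct.unpack(
        "!BBHII", udp_payload[:BTH_LEN]
    )
    return BaseTransportHeader(
        opcode=opcode,
        partition_key=pkey,
        dest_qp=qp_word & 0xFFFFFF,
        ack_requested=bool(psn_word >> 31),
        psn=psn_word & 0xFFFFFF,
    )


if __name__ == "__main__":
    # Hand-built example header, not captured traffic.
    sample = struct.pack("!BBHII", 0x04, 0x00, 0xFFFF, 0x000012, (1 << 31) | 0x000ABC)
    print(parse_bth(sample))
```

Any switch on the path sees an ordinary UDP/IP packet, which is what lets a NIC from one camp run over a switch from the other.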
💡 What you should remember
| # | | Concept | Why it matters |
|---|---|---|---|
| 1 | 🏗️ | Five camps: NVIDIA, Arista, Cisco, Juniper/HPE, white-box SONiC | These are the realistic choices for an AI fabric switch tier. |
| 2 | 🔩 | Almost everyone rides Broadcom | Arista, Cisco, Juniper, and SONiC all build largely on Broadcom; Cisco and Juniper also ship their own silicon, and NVIDIA runs its own Spectrum-4 instead. |
| 3 | 📦 | The OS is the differentiator | EOS, NX-OS/IOS-XR, Junos, Cumulus/NVUE, and SONiC are how vendors compete once the silicon is similar. |
| 4 | ⚡ | AI demands lossless + adaptive + telemetry | RoCE (PFC+ECN), adaptive routing/DLB, deep-vs-shallow buffers, 800G radix, and streaming telemetry are the AI-specific asks. |
| 5 | 🤝 | RoCE v2 keeps the camps interoperable | Full-stack NVIDIA and merchant-silicon-plus-your-NOS both speak the same neutral protocol on the wire. |
When you want the config side of this story, see Switch QoS — PFC and the 5-vendor config matrix. For what's actually inside the box, see Switch Silicon.