
Switches for AI

Every GPU fabric is built out of switches — but the AI era narrowed the field to a handful of vendors and a handful of network operating systems. This page is the lay of the land: who builds the box, what OS it runs, whose silicon is inside, and what AI actually changes about a switch.

After this page, you'll be able to
  • Name the dominant AI switch vendors and the network OS each one runs.
  • Tell which silicon sits under each platform — and why so many roads lead back to Broadcom.
  • Explain the handful of switch features AI traffic specifically demands.
  • Frame the two competing philosophies — NVIDIA's full stack versus merchant-silicon-plus-your-NOS — and why RoCE v2 keeps them interoperable.

Why an AI switch isn't a regular datacenter switch

An AI-fabric switch and a classic top-of-rack can have the same port count and the same 800G optics and still be completely different machines — because the traffic is different.

A classic datacenter carries millions of small, independent, mostly-TCP flows. Statistical multiplexing smooths the load, and the occasional dropped packet is fine — TCP just retransmits. The switch's job is cheap, fast forwarding.

AI training traffic is the opposite on every axis:

  • Synchronized and bursty — every GPU finishes a compute step at nearly the same instant and fires AllReduce together: a wall of traffic, then silence, repeating every few milliseconds.
  • A few enormous flows, not millions of tiny ones — one queue pair can be 400 Gbps, so ECMP's per-flow hash has almost nothing to spread.
  • Incast — many GPUs send to one at the same time (the "reduce" step), piling up at a single egress port.
  • Zero drop tolerance — RDMA's go-back-N retransmit is brutal, so one drop can stall a whole collective. The fabric has to be lossless.
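
To see why a per-flow hash struggles when there are only a few elephant flows, here is a minimal simulation sketch. It assumes the ECMP hash behaves like a uniform random pick of one path per flow and that all flows are the same size. The function names and the flow and path counts are illustrative, not taken from any vendor's implementation.

```python
import random
from collections import Counter


def busiest_path_flows(num_flows: int, num_paths: int, rng: random.Random) -> int:
    """Place each flow on one path (uniform random stand-in for an ECMP hash)
    and return how many flows landed on the busiest path."""
    loads = Counter(rng.randrange(num_paths) for _ in range(num_flows))
    return max(loads.values())


def mean_imbalance(num_flows: int, num_paths: int, trials: int = 100, seed: int = 0) -> float:
    """Average ratio of the busiest path's load to a perfectly even share."""
    rng = random.Random(seed)
    even_share = num_flows / num_paths
    total = sum(busiest_path_flows(num_flows, num_paths, rng) for _ in range(trials))
    return total / trials / even_share


if __name__ == "__main__":
    # Hypothetical numbers: 8 uplinks, 8 elephant flows versus 4,000 small flows.
    print("8 elephant flows:  ", round(mean_imbalance(8, 8), 2))
    print("4,000 small flows: ", round(mean_imbalance(4000, 8), 2))
```

With thousands of small flows the busiest path ends up close to its even share. With only as many flows as paths, hash collisions typically put two or more elephants on one link while another sits idle, which is the gap adaptive routing and packet spraying aim to close.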

That traffic forces hardware choices a normal ToR never makes:

| Dimension | Regular DC switch | AI-fabric switch |
| --- | --- | --- |
| Packet buffer / RAM | modest on-chip, tens of MB, tuned for mixed bursts | either a big tuned on-chip pool (Tomahawk) or deep off-chip HBM buffers, GBs (Jericho), to absorb incast |
| Lossless | none: drops are fine, TCP recovers | PFC + ECN + DCQCN in silicon: per-priority no-drop queues, reserved headroom, WRED marking |
| Radix & speed | often 100G, lower aggregate | high radix at 800G (51.2 Tb/s, 64× 800G) to build big non-blocking Clos with fewer tiers |
| Latency / jitter | "good enough" | low and consistent: one slow path stalls the whole AllReduce, so jitter is the real enemy |
| Load balancing | ECMP (fine for many small flows) | adaptive routing / DLB / packet spraying in silicon, because ECMP collapses on elephant flows |
| Telemetry | SNMP / coarse counters | streaming per-queue / in-band telemetry to catch microbursts and PFC storms before they cascade |

So — is it "just more RAM"? Partly. More buffer is one answer (the deep-buffer Jericho approach), but it's not the only one: the shallow-buffer Tomahawk approach goes the other way — keep the buffer small and fast, and engineer losslessness with PFC and ECN instead. That fork — buffer it vs schedule it — is the silicon decision on the next page. The constant across both is everything else in the table: lossless, high radix, low jitter, adaptive routing, deep telemetry. Those are what make a switch an AI switch.
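
A rough back-of-the-envelope sketch shows why incast puts so much pressure on buffer size. It takes the worst case, where every sender transmits at line rate toward one egress port and nothing slows them down (no PFC pause, no ECN reaction). The function name and the sender counts, link speeds, and burst lengths are hypothetical values chosen for illustration.

```python
def incast_backlog_bytes(senders: int, link_gbps: int, burst_us: int) -> int:
    """Bytes left queued at one egress port after `senders` ports each send
    at line rate toward it for `burst_us` microseconds.

    Worst-case assumption: no sender backs off during the burst.
    """
    # The egress port drains one link's worth; everything above that queues.
    excess_gbps = max(senders - 1, 0) * link_gbps
    # 1 Gbps sustained for 1 microsecond is 1,000 bits, i.e. 125 bytes.
    return excess_gbps * 125 * burst_us


if __name__ == "__main__":
    print(incast_backlog_bytes(8, 400, 10))    # 3500000 bytes (3.5 MB)
    print(incast_backlog_bytes(32, 400, 100))  # 155000000 bytes (155 MB)
```

Under these assumptions a short 8-to-1 burst fits within tens of MB of on-chip buffer, while a longer 32-to-1 burst does not. The fabric then either needs a much deeper buffer or has to make the senders slow down quickly.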

The players

Five camps own the AI-switch conversation. Each pairs a hardware platform with a network OS. Four of the five build on a merchant ASIC, usually Broadcom, with Cisco and Juniper shipping their own silicon alongside it; NVIDIA runs its own Spectrum silicon throughout.

| Vendor | AI switch platform | Network OS | Silicon under it | Where it fits |
| --- | --- | --- | --- | --- |
| NVIDIA | Spectrum-X (SN5600: 51.2 Tb/s, 64× 800G) | Cumulus Linux / NVUE | Its own Spectrum-4 silicon | Full-stack: adaptive routing in silicon, NIC + switch co-tuning |
| Arista | 7060X (leaf/spine), 7800R3 (deep-buffer modular), 7700R4 AI (distributed Etherlink) | EOS | Broadcom Tomahawk + Jericho | Rich NOS, deep-buffer option, many large AI design wins |
| Cisco | Nexus 9000 (Cloud Scale) + Cisco 8000 (Silicon One) | NX-OS / IOS-XR | Broadcom + its own Cisco Silicon One | Enterprise reach + own silicon; Nexus HyperFabric AI |
| Juniper (now HPE) | QFX5000 (leaf), PTX (deep-buffer spine) | Junos | Broadcom + its own Express | Apstra automation, AI-native push (HPE acquired Juniper in 2025) |
| White-box / open | Dell, Edgecore, Celestica, Supermicro (bare-metal boxes) | SONiC (community + Dell/Enterprise SONiC) | Broadcom | Most open, disaggregated, hyperscaler-favored |
Logos: NVIDIA and Cisco marks via Simple Icons (CC0); Arista, Juniper, and SONiC shown as rendered wordmarks. All product names and logos are trademarks of their respective owners, used here for identification only.

NVIDIA is the only player selling a vertically integrated answer. Spectrum-X runs NVIDIA's own Spectrum-4 silicon, and the headline SN5600 pushes 51.2 Tb/s across 64 ports of 800G. The pitch is co-tuning: the switch and the ConnectX NIC are designed to cooperate, with adaptive routing baked into silicon.

Arista is the merchant-silicon heavyweight, running EOS on Broadcom Tomahawk and Jericho. The 7060X covers leaf and spine, the 7800R3 is the deep-buffer modular option, and the 7700R4 AI extends Etherlink into a distributed fabric. EOS plus a deep-buffer choice is why Arista shows up in so many large AI design wins.

Cisco brings enterprise reach and its own silicon. The Nexus 9000 Cloud Scale line runs NX-OS, while the Cisco 8000 runs IOS-XR on Cisco Silicon One — and Nexus HyperFabric AI is the packaged AI play.

Juniper, now part of HPE after the 2025 acquisition, runs Junos on the QFX5000 leaf and the deep-buffer PTX spine. Apstra automation and an AI-native push are the differentiators, with silicon split between Broadcom and Juniper's own Express.

White-box / open is the disaggregated camp: bare-metal boxes from Dell, Edgecore, Celestica, and Supermicro running SONiC — community or Dell/Enterprise SONiC — almost always on Broadcom. It is the most open, hyperscaler-favored path, where you buy the box and the OS separately.

What AI changes about a switch

A switch in an AI fabric carries a different burden than a switch in a classic web datacenter. The traffic is bursty, synchronized, and unforgiving of loss, so the box has to do more:

  • Lossless RoCE — PFC plus ECN so RDMA traffic doesn't drop under congestion.
  • Adaptive routing / DLB — spread elephant flows across paths instead of hashing them onto one and polarizing.
  • Deep buffers vs shallow — absorb incast bursts, or keep latency low; the spine choice is a real tradeoff.
  • High radix at 800G — pack as many 800G ports as possible per box to flatten the topology.
  • Rich streaming telemetry — per-queue, per-port visibility fast enough to catch a microburst.
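
The "flatten the topology" point can be made concrete with the standard sizing rule for a non-blocking folded Clos (fat-tree): with identical switches of radix k and t tiers, the fabric supports 2 × (k/2)^t endpoints. The sketch below applies that rule. It assumes one port per endpoint, the same speed on every link, and no oversubscription; the function names and the 1,024-endpoint example are illustrative.

```python
def max_endpoints(radix: int, tiers: int) -> int:
    """Endpoints supported by a non-blocking folded Clos built from identical
    switches with `radix` ports each: 2 * (radix / 2) ** tiers."""
    if radix < 4 or radix % 2 or tiers < 1:
        raise ValueError("radix must be an even number >= 4 and tiers >= 1")
    return 2 * (radix // 2) ** tiers


def tiers_needed(endpoints: int, radix: int) -> int:
    """Smallest number of tiers whose capacity covers `endpoints`."""
    tiers = 1
    while max_endpoints(radix, tiers) < endpoints:
        tiers += 1
    return tiers


if __name__ == "__main__":
    print(max_endpoints(64, 2))    # 2048
    print(max_endpoints(64, 3))    # 65536
    print(tiers_needed(1024, 64))  # 2
    print(tiers_needed(1024, 32))  # 3
```

Under these assumptions, a 1,024-endpoint fabric fits in two tiers of 64-port switches but needs a third tier with 32-port switches, which means more hops, more optics, and more places for jitter to creep in.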

The two philosophies

Two ways to build an AI cluster shown as parallel layered stacks. Left, vertically integrated NVIDIA: H100/B200 accelerator, NVLink + NVSwitch scale-up, ConnectX/BlueField NIC, Spectrum-X switch on NVIDIA's own silicon, CUDA + NCCL software. Right, open/merchant: AMD MI300X or Intel Gaudi, Infinity Fabric or UALink, Broadcom Thor or Intel E810 NIC, Arista/Cisco/SONiC switch on Broadcom silicon, ROCm + RCCL/oneCCL. A full-width bar underneath both reads RoCE v2 over Ethernet — the neutral ground both stacks meet on.
Same five layers, two vendor strategies — and one shared wire protocol. That common RoCE v2 layer is why this course is vendor-neutral.

Strip away the model numbers and there are really two ways to build an AI fabric.

NVIDIA's vertically integrated full stack — GPU plus NVLink plus ConnectX plus Spectrum plus CUDA/NCCL — is one product where every layer is tuned against every other. You buy the stack and the co-tuning comes with it.

Merchant-silicon plus choose-your-NOS is the other — Arista, Cisco, Juniper, and SONiC all building largely on Broadcom, letting you mix the box, the silicon, and the operating system. You give up some co-tuning and get openness, second sources, and negotiating leverage in return.

The reason these two worlds still talk to each other is RoCE v2 — the neutral protocol both philosophies speak. Pick either camp and the wire format between NIC and switch is the same; what differs is who tuned the stack and how much of it you bought from one vendor.
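
As an illustration of that shared wire format: RoCE v2 carries the InfiniBand Base Transport Header (BTH) inside a UDP datagram addressed to destination port 4791. The sketch below decodes the 12-byte BTH from a UDP payload. The `Bth` type and `parse_bth` function are names made up for this example, only a subset of the header fields is extracted, and the sample bytes are synthetic rather than a real capture.

```python
import struct
from typing import NamedTuple

ROCE_V2_UDP_PORT = 4791  # IANA-assigned UDP destination port for RoCE v2
BTH_LEN = 12             # Base Transport Header is 12 bytes


class Bth(NamedTuple):
    opcode: int    # transport operation (send, RDMA write, ack, ...)
    pkey: int      # partition key
    dest_qp: int   # 24-bit destination queue pair number
    ack_req: bool  # sender asks the responder to acknowledge
    psn: int       # 24-bit packet sequence number


def parse_bth(udp_payload: bytes) -> Bth:
    """Decode the BTH at the start of a RoCE v2 UDP payload."""
    if len(udp_payload) < BTH_LEN:
        raise ValueError("payload shorter than a BTH")
    # Layout: opcode(1) flags(1) pkey(2) | resv(1)+destQP(3) | ackreq/resv(1)+PSN(3)
    opcode, _flags, pkey, word1, word2 = struct.unpack("!BBHII", udp_payload[:BTH_LEN])
    return Bth(
        opcode=opcode,
        pkey=pkey,
        dest_qp=word1 & 0xFFFFFF,
        ack_req=bool(word2 >> 31),
        psn=word2 & 0xFFFFFF,
    )


if __name__ == "__main__":
    # Synthetic header: opcode 0x04, default P_Key, QP 0x12, AckReq set, PSN 0xABC.
    sample = struct.pack("!BBHII", 0x04, 0x00, 0xFFFF, 0x000012, (1 << 31) | 0x000ABC)
    print(parse_bth(sample))
```

Because the header rides in ordinary UDP over IP, a switch from any camp can route and hash it like other traffic; the differences between vendors lie in how they queue, mark, and load-balance it.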

💡 What you should remember

| # | Concept | Why it matters |
| --- | --- | --- |
| 1 | 🏗️ Five camps: NVIDIA, Arista, Cisco, Juniper/HPE, white-box SONiC | These are the realistic choices for an AI fabric switch tier. |
| 2 | 🔩 Almost everyone rides Broadcom | Arista, Cisco, Juniper, and SONiC all build largely on Broadcom; Cisco and Juniper also ship their own silicon, and NVIDIA uses only its own. |
| 3 | 📦 The OS is the differentiator | EOS, NX-OS/IOS-XR, Junos, Cumulus/NVUE, and SONiC are how vendors compete once the silicon is similar. |
| 4 | AI demands lossless + adaptive + telemetry | Lossless RoCE (PFC + ECN), adaptive routing/DLB, deep-vs-shallow buffers, 800G radix, and streaming telemetry are the AI-specific asks. |
| 5 | 🤝 RoCE v2 keeps the camps interoperable | Full-stack NVIDIA and merchant-silicon-plus-your-NOS both speak the same neutral protocol on the wire. |

When you want the config side of this story, see Switch QoS — PFC and the 5-vendor config matrix. For what's actually inside the box, see Switch Silicon.
