Vendor Stacks & Switch Silicon — A Procurement View

· 7 min read
Staff Network Engineer · RDMA & AI Fabric

TLDR: You can build the same logical AI fabric on five different vendor stacks — Spectrum-X, Tomahawk, Jericho3-AI + DDC, Cisco Silicon One G200, and Arista EOS. They are not interchangeable. This is the procurement view: what each vendor sells, who actually buys it, and the side-by-side comparison you'll do at RFP time.

1. NVIDIA Spectrum-X — the integrated stack

Spectrum-X is the only stack on this list where one vendor sells you both ends of the wire. The switch ASIC is Spectrum-4 (51.2 Tbps, 64×800G or 128×400G), and the NIC is ConnectX-7 (400G) or ConnectX-8 (800G). The two were co-designed: the switch tags packets, the NIC reacts, and the closed loop is what NVIDIA markets as "AI-tuned out of the box."

The headline features are real. Per-packet adaptive routing lives in the switch, so a single elephant flow gets sprayed across every spine link instead of pinning one. Congestion control terminates at the NIC — when the switch sees queue buildup, ConnectX reacts in hardware, not the kernel. Hardware INT marks every packet with its hop-by-hop dwell time so you can actually see where 200 µs of tail latency came from.
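To make the "sprayed instead of pinned" point concrete, here is a toy sketch that contrasts flow-hash ECMP with per-packet placement. It is not NVIDIA's algorithm: the least-loaded-link rule is a simple stand-in for adaptive routing, and the flow tuple, link count, and function names are all made up for illustration.

```python
import zlib
from collections import Counter

def ecmp_link(flow: tuple, n_links: int) -> int:
    """Static ECMP: hash the 5-tuple once, so every packet of a flow uses one link."""
    key = "|".join(map(str, flow)).encode()
    return zlib.crc32(key) % n_links

def place_packets(flows: dict, n_links: int, per_packet: bool) -> Counter:
    """Return packets carried per link. `flows` maps a 5-tuple to a packet count."""
    load = Counter({link: 0 for link in range(n_links)})
    for flow, packets in flows.items():
        for _ in range(packets):
            if per_packet:
                # Stand-in for adaptive routing: send each packet to the
                # currently least-loaded uplink (assumption, not a vendor algorithm).
                link = min(load, key=load.get)
            else:
                link = ecmp_link(flow, n_links)
            load[link] += 1
    return load

# One hypothetical elephant flow (RoCEv2 runs over UDP port 4791), 8 spine uplinks.
elephant = {("10.0.0.1", "10.0.1.1", 49152, 4791, "udp"): 8000}
ecmp = place_packets(elephant, n_links=8, per_packet=False)
spray = place_packets(elephant, n_links=8, per_packet=True)

print("ECMP  busiest link carries", max(ecmp.values()), "packets")
print("Spray busiest link carries", max(spray.values()), "packets")
```

With a single flow, ECMP leaves seven of the eight uplinks idle while per-packet placement splits the load evenly. Real spraying also forces the receiving NIC to handle out-of-order arrival, which this sketch ignores.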

Who buys it: NVIDIA DGX SuperPOD reference designs ship Spectrum-X by default. Hyperscalers running NVIDIA-only racks pick it because the full stack is supported as one SKU. Enterprises buy it because they want one throat to choke when the training run misbehaves.

Trade-off: highest list price on the page (often 2–3× a Tomahawk whitebox per port), hardest to second-source, and you are betting on NVIDIA's silicon roadmap for the life of the cluster. The out-of-the-box performance is the strongest on this list, and you pay for it.

2. Broadcom Tomahawk — the open Ethernet workhorse

Tomahawk is the silicon most hyperscaler AI fabrics actually run on. Broadcom does not sell switches; they sell chips to the whitebox builders (Edgecore, Celestica, UfiSpace) and to the brand-name OEMs (Arista, Dell, Juniper). Generations matter: Tomahawk 4 is 25.6 Tbps, Tomahawk 5 is 51.2 Tbps in a single chip (64×800G), and Tomahawk 6 is on the 102.4 Tbps roadmap.
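The Tbps figure matters at RFP time because it sets switch radix, and radix sets how many endpoints fit before you need another tier. A rough sketch of that arithmetic follows, assuming a non-blocking fabric built from identical switches with every port at the same speed and one port per GPU; the function names are illustrative.

```python
TOMAHAWK5_GBPS = 51_200  # 51.2 Tbps, as quoted above

def radix(asic_gbps: int, port_gbps: int) -> int:
    """Ports per switch when the whole ASIC is carved into equal-speed ports."""
    return asic_gbps // port_gbps

def two_tier_endpoints(k: int) -> int:
    """Non-blocking leaf-spine: each leaf uses k/2 ports down and k/2 up."""
    return k * k // 2

def three_tier_endpoints(k: int) -> int:
    """Non-blocking three-tier fat-tree built from radix-k switches."""
    return k ** 3 // 4

for speed in (800, 400):
    k = radix(TOMAHAWK5_GBPS, speed)
    print(f"{speed}G ports: radix {k}, "
          f"2-tier {two_tier_endpoints(k)}, 3-tier {three_tier_endpoints(k)}")
```

Under these assumptions a 51.2 Tbps chip tops out at 2,048 endpoints at 800G (or 8,192 at 400G) in two tiers, so anything in the tens of thousands of GPUs implies a third tier or oversubscription.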

The pitch is volume economics and openness. You get the highest port density per RU, standard Ethernet (no proprietary NIC required), ECMP plus dynamic load balancing, and a chip that ships in 10× more switches than any AI-specific silicon. The toolchain assumes you'll bring your own NOS (SONiC or FBOSS) or buy the silicon inside a box that ships with a commercial one such as Arista EOS.

Who buys it: Meta, Microsoft, AWS, Google — most hyperscaler AI fabrics are Tomahawk underneath, even when the front-of-rack badge says someone else. Tier-2 clouds and large enterprises buy it via Arista/Dell when they want the silicon without writing their own NOS.

Trade-off: more configuration work than Spectrum-X — adaptive routing, ECN/PFC tuning, and telemetry are not "on by default" the way they are on the NVIDIA stack. You trade integration polish for lowest $/port and the broadest vendor optionality on the market.
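Much of that "ECN tuning" comes down to picking three numbers per queue: a minimum threshold, a maximum threshold, and a maximum marking probability. The sketch below shows the RED-style marking curve commonly used with DCQCN. The threshold values in the example are placeholders, not recommendations for any particular ASIC.

```python
def ecn_mark_probability(queue_kb: float, kmin_kb: float,
                         kmax_kb: float, pmax: float) -> float:
    """RED-style ECN marking curve.

    Below kmin nothing is marked, above kmax everything is marked,
    and in between the probability rises linearly from 0 to pmax.
    """
    if queue_kb <= kmin_kb:
        return 0.0
    if queue_kb >= kmax_kb:
        return 1.0
    return pmax * (queue_kb - kmin_kb) / (kmax_kb - kmin_kb)

# Hypothetical thresholds, chosen only to show the shape of the curve.
KMIN_KB, KMAX_KB, PMAX = 200, 1000, 0.10
for depth in (100, 600, 1500):
    p = ecn_mark_probability(depth, KMIN_KB, KMAX_KB, PMAX)
    print(f"queue depth {depth} KB -> mark probability {p:.3f}")
```

Set the thresholds too low and senders back off before links are full; set them too high and PFC fires before ECN has a chance to slow anyone down. That interaction is the tuning work you take on with a BYO-NOS stack.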

3. Broadcom Jericho3-AI + Ramon3 — the scheduled fabric

This is the other Broadcom stack, and it is architecturally different from Tomahawk. Jericho3-AI is the leaf ASIC, Ramon3 is the fabric ASIC, and together they implement a credit-based scheduled fabric: a modular chassis disaggregated into rack-mounted boxes, which Broadcom markets as DDC (Distributed Disaggregated Chassis).

The features that matter: 8–16 GB of HBM per chip (vs ~80 MB of on-chip buffer on Tomahawk), virtual-output-queue (VOQ) scheduling so no ingress port can ever HOL-block another, and cell-based spraying across the fabric so every flow uses every link uniformly. The headline consequence: you do not need PFC. The fabric is lossless because it is scheduled, not because it backpressures.
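The head-of-line blocking claim is easier to see in a toy model. The sketch below drains one ingress port two ways: as a single FIFO, and as one virtual output queue per destination. It is a deliberately simplified illustration (one packet per tick, and a fixed tick at which a congested output frees up) and does not model Broadcom's credit protocol or cell spraying.

```python
from collections import deque

def drain_fifo(packets, blocked_until):
    """Single FIFO per ingress port: only the head packet is eligible each tick."""
    q = deque(packets)
    delivered_at = {}
    tick = 0
    while q:
        dest, pkt_id = q[0]
        if tick >= blocked_until.get(dest, 0):
            q.popleft()
            delivered_at[pkt_id] = tick
        tick += 1
    return delivered_at

def drain_voq(packets, blocked_until):
    """One queue per destination: any destination that is ready can be served."""
    voqs = {}
    for dest, pkt_id in packets:
        voqs.setdefault(dest, deque()).append(pkt_id)
    delivered_at = {}
    tick = 0
    while any(voqs.values()):
        for dest, q in voqs.items():
            if q and tick >= blocked_until.get(dest, 0):
                delivered_at[q.popleft()] = tick
                break  # this ingress port sends at most one packet per tick
        tick += 1
    return delivered_at

# Packet 0 targets congested output "A"; packets 1-3 target idle output "B".
packets = [("A", 0), ("B", 1), ("B", 2), ("B", 3)]
blocked_until = {"A": 10}  # hypothetical: output A accepts nothing before tick 10

fifo = drain_fifo(packets, blocked_until)
voq = drain_voq(packets, blocked_until)
print("FIFO: last B packet leaves at tick", fifo[3])
print("VOQ:  last B packet leaves at tick", voq[3])
```

In the FIFO case the three packets for the idle output wait behind the one packet for the congested output. With VOQs they leave immediately and only the traffic for the congested output waits.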

Who buys it: Meta's Mistral / DSF AI clusters, and hyperscalers building 32K+ GPU single-fabric pods where Tomahawk's ECMP starts to lose efficiency. Anyone whose primary pain is "I cannot tune PFC at this scale" considers Jericho3-AI.

Trade-off: scheduling adds latency. Expect an extra ~1–5 µs per hop vs a plain Ethernet switch, and a narrower ecosystem — you are committed to Broadcom DDC end-to-end, with fewer NOS choices than Tomahawk.

4. Cisco Silicon One G200 — the hybrid player

Cisco's answer is the Silicon One family. G200 is the AI-fabric flagship at 51.2 Tbps, and the architectural pitch is "one ASIC, both modes": the same chip can run ECMP-style routed Ethernet or a credit-scheduled (DDC-style) fabric, chosen per deployment. A P4-programmable pipeline, integrated 112G PAM4 SerDes, and INT-XD for hop-by-hop telemetry round out the feature set.

Who buys it: customers who want one silicon family across both their general DC and their AI back-end fabric, and for whom operational consistency matters more than squeezing the last 5% of $/port. Large enterprises with deep Cisco footprints buy it through Nexus 9000-series boxes; some service providers buy it via the 8000 series for the routing flexibility.

Trade-off: smaller AI deployment footprint than Spectrum-X or Tomahawk, which means the AI-tuning playbooks are younger. The silicon is competitive on paper; the operational mileage at 16K+ GPU scale is less public than the Broadcom or NVIDIA stories.

5. Arista EOS — the software layer

Arista doesn't make ASICs. They build switches around Broadcom Tomahawk (7060X, 7368X) and Broadcom Jericho (7280R, 7800R), and the differentiator is EOS — the network operating system, the CLI, the telemetry pipeline, and CloudVision for fleet management.

The features that matter to operators: a Linux-underneath EOS where every state object is queryable via eAPI (JSON-RPC) or OpenConfig, streaming telemetry as a first-class citizen, multi-agent NOS architecture so a routing crash doesn't take the box down, and a CLI that is genuinely best-in-class for debugging at 3 AM.
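For a sense of what "queryable via eAPI" looks like, here is a minimal sketch that builds an eAPI JSON-RPC request with only the Python standard library. The hostname is a placeholder, authentication and TLS verification are left out, and nothing is actually sent. Treat it as an outline of the request shape rather than a production client.

```python
import json
import urllib.request

def build_eapi_payload(cmds, fmt="json", request_id="1"):
    """JSON-RPC 2.0 body for eAPI's runCmds method."""
    return {
        "jsonrpc": "2.0",
        "method": "runCmds",
        "params": {"version": 1, "cmds": list(cmds), "format": fmt},
        "id": request_id,
    }

def build_eapi_request(host, cmds):
    """Prepare (but do not send) an HTTPS POST to the switch's command-api endpoint."""
    body = json.dumps(build_eapi_payload(cmds)).encode()
    return urllib.request.Request(
        url=f"https://{host}/command-api",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# "leaf1.example.net" is a hypothetical hostname.
req = build_eapi_request("leaf1.example.net", ["show version"])
print(req.get_method(), req.full_url)
print(json.dumps(json.loads(req.data), indent=2))
```

Sending it would be a call to `urllib.request.urlopen(req)` once credentials are added. The response comes back as a JSON object with one result per command, which is what makes the CLI scriptable without screen-scraping.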

Who buys it: most enterprise and tier-2 cloud AI fabrics. If your team already runs Arista in the front-end DC, you'll run Arista in the back-end fabric too. They are the "Cisco of AI fabrics" for operations — you don't buy them for unique hardware, you buy them for the software and the support contract.

Trade-off: you pay a meaningful software premium over the same Tomahawk silicon in a whitebox running SONiC. The math is: how much is your NetOps team's time worth?
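That question can be turned into a rough break-even calculation. Every number below is a placeholder to be replaced with your own quote and your own loaded engineering cost; the point is the shape of the math, not the figures.

```python
def break_even_hours(premium_per_switch: float, switches: int,
                     loaded_hourly_cost: float) -> float:
    """Engineer-hours the commercial NOS must save to pay for its premium."""
    return premium_per_switch * switches / loaded_hourly_cost

# Hypothetical inputs: replace with real quotes and real salary data.
PREMIUM_PER_SWITCH = 10_000   # extra cost vs whitebox + SONiC, per switch
SWITCHES = 200
LOADED_HOURLY_COST = 150      # fully loaded NetOps cost per hour
LIFETIME_YEARS = 4

total_hours = break_even_hours(PREMIUM_PER_SWITCH, SWITCHES, LOADED_HOURLY_COST)
print(f"Break-even: {total_hours:,.0f} engineer-hours over the fabric's life")
print(f"That is {total_hours / LIFETIME_YEARS:,.0f} hours per year")
```

If your team would realistically spend more than that building and maintaining its own NOS tooling, the premium pays for itself; if not, the whitebox wins on cost.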

6. Side-by-side

| Vendor stack | Scale sweet spot | LB approach | Telemetry | Hyperscaler footprint | $/port (relative) |
|---|---|---|---|---|---|
| NVIDIA Spectrum-X | 1K–16K GPUs, NVIDIA-only | Per-packet adaptive (switch) + NIC CC | Hardware INT, AI-tuned | Low (NVIDIA-aligned shops) | High |
| Broadcom Tomahawk 5 | 1K–32K GPUs, BYO-NOS | ECMP + DLB | In-band telemetry, BYO collector | Very High (Meta, MSFT, AWS, GOOG) | Low |
| Broadcom Jericho3-AI + DDC | 16K–100K+ GPUs, single fabric | Credit-scheduled, cell spray, no PFC needed | VOQ-aware, deep counters | High (Meta Mistral) | Medium-High |
| Cisco Silicon One G200 | 1K–16K GPUs, Cisco shops | Hybrid (ECMP or scheduled) | INT-XD, P4-programmable | Low–Medium | Medium |
| Arista EOS (on Tomahawk/Jericho) | 1K–16K GPUs, enterprise + tier-2 | Whatever the underlying ASIC supports | Streaming telemetry, CloudVision | Medium | Medium-High |
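One way to use this comparison at RFP time is as a first-pass filter: drop the stacks whose sweet spot does not cover your target GPU count, then order what remains by relative $/port. The sketch below encodes only the scale and cost columns. Reading "1K" as 1,000 and treating "100K+" as open-ended are assumptions, and the cost ranking is ordinal, not a price.

```python
# (name, min GPUs, max GPUs or None for open-ended, relative $/port)
STACKS = [
    ("NVIDIA Spectrum-X", 1_000, 16_000, "High"),
    ("Broadcom Tomahawk 5", 1_000, 32_000, "Low"),
    ("Broadcom Jericho3-AI + DDC", 16_000, None, "Medium-High"),
    ("Cisco Silicon One G200", 1_000, 16_000, "Medium"),
    ("Arista EOS (on Tomahawk/Jericho)", 1_000, 16_000, "Medium-High"),
]

# Ordinal ranking of the relative $/port labels (lower is cheaper).
COST_RANK = {"Low": 0, "Medium": 1, "Medium-High": 2, "High": 3}

def shortlist(gpus: int) -> list:
    """Stacks whose sweet spot covers `gpus`, cheapest relative $/port first."""
    fits = [s for s in STACKS
            if s[1] <= gpus and (s[2] is None or gpus <= s[2])]
    return [name for name, _, _, _ in sorted(fits, key=lambda s: COST_RANK[s[3]])]

for target in (4_000, 24_000):
    print(target, "GPUs ->", shortlist(target))
```

A filter like this only narrows the field. Telemetry, load balancing, and operational fit still need a human read of the other columns.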

The anti-pattern: single-vendor lock-in across the back-end

Betting your entire back-end fabric on one vendor's roadmap is a 24-month risk. The AI silicon market is consolidating, and chip, NOS, and NIC roadmaps all slip. A multi-plane design lets you mix vendors per plane: rail-plane A on Spectrum-X, rail-plane B on Tomahawk, storage plane on Arista/Jericho. Each plane gets its own blast radius and its own procurement lever, while the logical fabric stays the same. Use it.
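A quick way to sanity-check a plane layout is to compute what share of planes depends on any one stack. The sketch below does that for a single-stack design and for the mixed example above; the plane names and the helper function are illustrative.

```python
from collections import Counter

def stack_exposure(plane_to_stack: dict) -> dict:
    """Share of planes that depend on each stack."""
    counts = Counter(plane_to_stack.values())
    total = len(plane_to_stack)
    return {stack: n / total for stack, n in counts.items()}

single = {
    "rail-plane A": "Spectrum-X",
    "rail-plane B": "Spectrum-X",
    "storage": "Spectrum-X",
}
mixed = {
    "rail-plane A": "Spectrum-X",
    "rail-plane B": "Tomahawk",
    "storage": "Arista/Jericho",
}

for label, plan in (("single-stack", single), ("mixed", mixed)):
    worst = max(stack_exposure(plan).values())
    print(f"{label}: worst-case share of planes on one stack = {worst:.0%}")
```

Note that counting by stack flatters the mixed design slightly: Tomahawk and Jericho are both Broadcom silicon, so grouping by silicon vendor instead would put two of the three planes on one roadmap.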

What to remember

| Concept | One-liner |
|---|---|
| 🥇 Spectrum-X is integrated | Highest performance out of the box, highest price, hardest to second-source. |
| 🏗️ Tomahawk runs the hyperscalers | 51.2 Tbps per chip, lowest $/port, most NOS options, but you tune it. |
| 📦 Jericho3-AI + DDC = scheduled fabric | Deep buffers, no PFC, huge pods; pay ~1–5 µs per hop and lock to Broadcom. |
| 🔄 Cisco G200 is the hybrid bet | One silicon for DC + AI; younger AI playbooks; good if you're already Cisco. |
| 💻 Arista is software, not silicon | Best operator experience on someone else's chips; you pay an EOS premium. |
| ⚠️ Don't pick one stack for everything | Multi-plane = multi-vendor = lower procurement and roadmap risk. |