
NICs & DPUs

Every rail you wired in 3.3 Rail-Optimized Design terminates somewhere — and that somewhere is an RDMA NIC bolted to a GPU host. This is the host edge of the fabric: the last piece of silicon a packet touches before it lands in GPU memory, and the first piece that decides how fast it leaves. Get the NIC wrong and the whole fabric you designed upstream runs at the speed of its slowest endpoint.

The NIC market for AI fabrics is small and the players are well-known. This page catalogs them vendor by vendor, then explains what the NIC actually does inside an AI fabric, where a DPU differs from a plain NIC, and what the Ultra Ethernet wave changes.

After this page, you'll be able to:
  1. Name the field — match each AI-fabric NIC and DPU to its vendor, speed, driver, and transport.
  2. Explain the NIC's job — kernel-bypass RDMA, GPUDirect, DCQCN rate control, SR-IOV slicing, and ECMP entropy.
  3. Tell a NIC from a DPU — know when the extra Arm cores and programmable datapath earn their cost.
  4. Place the UEC wave — see which NICs carry UET and how it relates to the RoCE v2 you already run.

1. The NICs and DPUs

This is the host-edge silicon you'll actually see in an AI cluster today. Equal billing — every one of these ships in production fabrics somewhere.

| NIC / DPU | Vendor | Speed | Driver | Transport / notes |
| --- | --- | --- | --- | --- |
| ConnectX-7 / ConnectX-8 | NVIDIA / Mellanox | 400G / 800G | mlx5 | RoCE v2. The de-facto AI NIC: NCCL-native, the one most reference designs assume. |
| BlueField-3 | NVIDIA | 400G DPU | mlx5 + Arm | DPU: Arm cores + programmable datapath, runs DOCA. Offloads isolation, storage, and security from the host. |
| Thor / Thor2 | Broadcom | 400G / 800G | bnxt_re | RoCE v2. The open-ecosystem alternative; pairs with Broadcom's own switch silicon. |
| E810 | Intel | 100–200G | irdma | RoCE v2 plus iWARP. The broad-deployment Ethernet NIC, lower speed tier. |
| EFA | AWS | 100–400G | libfabric | Custom SRD transport, not standard RoCE verbs. Cloud-only: you rent it, you don't rack it. |
| Pollara 400 | AMD | 400G | (not listed) | UEC-ready AI NIC (Pensando-based). Pensando also ships as a programmable DPU. |

A few things to read off this table:

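  • Driver matters operationally. The driver name (mlx5, bnxt_re, irdma) is what your host RDMA stack loads: it's the line in lsmod/ibv_devinfo that tells you which silicon you're on (mlx5 shows up in lsmod as mlx5_core and mlx5_ib). The EFA entry is the exception: libfabric is the user-space library you program against, while the kernel module you'll see in lsmod is efa.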
  • EFA is the odd one out. AWS's Elastic Fabric Adapter doesn't speak standard RoCE verbs at all — it runs a custom SRD (Scalable Reliable Datagram) transport under libfabric. You program to it through the same collective libraries, but the wire protocol is AWS's own and you only ever meet it inside AWS.
  • Two of these have a DPU side. BlueField-3 is a DPU outright, and the Pensando silicon behind Pollara also ships as a programmable DPU. Both add an onboard CPU, covered in section 3.
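As a rough illustration of the "driver tells you the silicon" point, here is a minimal Python sketch that reads the loaded-module list and maps it back to the families in the table. The module names (mlx5_ib, bnxt_re, irdma, efa) are the usual Linux kernel module names for these families but are assumptions about your host; the function names and the mapping dictionary are hypothetical helpers, not part of any vendor tool.

```python
# Map RDMA kernel modules to the NIC families in the table above.
# Assumption: a Linux host where these drivers load as kernel modules
# under these names; adjust the mapping for your distribution.
RDMA_MODULES = {
    "mlx5_ib": "NVIDIA ConnectX / BlueField (mlx5)",
    "bnxt_re": "Broadcom Thor (bnxt_re)",
    "irdma": "Intel E810 (irdma)",
    "efa": "AWS EFA (efa module, programmed via libfabric)",
}


def detect_rdma_silicon(proc_modules_text):
    """Return the NIC families whose RDMA module appears in /proc/modules text."""
    # Each /proc/modules line starts with the module name, e.g.
    # "mlx5_ib 450560 0 - Live 0x0000000000000000".
    loaded = {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }
    return sorted(family for mod, family in RDMA_MODULES.items() if mod in loaded)


def read_proc_modules(path="/proc/modules"):
    """Read the kernel's loaded-module list (the same data lsmod prints)."""
    with open(path) as f:
        return f.read()
```

On a real host you would call `detect_rdma_silicon(read_proc_modules())`; an empty list just means none of the four modules above is loaded.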

2. What a NIC does in an AI fabric

A regular server NIC moves bytes off the wire into kernel memory and lets the OS sort it out. An AI-fabric RDMA NIC does far more, and most of it is the OS getting out of the way.

  • Kernel-bypass RDMA. The NIC writes directly into application memory with no syscall, no copy, no kernel on the data path. This is the whole reason RDMA exists — see 3.5 RDMA for the verbs and queue-pair mechanics. CPU stays out of the hot loop.
  • GPUDirect. The NIC DMAs straight from GPU HBM to the wire and back, skipping the host CPU and host DRAM entirely. The GPU's memory is the send buffer. Without this, every byte would bounce through main memory and you'd lose the bandwidth you paid for.
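  • DCQCN Reaction Point. When the fabric marks congestion, the sending NIC is the thing that slows down. The receiving NIC acts as the Notification Point (NP): it sees ECN-marked packets and returns Congestion Notification Packets (CNPs) to the sender. The sending NIC is the Reaction Point (RP) in the DCQCN control loop: it receives those CNPs, runs the rate-decrease/rate-increase math, and throttles its own send rate. The switch-side half lives in ECN & DCQCN; the NIC is where the reaction actually happens.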
  • SR-IOV slicing. One physical NIC presents many Virtual Functions (VFs), so a single 400G port can be carved across multiple pods or tenants, each VF looking like its own NIC to its consumer. This is how you slice the host edge without buying a NIC per pod.
  • ECMP entropy source. The NIC sets the UDP source port on every RoCE v2 flow. Because the destination port is fixed at 4791, that src-port is the one field in the 5-tuple that varies between flows of the same host pair, so it is the field the fabric's ECMP hash effectively keys on. The NIC supplies the entropy that spreads flows across the spine; get its src-port policy wrong and your load balancing collapses upstream. (The full load-balancing story is in chapter 4.)
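To make the Reaction Point job from the list above concrete, here is a simplified Python sketch of the DCQCN sender-side rate math: cut the rate when a Congestion Notification Packet (CNP, the receiver's signal that it saw ECN marks) arrives, then recover toward the pre-cut target while the path stays quiet. It follows the published DCQCN update rules in spirit, but it is not any NIC's firmware. It collapses the separate alpha and rate-increase timers into one, omits the byte counter and hyper-increase stage, and the class name, the 5 Gb/s additive step, and the five-step fast-recovery window are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class DcqcnReactionPoint:
    """Simplified DCQCN sender-side rate state (illustrative, not firmware)."""

    line_rate_gbps: float
    g: float = 1 / 256             # gain used to update alpha
    rate_ai_gbps: float = 5.0      # hypothetical additive-increase step
    fast_recovery_steps: int = 5   # hypothetical fast-recovery window

    def __post_init__(self):
        self.current = self.line_rate_gbps   # rate the NIC sends at now
        self.target = self.line_rate_gbps    # rate it tries to climb back to
        self.alpha = 1.0                     # congestion estimate in [0, 1]
        self.quiet_steps = 0                 # timer ticks since the last CNP

    def on_cnp(self):
        """A CNP arrived: remember the old rate as target, then cut."""
        self.target = self.current
        self.current *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g
        self.quiet_steps = 0

    def on_quiet_timer(self):
        """A timer period passed with no CNP: decay alpha and recover rate."""
        self.alpha *= 1 - self.g
        self.quiet_steps += 1
        if self.quiet_steps > self.fast_recovery_steps:
            # Past fast recovery: probe for more bandwidth additively.
            self.target = min(self.target + self.rate_ai_gbps, self.line_rate_gbps)
        # Move halfway from the current rate toward the target.
        self.current = min((self.target + self.current) / 2, self.line_rate_gbps)
```

With a 400G port, one CNP halves the rate to 200 and each quiet tick then closes half the gap back toward 400 (300, 350, and so on), which is the sawtooth you see on a congested RoCE v2 sender.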

Read those five together and the pattern is clear: in an AI fabric the NIC is not a dumb port — it's a rate controller, a DMA engine, a slicing layer, and the entropy source the whole fabric balances on.
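The entropy point is easy to see in a toy model. The Python sketch below hashes a RoCE v2 5-tuple and picks an uplink, then compares a NIC that pins one UDP source port against one that varies it per flow. The hash is a stand-in: real switches use vendor-specific hash functions and field selections, so `zlib.crc32`, the function names, and the example IP addresses are all assumptions for illustration. Only the destination port 4791 (the IANA-assigned RoCE v2 port) and protocol 17 (UDP) are fixed facts.

```python
import zlib

ROCE_V2_DST_PORT = 4791  # IANA-assigned UDP destination port for RoCE v2
UDP_PROTO = 17


def ecmp_uplink(src_ip, dst_ip, src_port, n_uplinks):
    """Pick an uplink from the 5-tuple (toy hash, not a real switch ASIC's)."""
    key = f"{src_ip}|{dst_ip}|{UDP_PROTO}|{src_port}|{ROCE_V2_DST_PORT}".encode()
    return zlib.crc32(key) % n_uplinks


def uplinks_used(src_ports, n_uplinks=8):
    """Return the set of uplinks hit by flows between one fixed host pair."""
    # Hypothetical host pair: everything except the source port is identical.
    return {ecmp_uplink("10.0.0.1", "10.0.1.1", p, n_uplinks) for p in src_ports}


# Same src-port on every flow: all traffic hashes to a single uplink.
pinned = uplinks_used([49152] * 64)

# Per-flow src-ports: flows spread across several uplinks.
varied = uplinks_used(range(49152, 49152 + 64))
```

`pinned` always contains exactly one uplink because the hash input never changes; `varied` spreads across the uplinks, which is the entire effect the NIC's src-port policy has on upstream load balancing.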


3. NIC vs DPU

Figure: NIC versus DPU. Left, a NIC (ConnectX-7, Thor, E810): the host CPU runs the app plus isolation, storage, and security and burns host cores, while the NIC below provides the RDMA engine, queue pairs, and SR-IOV VFs out to RoCE v2. Right, a DPU (BlueField-3, Pensando): the host CPU just runs the app with cores freed, and the DPU below has the same NIC datapath plus onboard Arm cores and a programmable P4 datapath running isolation, storage, security, and telemetry on the card itself.
A NIC moves packets; the host CPU runs the services. A DPU adds onboard Arm cores that run those services on the card — freeing host cores and isolating tenants.

The line is simple.

A plain NIC moves packets. It does the five jobs above, fast, in fixed-function silicon — and that's its whole world. ConnectX-7/8, Thor/Thor2, and E810 are NICs.

A DPU is a NIC plus an onboard Arm CPU and a programmable datapath, so it can run software at the host edge and offload work off the host:

  • Tenant isolation — enforce per-tenant policy on the card, not in the host kernel.
  • Storage — terminate storage protocols on the DPU so the host CPU never sees them.
  • Security — run firewalling, encryption, and policy on the card, isolated from a potentially compromised host.
  • Telemetry — generate and export fabric telemetry from the datapath itself.

BlueField-3 (running NVIDIA's DOCA stack) and Pensando (AMD) are DPUs. The trade is straightforward: a DPU costs more and adds an OS you have to operate, but it claws back host CPU cycles and moves the trust boundary off the host. In a single-tenant training pod you often don't need one; in a multi-tenant or security-sensitive fabric, the offload is the point.


4. The UEC next wave

RoCE v2 is the transport almost every NIC in the table above speaks today — but it isn't the end of the line.

The Ultra Ethernet Consortium (UEC) is standardizing a new transport, UET (Ultra Ethernet Transport), purpose-built for AI and HPC collectives. The next generation of NICs — ConnectX-8, Thor2 / Thor3, and AMD Pollara — are UEC-ready and carry UET as the next-gen alternative to (and successor of) RoCE v2. You'll meet UET properly on the transport pages; here the only thing to anchor is that the NIC is where it lands. The same silicon you rack for RoCE v2 today is, increasingly, the silicon that will speak UET tomorrow.


💡 What you should remember

🃏 Small field, known players: NVIDIA ConnectX/BlueField, Broadcom Thor, Intel E810, AWS EFA, AMD Pollara. That's effectively the whole AI-fabric NIC market.
🚀 The NIC does five jobs: Kernel-bypass RDMA, GPUDirect from HBM, DCQCN Reaction Point, SR-IOV VFs, and the UDP src-port entropy ECMP hashes on.
🧠 DPU = NIC + Arm CPU + programmable datapath: Offloads isolation, storage, security, and telemetry off the host. BlueField and Pensando are DPUs; ConnectX/Thor/E810 are NICs.
☁️ EFA breaks the RoCE mold: AWS's EFA runs a custom SRD transport under libfabric, not standard RoCE verbs, and only exists inside AWS.
🌊 UEC is the next wave: ConnectX-8, Thor2/Thor3, and AMD Pollara carry UET as the next-gen successor to RoCE v2.

Next: Cluster Sizing & Cabling: port counts, optics, cable lengths, and the BOM math that turns the design you've learned into a build plan.

For the rest of the host edge: how the NIC reaches the pod is in Host Networking, the RDMA it runs is in RDMA, its congestion reaction is in ECN & DCQCN, and the collective libraries that sit on top are in NCCL / RCCL / oneCCL.