NICs & DPUs
Every rail you wired in 3.3 Rail-Optimized Design terminates somewhere — and that somewhere is an RDMA NIC bolted to a GPU host. This is the host edge of the fabric: the last piece of silicon a packet touches before it lands in GPU memory, and the first piece that decides how fast it leaves. Get the NIC wrong and the whole fabric you designed upstream runs at the speed of its slowest endpoint.
The NIC market for AI fabrics is small and the players are well-known. This page catalogs them vendor by vendor, then explains what the NIC actually does inside an AI fabric, where a DPU differs from a plain NIC, and what the Ultra Ethernet wave changes.
- Name the field — match each AI-fabric NIC and DPU to its vendor, speed, driver, and transport.
- Explain the NIC's job — kernel-bypass RDMA, GPUDirect, DCQCN rate control, SR-IOV slicing, and ECMP entropy.
- Tell a NIC from a DPU — know when the extra Arm cores and programmable datapath earn their cost.
- Place the UEC wave — see which NICs carry UET and how it relates to the RoCE v2 you already run.
1. The NICs and DPUs
This is the host-edge silicon you'll actually see in an AI cluster today. Equal billing — every one of these ships in production fabrics somewhere.
| NIC / DPU | Vendor | Speed | Driver | Transport / notes |
|---|---|---|---|---|
| ConnectX-7 / ConnectX-8 | NVIDIA / Mellanox | 400G / 800G | mlx5 | RoCE v2. The de-facto AI NIC — NCCL-native, the one most reference designs assume. |
| BlueField-3 | NVIDIA | 400G DPU | mlx5 + Arm | DPU: Arm cores + programmable datapath, runs DOCA. Offloads isolation, storage, and security from the host. |
| Thor / Thor2 | Broadcom | 400G / 800G | bnxt_re | RoCE v2. The open-ecosystem alternative — pairs with Broadcom's own switch silicon. |
| E810 | Intel | 100–200G | irdma | RoCE v2 plus iWARP. The broad-deployment Ethernet NIC, lower speed tier. |
| EFA | AWS | 100–400G | efa (via libfabric) | Custom SRD transport, not standard RoCE verbs. Cloud-only: you rent it, you don't rack it. |
| Pollara 400 | AMD | 400G | — | UEC-ready AI NIC (Pensando-based). Pensando also ships as a programmable DPU. |
A few things to read off this table:
- Driver matters operationally. The driver name (`mlx5`, `bnxt_re`, `irdma`) is what your host RDMA stack loads: it's the line in `lsmod` / `ibv_devinfo` that tells you which silicon you're on. EFA is the exception, because `libfabric` is a user-space library and the kernel module underneath it is `efa`.
- EFA is the odd one out. AWS's Elastic Fabric Adapter doesn't speak standard RoCE verbs at all. It runs a custom SRD (Scalable Reliable Datagram) transport under `libfabric`. You program to it through the same collective libraries, but the wire protocol is AWS's own and you only ever meet it inside AWS.
- Two of these are DPUs, not NICs. BlueField-3 and the Pensando variant of Pollara add an onboard CPU, covered in section 3.
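As a rough illustration of reading the driver off a host, the sketch below scans `lsmod`-style text for the kernel module names behind the table. The mapping, the helper name `rdma_families`, and the sample output are assumptions made for this example, not an exhaustive or authoritative list; for EFA the kernel module is `efa`, with libfabric sitting above it in user space.

```python
# Illustrative mapping from kernel module names to the NIC families in the
# table above. The dictionary and helper name are assumptions for this sketch.
RDMA_MODULES = {
    "mlx5_core": "NVIDIA ConnectX / BlueField (mlx5)",
    "mlx5_ib": "NVIDIA ConnectX / BlueField (mlx5)",
    "bnxt_re": "Broadcom Thor (bnxt_re)",
    "irdma": "Intel E810 (irdma)",
    "efa": "AWS EFA (efa kernel module under libfabric)",
}


def rdma_families(lsmod_text: str) -> list[str]:
    """Return the NIC families whose modules appear in `lsmod` output."""
    found = []
    for line in lsmod_text.splitlines()[1:]:  # skip the column header line
        fields = line.split()
        if not fields:
            continue
        family = RDMA_MODULES.get(fields[0])  # first column is the module name
        if family and family not in found:
            found.append(family)
    return found


# Made-up sample of what `lsmod` might print on a ConnectX host.
SAMPLE = """Module                  Size  Used by
mlx5_ib               450560  0
ib_uverbs             163840  1 mlx5_ib
mlx5_core            2121728  1 mlx5_ib
nvidia              56770560  12
"""

print(rdma_families(SAMPLE))
```

On a real host you would feed it the output of `lsmod` itself, for example via `subprocess.run(["lsmod"], capture_output=True, text=True).stdout`.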
2. What a NIC does in an AI fabric
A regular server NIC moves bytes off the wire into kernel memory and lets the OS sort it out. An AI-fabric RDMA NIC does far more, and most of it is the OS getting out of the way.
- Kernel-bypass RDMA. The NIC writes directly into application memory with no syscall, no copy, no kernel on the data path. This is the whole reason RDMA exists — see 3.5 RDMA for the verbs and queue-pair mechanics. CPU stays out of the hot loop.
- GPUDirect. The NIC DMAs straight from GPU HBM to the wire and back, skipping the host CPU and host DRAM entirely. The GPU's memory is the send buffer. Without this, every byte would bounce through main memory and you'd lose the bandwidth you paid for.
- DCQCN Reaction Point. When the fabric marks congestion, the NIC is the thing that slows down. The sending NIC is the Reaction Point (RP) in the DCQCN control loop: the receiving NIC turns ECN marks into Congestion Notification Packets (CNPs), and the RP reacts to each CNP by running the rate-decrease/rate-increase math and throttling its own send rate. The switch-side half lives in ECN & DCQCN; the NIC is where the reaction actually happens.
- SR-IOV slicing. One physical NIC presents many Virtual Functions (VFs), so a single 400G port can be carved across multiple pods or tenants, each VF looking like its own NIC to its consumer. This is how you slice the host edge without buying a NIC per pod.
- ECMP entropy source. The NIC sets the UDP source port on every RoCE v2 flow. Because the RoCE v2 destination port is fixed at 4791, that src-port is the one field in the 5-tuple that differs between flows of the same host pair, so it is the field the fabric's ECMP hash actually spreads on. The NIC supplies the entropy that spreads flows across the spine; get its src-port policy wrong and your load balancing collapses upstream. (The full load-balancing story is in chapter 4.)
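To make the entropy point concrete, here is a minimal sketch of how a source-port policy changes ECMP spread. It assumes a CRC32 over the 5-tuple as a stand-in for the switch hash (real switch hashes are vendor-specific), and the IP addresses, port range, uplink count, and function names are all hypothetical.

```python
import zlib
from collections import Counter

ROCE_V2_DST_PORT = 4791  # UDP destination port used by RoCE v2


def ecmp_uplink(src_ip, dst_ip, src_port, n_uplinks):
    """Pick an uplink from a CRC32 of the 5-tuple (stand-in for a switch hash)."""
    key = f"{src_ip}|{dst_ip}|17|{src_port}|{ROCE_V2_DST_PORT}".encode()
    return zlib.crc32(key) % n_uplinks


def spread(src_ports, n_uplinks=8):
    """Count how many flows between one host pair land on each uplink."""
    return Counter(
        ecmp_uplink("10.0.0.1", "10.0.1.1", port, n_uplinks) for port in src_ports
    )


# Policy A: every queue pair reuses the same source port.
fixed = spread([49152] * 64)
# Policy B: each queue pair gets its own source port.
varied = spread(range(49152, 49152 + 64))

print("uplinks used with a fixed src-port: ", len(fixed))
print("uplinks used with varied src-ports:", len(varied))
```

With a fixed source port, every flow between the two hosts hashes to the same key and therefore the same uplink, however many uplinks exist. Varying the port is what gives the hash something to spread.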
Read those five together and the pattern is clear: in an AI fabric the NIC is not a dumb port — it's a rate controller, a DMA engine, a slicing layer, and the entropy source the whole fabric balances on.
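The rate-controller role can be sketched as a small state machine. This is a simplified model of the published DCQCN reaction-point rules (rate cut on a congestion notification, fast recovery, then additive increase), not any vendor's firmware: hyper-increase and the byte counter are omitted, the class and method names are hypothetical, and the parameter values are illustrative.

```python
from dataclasses import dataclass


@dataclass
class ReactionPoint:
    line_rate: float              # Gb/s, hard ceiling for the send rate
    g: float = 1 / 256            # gain for the congestion estimate alpha
    rate_ai: float = 5.0          # Gb/s additive-increase step (illustrative)
    fast_recovery_steps: int = 5  # increase events before additive increase
    alpha: float = 1.0            # congestion estimate in [0, 1]
    current: float = 0.0          # current send rate
    target: float = 0.0           # rate to recover toward
    steps: int = 0                # increase events since the last notification

    def __post_init__(self):
        # Start at line rate, as an uncongested sender would.
        self.current = self.target = self.line_rate

    def on_cnp(self):
        """Congestion notification: remember the old rate, cut, raise alpha."""
        self.target = self.current
        self.current *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g
        self.steps = 0

    def on_quiet_period(self):
        """No notification during the alpha timer window: decay alpha."""
        self.alpha *= 1 - self.g

    def on_increase_event(self):
        """Increase timer fired without a notification: climb back up."""
        self.steps += 1
        if self.steps > self.fast_recovery_steps:
            # Past fast recovery: probe for more bandwidth additively.
            self.target = min(self.target + self.rate_ai, self.line_rate)
        self.current = min((self.target + self.current) / 2, self.line_rate)


rp = ReactionPoint(line_rate=400.0)
rp.on_cnp()              # first notification with alpha = 1 halves the rate
print(rp.current)        # 200.0
rp.on_increase_event()   # fast recovery: halfway back to the 400 target
print(rp.current)        # 300.0
```

The asymmetry is the design point: one notification cuts the rate sharply, while recovery takes several quiet intervals, which is why a NIC that reacts late or is mistuned can hold a flow well below line rate.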
3. NIC vs DPU
The line is simple.
A plain NIC moves packets. It does the five jobs above, fast, in fixed-function silicon — and that's its whole world. ConnectX-7/8, Thor/Thor2, and E810 are NICs.
A DPU is a NIC plus an onboard Arm CPU and a programmable datapath, so it can run software at the host edge and offload work off the host:
- Tenant isolation — enforce per-tenant policy on the card, not in the host kernel.
- Storage — terminate storage protocols on the DPU so the host CPU never sees them.
- Security — run firewalling, encryption, and policy on the card, isolated from a potentially compromised host.
- Telemetry — generate and export fabric telemetry from the datapath itself.
BlueField-3 (running NVIDIA's DOCA stack) and Pensando (AMD) are DPUs. The trade is straightforward: a DPU costs more and adds an OS you have to operate, but it claws back host CPU cycles and moves the trust boundary off the host. In a single-tenant training pod you often don't need one; in a multi-tenant or security-sensitive fabric, the offload is the point.
4. The UEC next wave
RoCE v2 is the transport almost every NIC in the table above speaks today — but it isn't the end of the line.
The Ultra Ethernet Consortium (UEC) is standardizing a new transport, UET (Ultra Ethernet Transport), purpose-built for AI and HPC collectives. The next generation of NICs (ConnectX-8, Thor2 / Thor3, and AMD Pollara) is UEC-ready and carries UET as the next-gen alternative to, and successor of, RoCE v2. You'll meet UET properly on the transport pages; here the only thing to anchor is that the NIC is where it lands. The same silicon you rack for RoCE v2 today is, increasingly, the silicon that will speak UET tomorrow.
💡 What you should remember
| | Takeaway | Detail |
|---|---|---|
| 🃏 | Small field, known players | NVIDIA ConnectX/BlueField, Broadcom Thor, Intel E810, AWS EFA, AMD Pollara — that's effectively the whole AI-fabric NIC market. |
| 🚀 | The NIC does five jobs | Kernel-bypass RDMA, GPUDirect from HBM, DCQCN Reaction Point, SR-IOV VFs, and the UDP src-port entropy ECMP hashes on. |
| 🧠 | DPU = NIC + Arm CPU + programmable datapath | Offloads isolation, storage, security, and telemetry off the host. BlueField and Pensando are DPUs; ConnectX/Thor/E810 are NICs. |
| ☁️ | EFA breaks the RoCE mold | AWS's EFA runs custom SRD over libfabric, not standard RoCE verbs — and only exists inside AWS. |
| 🌊 | UEC is the next wave | ConnectX-8, Thor2/Thor3, and AMD Pollara carry UET as the next-gen successor to RoCE v2. |
Next: Cluster Sizing & Cabling → — port counts, optics, cable lengths, and the BOM math that turns the design you've learned into a build plan.
For the rest of the host edge: how the NIC reaches the pod is in Host Networking, the RDMA it runs is in RDMA, its congestion reaction is in ECN & DCQCN, and the collective libraries that sit on top are in NCCL / RCCL / oneCCL.