Understanding RDMA

Remote Direct Memory Access (RDMA) moves data between machines at wire speed, keeping the OS kernel off the data path and leaving the CPU almost idle during the transfer. If you know TCP, you're 80% of the way there. This page builds the mental model.

TCP vs RDMA, at a glance

Aspect | TCP / sockets | RDMA
Data path | App → kernel → NIC | App → NIC directly
CPU cost | High: every packet involves the OS | Near-zero during transfer
Memory copies | 2–3 per send | Zero-copy (DMA only)
Typical latency | 50–200 µs | 1–5 µs
Who handles ACKs | OS / TCP stack | NIC hardware (RNIC)

Why TCP has overhead

TCP was designed when CPU time was cheap and network speeds were slow. Every send() goes through a long chain, and the OS touches every byte at least twice.

[Figure: TCP data flow for one send(). User space: the application calls send(buf, len), which is a syscall. Kernel space: socket buffer (memcpy ①), TCP/IP stack (sequence numbers, checksums, headers; memcpy ②), NIC driver (kernel module); the CPU is busy with context switches, interrupts, and scheduler overhead. Hardware: DMA to the NIC, which puts bits on the wire. The ACK comes back through the same long path.]

Each call costs a system call (a kernel mode switch), 2–3 memory copies, and multiple CPU interrupts on receive. At 10 Gbps and above, this stack is the bottleneck — not the wire.
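A rough per-packet time budget makes the "stack, not wire" point concrete. The sketch below assumes full-size 1500-byte packets and ignores framing overhead and batching; those are simplifying assumptions for illustration, not figures from this page.

```python
def per_packet_budget_ns(link_gbps, packet_bytes=1500):
    """Time available to handle one packet at line rate, in nanoseconds.

    One Gbps is one bit per nanosecond, so bits / Gbps gives nanoseconds.
    """
    return packet_bytes * 8 / link_gbps


for gbps in (1, 10, 100):
    budget = per_packet_budget_ns(gbps)
    print(f"{gbps:>3} Gbps: {budget:8.1f} ns per 1500-byte packet")
```

At 10 Gbps the budget is on the order of a microsecond per packet, which per-packet syscalls, copies, and interrupts can plausibly consume on their own.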

Kernel bypass — RDMA's shortcut

RDMA cuts the OS out of the data path entirely. The application posts a Work Request straight to the NIC from user space. No system call. No kernel. No copies.

[Figure: TCP path (8 hops). Application → syscall → socket layer → TCP/IP stack → NIC driver (kernel) → NIC hardware. The CPU is involved at every step.]
[Figure: RDMA path (2 hops). Application + ibverbs user library → verb post → RNIC (smart NIC), which handles everything. The socket layer, TCP/IP stack, and NIC driver are bypassed. The CPU is free during data transfer.]

The key piece is the RNIC — an RDMA-capable smart NIC with its own processor that understands RDMA verbs. You talk to it through the libibverbs user-space library; no kernel calls on the data path.

The three pillars

Kernel bypass: the app talks directly to the RNIC via user-space verbs, with no syscalls during data transfer.
Zero-copy: data flows App memory → wire → remote App memory via DMA. The OS never touches it.
CPU offload: reliability, ACKs, and ordering are handled entirely by RNIC hardware.

Memory registration — "here's where you can write"

Before any one-sided RDMA operation, the remote side must register a memory region. Registration pins physical pages and gives the NIC the right to DMA them. The result is an R_Key (remote key) which, paired with the buffer's virtual address, works like a shared secret granting the remote peer access (write access, in this example).

[Figure: Memory registration handshake (one-time setup). Host B (receiver): Application B calls malloc(), then registers the region with ibv_reg_mr(); the registered buffer has VA = 0x7f3c80000000 and R_Key = 0xAB12CD34; RNIC B pins pages, stores the mapping, and issues the R_Key token. An out-of-band exchange (TCP socket, message, etc.) sends VA + R_Key to Host A. Host A (sender): Application A wants to write data to B; RNIC A stores R_Key + VA. The RDMA WRITE then goes directly into B's memory using the R_Key.]

The R_Key is like a door key. Once Host B hands Host A an R_Key and a virtual address, A's RNIC can write directly into B's memory. B's CPU never wakes up — the data just appears.

What ibv_reg_mr() actually does

  1. Pin physical pages. The OS locks the malloc'd buffer so it can't be swapped out. RDMA needs stable physical addresses.
  2. Build the VA → PA mapping. The RNIC stores the translation table so it can DMA correctly when a write arrives.
  3. Issue R_Key + L_Key. R_Key is shared with the remote sender. L_Key is used locally when the NIC reads your own send buffers.
  4. Exchange over any channel. You ship VA + R_Key to Host A however you like — TCP socket, MPI init, config file. This setup happens once.
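The bookkeeping behind those four steps can be modelled in a few lines. This is a toy sketch, not libibverbs: `ToyRnic`, `MemoryRegion`, and `allows_remote_write` are hypothetical names, keys come from a simple counter, and page pinning is not modelled at all.

```python
import itertools
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryRegion:
    va: int             # start of the registered virtual address range
    length: int         # size of the region in bytes
    lkey: int           # used locally, in work requests that name this buffer
    rkey: int           # shipped to the remote peer along with va
    remote_write: bool  # whether the peer may WRITE into this region


class ToyRnic:
    """Hypothetical registration table; models the bookkeeping, not real pinning."""

    def __init__(self):
        self._keys = itertools.count(0x1000)  # real keys are not sequential
        self._by_rkey = {}

    def reg_mr(self, va, length, remote_write=False):
        mr = MemoryRegion(va, length, next(self._keys), next(self._keys), remote_write)
        self._by_rkey[mr.rkey] = mr
        return mr

    def allows_remote_write(self, rkey, va, length):
        """Check an incoming WRITE: known key, write permission, range in bounds."""
        mr = self._by_rkey.get(rkey)
        if mr is None or not mr.remote_write:
            return False
        return mr.va <= va and va + length <= mr.va + mr.length


rnic_b = ToyRnic()
mr = rnic_b.reg_mr(va=0x7F3C80000000, length=4096, remote_write=True)
print(rnic_b.allows_remote_write(mr.rkey, mr.va + 1024, 512))  # inside the region
print(rnic_b.allows_remote_write(mr.rkey, mr.va + 4000, 512))  # runs past the end
print(rnic_b.allows_remote_write(mr.rkey ^ 1, mr.va, 16))      # key was never issued
```

The point of the model is that the remote RNIC can reject a request on its own, using only the table built at registration time.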

RDMA operations

Three operation types. Pick by direction of data flow and which CPU you want to involve.

A WRITE is a one-sided operation. Only Host A's CPU is involved. Host B's CPU is never interrupted — data appears in its buffer silently. Walk through the seven steps:

Step 1 — App posts a Work Request

The application calls ibv_post_send() with a Work Request specifying: opcode=WRITE, the local buffer address + L_Key, the length, and the remote VA + R_Key. This is a user-space call with no kernel involvement.

[Figure: the application calls ibv_post_send(wr); the WR (WRITE | R_Key | VA | len) lands in the Send Queue (SQ). User-space only, no syscall.]
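To make "one-sided" concrete, here is a toy model of a single WRITE. Everything in it is hypothetical (`ToyHost`, `WriteWR`, `rdma_write`); the fields only loosely mirror what a real work request carries, and a bytearray stands in for registered memory.

```python
from dataclasses import dataclass


@dataclass
class WriteWR:
    """Toy work request; field names are illustrative, not struct ibv_send_wr."""
    local_offset: int  # where the payload starts in A's registered buffer
    length: int        # number of bytes to transfer
    remote_va: int     # target virtual address on Host B
    rkey: int          # key B handed out at registration time


class ToyHost:
    def __init__(self, base_va, size, rkey):
        self.base_va = base_va
        self.rkey = rkey
        self.memory = bytearray(size)  # stands in for the registered region
        self.cpu_events = 0            # would be incremented if the host CPU were involved


def rdma_write(src, dst, wr):
    """What the two RNICs do for one WRITE: validate, then copy by DMA."""
    if wr.local_offset + wr.length > len(src.memory):
        return False                   # local buffer does not cover the request
    offset = wr.remote_va - dst.base_va
    in_bounds = 0 <= offset and offset + wr.length <= len(dst.memory)
    if wr.rkey != dst.rkey or not in_bounds:
        return False                   # remote RNIC rejects the request
    payload = src.memory[wr.local_offset:wr.local_offset + wr.length]
    dst.memory[offset:offset + wr.length] = payload
    return True                        # dst.cpu_events was never touched


host_a = ToyHost(base_va=0x1000, size=64, rkey=0x11)
host_b = ToyHost(base_va=0x7F3C80000000, size=64, rkey=0xAB12CD34)
host_a.memory[0:5] = b"hello"
ok = rdma_write(host_a, host_b, WriteWR(0, 5, host_b.base_va + 8, host_b.rkey))
print(ok, bytes(host_b.memory[8:13]), host_b.cpu_events)
```

Host B's code never runs in this exchange: the data lands in its buffer and its event counter stays at zero, which is the property the WRITE walkthrough describes.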

RDMA READ — "fetch from remote memory". Host A tells its RNIC: "go get the data at address X on Host B and drop it in my local buffer". Two phases — request goes out, data comes back. Still one-sided; B's CPU never wakes up.

Step 1 — App posts a READ Work Request

The application calls ibv_post_send() with opcode=READ, the remote VA + R_Key (where to read from on Host B), the local buffer address + L_Key (where to deposit the data), and the length. The WR drops into the Send Queue from user space — no kernel involved.

[Figure: Application A calls ibv_post_send(wr); the WR (READ | R_Key | VA | len) lands in the Send Queue (SQ). User-space only, no syscall.]

READ vs WRITE: WRITE is push (A pushes into B). READ is pull (A pulls from B). Both are one-sided. The difference is latency: a READ needs a full round trip (request out, data back) before any data arrives, whereas WRITE data lands after a single one-way trip.

SEND / RECV — "two-sided, like UDP". Unlike WRITE/READ, SEND requires the remote side to have posted a RECV buffer first. It's two-sided — both CPUs are involved on every operation, but there's no R_Key dance.

Step 1 — Host B posts a RECV WR first

Receive-side setup happens before the SEND ever arrives. Application B calls ibv_post_recv() with a buffer + L_Key. The WR goes into B's Receive Queue, waiting to be matched against an incoming SEND.

[Figure: Application B calls ibv_post_recv(wr); the WR (RECV | buf | L_Key) lands in the Recv Queue (RQ). B sets up first, before A sends.]

When to use SEND vs WRITE: SEND for small control messages or when you'd rather skip the R_Key and address exchange (buffers on both sides still have to be registered). WRITE/READ for bulk data, which is faster because B's CPU never wakes up.
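The two-sided rule, that a RECV must already be posted when a SEND arrives, can be sketched as a FIFO match. `ToyReceiver` and its return strings are hypothetical; "RNR" borrows the receiver-not-ready term, and real hardware behaviour (retries, timers, error states) is not modelled.

```python
from collections import deque


class ToyReceiver:
    """Toy model of Host B's Receive Queue; not a real verbs object."""

    def __init__(self):
        self.recv_queue = deque()  # posted RECV buffers, consumed in FIFO order
        self.completions = []      # what B's application sees afterwards

    def post_recv(self, capacity):
        self.recv_queue.append(bytearray(capacity))

    def on_send_arrival(self, payload):
        if not self.recv_queue:
            return "RNR"           # receiver not ready: no RECV was posted
        buf = self.recv_queue.popleft()
        if len(payload) > len(buf):
            return "ERROR"         # message does not fit the posted buffer
        buf[:len(payload)] = payload
        self.completions.append(bytes(buf[:len(payload)]))
        return "OK"


host_b = ToyReceiver()
print(host_b.on_send_arrival(b"too early"))  # arrives before any RECV is posted
host_b.post_recv(16)
print(host_b.on_send_arrival(b"ping"))       # matched against the posted buffer
print(host_b.completions)
```

Unlike the one-sided operations, the sender never names an address on B: the receiver decides where the message lands by choosing which buffer it posts.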

Queue Pairs and reliability

Every RDMA connection uses a Queue Pair (QP): a Send Queue (SQ) and a Receive Queue (RQ). Instead of sockets, you post Work Requests into the SQ. The NIC processes them asynchronously, and the results show up in a Completion Queue (CQ).

[Figure: Queue Pair architecture. The application posts WRs to the SQ (Send Queue); the RQ (Recv Queue) sits alongside it. The RNIC processes WRs: its DMA engine sends and receives, it handles ACKs, and it posts completions to the CQ. Across the wire, the remote RNIC validates the R_Key, DMAs into memory, and sends an ACK, with no CPU wakeup.]
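The post-then-poll pattern is the main thing that differs from blocking socket calls. The toy below uses hypothetical names (`ToyQueuePair`, `nic_progress`) and runs the "NIC" as an explicit function call so that the asynchrony is visible; a real RNIC makes progress on its own.

```python
from collections import deque


class ToyQueuePair:
    """Toy Send Queue plus Completion Queue; not a real verbs QP."""

    def __init__(self):
        self.send_queue = deque()
        self.completion_queue = deque()

    def post_send(self, wr_id, payload):
        # Posting only enqueues the request; it returns immediately.
        self.send_queue.append((wr_id, payload))

    def nic_progress(self):
        """Stands in for the RNIC draining the SQ in the background."""
        while self.send_queue:
            wr_id, payload = self.send_queue.popleft()
            self.completion_queue.append((wr_id, "SUCCESS", len(payload)))

    def poll_cq(self, max_entries):
        count = min(max_entries, len(self.completion_queue))
        return [self.completion_queue.popleft() for _ in range(count)]


qp = ToyQueuePair()
qp.post_send(wr_id=1, payload=b"a" * 100)
qp.post_send(wr_id=2, payload=b"b" * 200)
print(qp.poll_cq(8))  # nothing has completed yet
qp.nic_progress()
print(qp.poll_cq(8))  # both completions, in posting order
```

The `wr_id` is how the application ties a completion back to the request it posted, since the call that posted it returned long before.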

QP types — pick your reliability level

Like TCP vs UDP, RDMA lets you choose how much reliability you need. You set this when you create the QP.

RC — Reliable Connection

The workhorse. Ordered, reliable, ACKed — think TCP for RDMA. Almost every production workload (MPI, storage, ML training) runs on RC. Supports WRITE, READ, and SEND/RECV. Trade-off: N² QPs for full N-to-N mesh.
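The N² trade-off is easy to quantify. The sketch assumes one RC QP per peer at each endpoint (so each connection costs a QP at both ends); `rc_full_mesh_qps` is a hypothetical helper, and real deployments may use more than one QP per peer.

```python
def rc_full_mesh_qps(endpoints):
    """Return (QPs per endpoint, QPs across the whole mesh) for RC."""
    per_endpoint = endpoints - 1  # one QP for every remote peer
    return per_endpoint, endpoints * per_endpoint


for n in (8, 64, 1024):
    per_endpoint, total = rc_full_mesh_qps(n)
    print(f"N={n:5d}: {per_endpoint:5d} QPs per endpoint, {total:8d} in total")
```

The per-endpoint count grows linearly, but every QP holds state on the RNIC, so the total across a large cluster grows quadratically.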

How RDMA handles ACKs

In RC mode, every packet carries a PSN (Packet Sequence Number), tracked entirely in hardware. There is no kernel-managed TCP timer and no kernel involvement. The receiving RNIC sends back an ACK packet carrying an AETH (ACK Extended Transport Header). On a NACK, the sender RNIC retransmits.

[Figure: RC reliable delivery, hardware ACK flow. The sender RNIC transmits DATA packets with PSN=1, 2, 3 (each with BTH and CRC); PSN=3 is dropped. The receiver RNIC returns NACK | AETH | PSN=3 (hardware NACK, no CPU involved). The sender retransmits DATA PSN=3, the receiver returns ACK | AETH | PSN=3, and a CQ completion is posted to the app.]
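The PSN-driven recovery can be simulated in a few lines. This sketch assumes go-back-N style retransmission (the sender resumes from the first missing PSN), which this page does not spell out, and it collapses the NACK and timeout paths into simple state changes. `deliver` and `drop_once` are hypothetical names.

```python
PSN_MASK = (1 << 24) - 1  # PSNs are 24-bit values and wrap around


def deliver(num_packets, first_psn, drop_once):
    """Return every PSN put on the wire, in order, until all packets are accepted.

    drop_once: PSNs that are lost on their first transmission only.
    """
    psns = [(first_psn + i) & PSN_MASK for i in range(num_packets)]
    dropped = set(drop_once)
    wire_log = []
    expected_idx = 0  # receiver: index of the next PSN it will accept
    send_idx = 0      # sender: index of the next packet to transmit
    while expected_idx < num_packets:
        if send_idx == num_packets:
            # Tail loss: no later packet exposes the gap, so a timeout fires.
            send_idx = expected_idx
            continue
        psn = psns[send_idx]
        wire_log.append(psn)
        if psn in dropped:
            dropped.discard(psn)     # lost in flight, first attempt only
        elif send_idx == expected_idx:
            expected_idx += 1        # in order: the receiver accepts it
        else:
            send_idx = expected_idx  # gap detected: NACK, sender goes back
            continue
        send_idx += 1
    return wire_log


print(deliver(num_packets=4, first_psn=1, drop_once={3}))
print(deliver(num_packets=3, first_psn=PSN_MASK - 1, drop_once=set()))
```

In the first run, the packet after the dropped one is what reveals the gap, so both it and the dropped packet appear on the wire twice. The second run only shows the 24-bit wraparound.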

Key headers in an RDMA packet

Header | What it carries
LRH | Local Routing Header: used by switches for forwarding within an IB subnet. Has credit-based flow control built in.
BTH | Base Transport Header: PSN, QP number, opcode (WRITE/READ/ACK etc.).
RETH | RDMA Extended Transport Header: remote VA and R_Key (WRITE/READ packets only).
AETH | ACK Extended Transport Header: in ACK/NACK packets. Carries the syndrome (success/NAK code) and MSN.
ICRC | Invariant CRC: detects bit errors. Corrupted packets are dropped; sender gets a NACK.
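Packing the three transport headers by hand shows how little they carry. The field widths below (12-byte BTH, 16-byte RETH, 4-byte AETH) and the opcode value are written from memory of the InfiniBand specification and should be treated as assumptions to verify, not as a reference encoding. The function names are hypothetical.

```python
import struct


def pack_bth(opcode, dest_qp, psn, pkey=0xFFFF, ack_req=False):
    """Base Transport Header: opcode, flags, partition key, dest QP, PSN."""
    flags = 0                                    # SE, M, PadCnt, TVer left at zero
    qp_word = dest_qp & 0xFFFFFF                 # upper 8 bits are reserved
    psn_word = (int(ack_req) << 31) | (psn & 0xFFFFFF)
    return struct.pack("!BBHII", opcode, flags, pkey, qp_word, psn_word)


def pack_reth(remote_va, rkey, dma_length):
    """RDMA Extended Transport Header: where to read or write, and how much."""
    return struct.pack("!QII", remote_va, rkey, dma_length)


def pack_aeth(syndrome, msn):
    """ACK Extended Transport Header: 8-bit syndrome plus 24-bit MSN."""
    return struct.pack("!I", ((syndrome & 0xFF) << 24) | (msn & 0xFFFFFF))


RC_RDMA_WRITE_ONLY = 0x0A  # assumed opcode value; check the spec before relying on it

headers = pack_bth(RC_RDMA_WRITE_ONLY, dest_qp=0x12, psn=3, ack_req=True)
headers += pack_reth(remote_va=0x7F3C80000000, rkey=0xAB12CD34, dma_length=4096)
print(len(headers), headers.hex())
print(len(pack_aeth(syndrome=0, msn=1)))
```

A WRITE request therefore needs only the BTH plus a RETH ahead of the payload, and an ACK needs only the BTH plus a 4-byte AETH.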

RoCE vs InfiniBand: RoCE v2 runs the same InfiniBand transport layer (BTH, RETH, AETH) over UDP/IP/Ethernet. The reliability semantics are identical — only the physical network and routing headers differ.

RDMA in one mental model

If you know TCP, here's the mapping:

TCP concept | RDMA equivalent
Socket | Queue Pair (QP)
connect() | ibv_create_qp() + CM setup
send() / recv() | ibv_post_send() / ibv_post_recv() (SEND/RECV)
(no direct equivalent) | ibv_post_send(WRITE): remote CPU never wakes up
(no direct equivalent) | ibv_post_send(READ): pull from remote memory silently
TCP sequence numbers | PSN (in the BTH header)
TCP ACK / retransmit | AETH header, all in RNIC hardware
Socket receive buffer | Registered memory region (ibv_reg_mr)
TCP port pair | QP number (24-bit)
epoll / completion | Completion Queue (CQ), polled or event-driven

The shift to internalise is this. TCP is stream-oriented: the OS delivers a byte stream into your receive buffer. RDMA is memory-oriented: you give the NIC addresses, and it moves data between those addresses, bypassing everything else.