Understanding RDMA

Remote Direct Memory Access (RDMA) moves data between machines at close to wire speed, keeping the OS kernel and most CPU work out of the data path. If you know TCP, much of the model will already feel familiar. This page builds the mental model.

TCP vs RDMA, at a glance

	TCP / sockets	RDMA
Data path	App → kernel → NIC	App → NIC directly
CPU cost	High — every packet involves the OS	Near-zero during transfer
Memory copies	2–3 per send	Zero-copy (DMA only)
Typical latency	50–200 µs	1–5 µs
Who handles ACKs	OS / TCP stack	NIC hardware (RNIC)

Why TCP has overhead

TCP was designed when CPU time was cheap and network speeds were slow. Every send() goes through a long chain, and the OS touches every byte at least twice.

TCP data flow — one send()

Each call costs a system call (a kernel mode switch), 2–3 memory copies, and multiple CPU interrupts on receive. At 10 Gbps and above, this stack is the bottleneck — not the wire.
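
To put a number on that, here is a back-of-the-envelope sketch of how little time the CPU has per packet on a saturated link. It assumes every packet carries a full 1500-byte payload and ignores framing overhead, so treat the figures as rough orders of magnitude rather than measurements.

```python
# Rough per-packet time budget on a saturated link.
# Assumption: every packet is 1500 bytes; Ethernet framing overhead is ignored.

def packets_per_second(link_gbps: float, packet_bytes: int = 1500) -> float:
    """Packets per second needed to fill the link."""
    return link_gbps * 1e9 / (packet_bytes * 8)

def budget_ns_per_packet(link_gbps: float, packet_bytes: int = 1500) -> float:
    """Nanoseconds between packet arrivals when the link is full."""
    return 1e9 / packets_per_second(link_gbps, packet_bytes)

for gbps in (1, 10, 100):
    print(f"{gbps:>3} Gbps: {packets_per_second(gbps):>12,.0f} pkt/s, "
          f"{budget_ns_per_packet(gbps):>8,.0f} ns per packet")
```

At 10 Gbps the budget works out to roughly 1.2 µs per packet, which leaves little room for a system call plus several copies.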

Kernel bypass — RDMA's shortcut

RDMA cuts the OS out of the data path entirely. The application posts a Work Request straight to the NIC from user space. No system call. No kernel. No copies.

TCP path (8 hops)

RDMA path (2 hops)

The key piece is the RNIC — an RDMA-capable smart NIC with its own processor that understands RDMA verbs. You talk to it through the libibverbs user-space library; no kernel calls on the data path.

The three pillars

Kernel bypass	App talks directly to the RNIC via user-space verbs — no syscalls during data transfer.
Zero-copy	Data flows App memory → wire → remote App memory via DMA. The OS never touches it.
CPU offload	Reliability, ACKs, and ordering are handled entirely by RNIC hardware.

Memory registration — "here's where you can write"

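Before any RDMA operation, each side must register the memory it will use; for a WRITE or READ, the target side registers the region it is exposing. Registration pins physical pages and gives the NIC the right to DMA them. The result is an R_Key (remote key) which, together with the region's virtual address, works like a capability: whoever holds it gets the remote access that was enabled at registration time (for example, remote write).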

Memory registration handshake (one-time setup)

The R_Key is like a door key. Once Host B hands Host A an R_Key and a virtual address, A's RNIC can write directly into B's memory. B's CPU never wakes up — the data just appears.

What ibv_reg_mr() actually does

Pin physical pages. The OS locks the malloc'd buffer so it can't be swapped out. RDMA needs stable physical addresses.
Build the VA → PA mapping. The RNIC stores the translation table so it can DMA correctly when a write arrives.
Issue R_Key + L_Key. R_Key is shared with the remote sender. L_Key is used locally when the NIC reads your own send buffers.
Exchange over any channel. You ship VA + R_Key to Host A however you like — TCP socket, MPI init, config file. This setup happens once.
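
The four steps above can be sketched as a toy model. This is not the libibverbs API: ToyRnic, ToyMR, reg_mr, check_remote_write and advertise are hypothetical names, the "pinned pages" are just a bytearray, and the keys and addresses are made-up numbers. It only shows which pieces of state exist and what gets shipped to the peer.

```python
import itertools
import json
from dataclasses import dataclass

@dataclass
class ToyMR:
    buf: bytearray        # stands in for the pinned physical pages
    addr: int             # pretend virtual address of buf[0]
    lkey: int             # used locally, in your own work requests
    rkey: int             # handed to the remote peer
    remote_write: bool    # access right chosen at registration time

class ToyRnic:
    """Toy registration table (hypothetical, not the libibverbs API)."""

    def __init__(self):
        self._mrs = {}                       # rkey -> ToyMR
        self._ids = itertools.count(1)
        self._next_addr = 0x7F0000000000     # arbitrary fake address space

    def reg_mr(self, buf: bytearray, remote_write: bool = False) -> ToyMR:
        n = next(self._ids)
        mr = ToyMR(buf, self._next_addr, 0x1000 + n, 0x2000 + n, remote_write)
        self._next_addr += len(buf)
        self._mrs[mr.rkey] = mr
        return mr

    def check_remote_write(self, rkey: int, addr: int, length: int) -> ToyMR:
        """Validate an incoming WRITE: known key, write permission, in bounds."""
        mr = self._mrs.get(rkey)
        if mr is None or not mr.remote_write:
            raise PermissionError("bad R_Key or remote write not allowed")
        if addr < mr.addr or addr + length > mr.addr + len(mr.buf):
            raise PermissionError("access outside the registered region")
        return mr

def advertise(mr: ToyMR) -> str:
    """Serialize VA + R_Key for the one-time out-of-band exchange."""
    return json.dumps({"addr": mr.addr, "rkey": mr.rkey, "length": len(mr.buf)})

rnic_b = ToyRnic()
mr_b = rnic_b.reg_mr(bytearray(4096), remote_write=True)
print(advertise(mr_b))   # what Host B would send to Host A over TCP, MPI, etc.
```

Note that the L_Key never leaves the host; only the address, length and R_Key are advertised.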

RDMA operations

Three operation types. Pick by direction of data flow and which CPU you want to involve.

A WRITE is a one-sided operation. Only Host A's CPU is involved. Host B's CPU is never interrupted — data appears in its buffer silently. Walk through the seven steps:

Step 1 — App posts a Work Request

The application calls ibv_post_send() with a Work Request specifying: opcode=WRITE, local buffer address + L_Key, length, and remote VA + R_Key. This is a user-space call; no kernel transition is involved.
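
As a rough illustration of what that Work Request carries and what the remote RNIC does with it, here is a toy model. WriteWR and ToyHost are hypothetical names, not libibverbs structures, and the local buffer is passed as plain bytes instead of an address + L_Key pair. The cpu_wakeups counter exists only to make the "one-sided" point visible.

```python
from dataclasses import dataclass

@dataclass
class WriteWR:
    """Fields the text lists for a WRITE work request (illustrative names)."""
    opcode: str
    local_buf: bytes     # stands in for local address + L_Key
    length: int
    remote_addr: int     # remote VA
    rkey: int

class ToyHost:
    """Host with one registered region; tracks how often its CPU is involved."""

    def __init__(self, size: int, base_addr: int, rkey: int):
        self.mem = bytearray(size)
        self.base_addr = base_addr
        self.rkey = rkey
        self.cpu_wakeups = 0     # a one-sided WRITE never increments this

    def rnic_handle_write(self, wr: WriteWR) -> None:
        # Everything in this method models work done by B's RNIC, not B's CPU.
        if wr.rkey != self.rkey:
            raise PermissionError("R_Key mismatch")
        if wr.length > len(wr.local_buf):
            raise ValueError("length exceeds the local buffer")
        off = wr.remote_addr - self.base_addr
        if off < 0 or off + wr.length > len(self.mem):
            raise PermissionError("write outside the registered region")
        self.mem[off:off + wr.length] = wr.local_buf[:wr.length]

host_b = ToyHost(size=64, base_addr=0x5000, rkey=0xBEEF)
wr = WriteWR("WRITE", b"hello", 5, remote_addr=0x5008, rkey=0xBEEF)
host_b.rnic_handle_write(wr)
print(bytes(host_b.mem[8:13]), host_b.cpu_wakeups)
```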

RDMA READ — "fetch from remote memory". Host A tells its RNIC: "go get the data at address X on Host B and drop it in my local buffer". Two phases — request goes out, data comes back. Still one-sided; B's CPU never wakes up.

Step 1 — App posts a READ Work Request

The application calls ibv_post_send() with opcode=READ, the remote VA + R_Key (where to read from on Host B), the local buffer address + L_Key (where to deposit the data), and the length. The WR drops into the Send Queue from user space — no kernel involved.

READ vs WRITE: WRITE is push (A pushes into B). READ is pull (A pulls from B). Both are one-sided. The difference is latency: WRITE data lands on B after a one-way trip, while READ data reaches A only after a full round trip (request out, data back).
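
A tiny latency model makes the push/pull difference concrete. It is a simplification under stated assumptions: a fixed one-way delay, payload time equal to size divided by link rate, and no NIC processing or queueing time. The function names are made up for this sketch.

```python
# Simplified model: fixed one-way delay plus time to serialize the payload.
# Assumptions: no NIC processing time, no queueing, no packet loss.

def serialization_us(size_bytes: int, link_gbps: float) -> float:
    """Time to put the payload on the wire, in microseconds."""
    return size_bytes * 8 / (link_gbps * 1e3)   # bits / (bits per microsecond)

def write_data_lands_us(one_way_us: float, size_bytes: int, link_gbps: float) -> float:
    """WRITE: A pushes; data is in B's memory after one trip."""
    return one_way_us + serialization_us(size_bytes, link_gbps)

def read_data_lands_us(one_way_us: float, size_bytes: int, link_gbps: float) -> float:
    """READ: request travels A -> B, then data travels B -> A."""
    return 2 * one_way_us + serialization_us(size_bytes, link_gbps)

w = write_data_lands_us(1.0, 4096, 100)
r = read_data_lands_us(1.0, 4096, 100)
print(f"WRITE data at B after {w:.3f} us, READ data at A after {r:.3f} us")
```

In this model the gap is exactly one one-way delay, regardless of payload size.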

SEND / RECV — "two-sided, like UDP". Unlike WRITE/READ, SEND requires the remote side to have posted a RECV buffer first. It's two-sided — both CPUs are involved on every operation, but there's no R_Key dance.

Step 1 — Host B posts a RECV WR first

Receive-side setup happens before the SEND ever arrives. Application B calls ibv_post_recv() with a buffer + L_Key. The WR goes into B's Receive Queue, waiting to be matched against an incoming SEND.
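
The matching rule can be sketched as a toy Receive Queue. ToyReceiveQueue and ReceiverNotReady are hypothetical names, and the sketch assumes the usual behaviour that posted RECVs are consumed in FIFO order and that a SEND arriving with nothing posted is refused as "receiver not ready".

```python
from collections import deque

class ReceiverNotReady(Exception):
    """Raised when a SEND arrives and no RECV buffer has been posted."""

class ToyReceiveQueue:
    """Toy model of B's Receive Queue (hypothetical, not the libibverbs API)."""

    def __init__(self):
        self._posted = deque()    # RECV work requests, oldest first
        self.completions = []     # (wr_id, bytes_received)

    def post_recv(self, wr_id: int, buf: bytearray) -> None:
        self._posted.append((wr_id, buf))

    def deliver_send(self, payload: bytes) -> int:
        """An incoming SEND consumes the oldest posted RECV buffer."""
        if not self._posted:
            raise ReceiverNotReady("no RECV posted")
        wr_id, buf = self._posted[0]
        if len(payload) > len(buf):
            raise ValueError("message larger than the posted buffer")
        self._posted.popleft()
        buf[:len(payload)] = payload
        self.completions.append((wr_id, len(payload)))
        return wr_id

rq_b = ToyReceiveQueue()
buf = bytearray(16)
rq_b.post_recv(wr_id=7, buf=buf)       # B posts first
rq_b.deliver_send(b"ping")             # then A's SEND can land
print(bytes(buf[:4]), rq_b.completions)
```

Unlike WRITE, the sender never names an address; the receiver decides where the message lands by choosing which buffer to post.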

When to use SEND vs WRITE: SEND for small control messages or when you'd rather not exchange addresses and R_Keys up front (buffers still need to be registered locally). WRITE/READ for bulk data: faster, because B's CPU never wakes up.

Queue Pairs and reliability

Every RDMA connection uses a Queue Pair (QP): a Send Queue (SQ) and a Receive Queue (RQ). Instead of sockets, you post Work Requests into the SQ. The NIC processes them asynchronously, and the results show up in a Completion Queue (CQ).
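
Here is a minimal sketch of that post-then-poll rhythm. ToyQP is a hypothetical stand-in, not a libibverbs object; nic_process plays the role of the hardware, which in reality runs on its own without being called. The Receive Queue is left out to keep the example short.

```python
from collections import deque

class ToyQP:
    """Toy Send Queue + Completion Queue (hypothetical, not the libibverbs API)."""

    def __init__(self):
        self.sq = deque()    # Send Queue: work requests waiting for the NIC
        self.cq = deque()    # Completion Queue: results of finished requests

    def post_send(self, wr_id: int, opcode: str, payload: bytes = b"") -> None:
        """Returns immediately; the transfer itself happens later, in the NIC."""
        self.sq.append((wr_id, opcode, payload))

    def nic_process(self, max_wrs: int = 1) -> None:
        """Stand-in for the hardware draining the SQ in posting order."""
        for _ in range(min(max_wrs, len(self.sq))):
            wr_id, opcode, payload = self.sq.popleft()
            self.cq.append({"wr_id": wr_id, "opcode": opcode,
                            "status": "success", "byte_len": len(payload)})

    def poll_cq(self, max_entries: int) -> list:
        """Non-blocking: returns zero or more completions."""
        out = []
        while self.cq and len(out) < max_entries:
            out.append(self.cq.popleft())
        return out

qp = ToyQP()
qp.post_send(1, "WRITE", b"abcd")
print(qp.poll_cq(8))        # nothing yet: posting is not completing
qp.nic_process()
print(qp.poll_cq(8))        # the completion for wr_id 1
```

The wr_id is how the application ties a completion back to the request it posted.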

Queue Pair architecture

QP types — pick your reliability level

Like TCP vs UDP, RDMA lets you choose how much reliability you need. You set this when you create the QP.

RC — Reliable Connection

The workhorse. Ordered, reliable, ACKed — think TCP for RDMA. Almost every production workload (MPI, storage, ML training) runs on RC. Supports WRITE, READ, and SEND/RECV. Trade-off: N² QPs for full N-to-N mesh.
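
The N² trade-off is easy to quantify. The sketch below assumes one RC QP per peer on every process, which is the simplest full-mesh layout; real deployments may share or multiply QPs, so the numbers are illustrative.

```python
# QP count for a full mesh of n processes.
# Assumption: each process opens exactly one RC QP to every other process.

def qps_per_process(n: int) -> int:
    return n - 1

def qps_total(n: int) -> int:
    return n * (n - 1)

for n in (8, 64, 1024):
    print(f"N={n:>5}: {qps_per_process(n):>5} QPs per process, {qps_total(n):>9,} in total")
```

Each QP holds state on the RNIC, which is why the quadratic growth matters at scale.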

How RDMA handles ACKs

In RC mode, every packet carries a PSN (Packet Sequence Number), tracked entirely in hardware. No kernel timers, no OS involvement. The receiving RNIC sends back an ACK packet carrying an AETH (ACK Extended Transport Header). On a NAK or a timeout, the sender RNIC retransmits.
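
The PSN check can be sketched in a few lines. This is a toy model under assumptions: the PSN is treated as a 24-bit counter that wraps, a gap triggers a sequence-error NAK, duplicates are simply re-acknowledged, and the sender resumes from the PSN the receiver expects. ToyRcResponder and send_with_retransmit are hypothetical names, and timeout-based recovery is not modelled.

```python
PSN_BITS = 24
PSN_MASK = (1 << PSN_BITS) - 1
HALF_WINDOW = 1 << (PSN_BITS - 1)

def next_psn(psn: int) -> int:
    return (psn + 1) & PSN_MASK        # wraps from 0xFFFFFF back to 0

class ToyRcResponder:
    """Receiver side: tracks the next PSN it expects."""

    def __init__(self, expected_psn: int):
        self.expected = expected_psn & PSN_MASK

    def on_packet(self, psn: int):
        delta = (psn - self.expected) & PSN_MASK
        if delta == 0:                      # in order
            self.expected = next_psn(psn)
            return ("ACK", psn)
        if delta < HALF_WINDOW:             # ahead of expected: something was lost
            return ("NAK_SEQ_ERR", self.expected)
        return ("ACK", psn)                 # behind expected: duplicate (simplified)

def send_with_retransmit(psns, responder, lose_once):
    """Sender side: on a NAK, go back to the PSN the responder expects."""
    lost, log, i = set(lose_once), [], 0
    while i < len(psns):
        psn = psns[i]
        if psn in lost:                     # drop this packet the first time only
            lost.discard(psn)
            log.append((psn, "LOST"))
            i += 1
            continue
        kind, value = responder.on_packet(psn)
        log.append((psn, kind))
        i = psns.index(value) if kind == "NAK_SEQ_ERR" else i + 1
    return log

responder = ToyRcResponder(0xFFFFFE)
for psn, event in send_with_retransmit([0xFFFFFE, 0xFFFFFF, 0, 1], responder, {0xFFFFFF}):
    print(f"PSN {psn:#08x}: {event}")
```

If the last packet of a burst is lost, no later packet exists to trigger the NAK; that is the case a retransmit timer covers, and this sketch leaves it out.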

RC reliable delivery — hardware ACK flow

Key headers in an RDMA packet

Header	What it carries
`LRH`	Local Routing Header: used by switches for forwarding within an IB subnet. Carries the virtual lane (VL) on which InfiniBand's credit-based link-level flow control operates.
`BTH`	Base Transport Header — PSN, QP number, opcode (WRITE/READ/ACK etc.).
`RETH`	RDMA Extended Transport Header — remote VA and R_Key (WRITE/READ packets only).
`AETH`	ACK Extended Transport Header — in ACK/NACK packets. Carries the syndrome (success/NAK code) and MSN.
`ICRC`	Invariant CRC: detects bit errors. Corrupted packets are dropped silently; the sender recovers through a sequence-error NAK triggered by a later packet, or through its retransmit timer.
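
To make the BTH and RETH rows concrete, here is a sketch that packs and parses both headers. The bit positions reflect my reading of the InfiniBand layout (12-byte BTH, 16-byte RETH) and the opcode value 0x0A for an RC "RDMA WRITE Only" packet is likewise an assumption to verify against the specification before relying on it. The helper names are made up for this example.

```python
import struct

# Assumed layouts (big-endian):
# BTH:  opcode(8) | SE(1) M(1) PadCnt(2) TVer(4) | P_Key(16)
#       | reserved(8) DestQP(24) | AckReq(1) reserved(7) PSN(24)
# RETH: virtual address(64) | R_Key(32) | DMA length(32)
BTH = struct.Struct(">BBHII")
RETH = struct.Struct(">QII")

RC_RDMA_WRITE_ONLY = 0x0A    # assumed opcode value, see lead-in

def pack_bth(opcode: int, dest_qp: int, psn: int,
             ack_req: bool = False, pkey: int = 0xFFFF) -> bytes:
    flags = 0                                    # SE=0, M=0, PadCnt=0, TVer=0
    word_qp = dest_qp & 0xFFFFFF                 # top 8 bits reserved
    word_psn = (int(ack_req) << 31) | (psn & 0xFFFFFF)
    return BTH.pack(opcode, flags, pkey, word_qp, word_psn)

def parse_bth(data: bytes) -> dict:
    opcode, flags, pkey, word_qp, word_psn = BTH.unpack_from(data)
    return {"opcode": opcode, "pad_count": (flags >> 4) & 0x3, "pkey": pkey,
            "dest_qp": word_qp & 0xFFFFFF,
            "ack_req": bool(word_psn >> 31), "psn": word_psn & 0xFFFFFF}

def pack_reth(remote_va: int, rkey: int, dma_len: int) -> bytes:
    return RETH.pack(remote_va, rkey, dma_len)

def parse_reth(data: bytes) -> dict:
    remote_va, rkey, dma_len = RETH.unpack_from(data)
    return {"remote_va": remote_va, "rkey": rkey, "dma_len": dma_len}

headers = (pack_bth(RC_RDMA_WRITE_ONLY, dest_qp=0x12, psn=100, ack_req=True)
           + pack_reth(0x7F0000001000, 0x2001, 4096))
print(parse_bth(headers))
print(parse_reth(headers[BTH.size:]))
```

The 24-bit QP number and 24-bit PSN both sit in the BTH, which is why every packet can be steered and sequenced without any per-packet software.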

RoCE vs InfiniBand: RoCE v2 runs the same InfiniBand transport layer (BTH, RETH, AETH) over UDP/IP/Ethernet. The reliability semantics are identical — only the physical network and routing headers differ.

RDMA in one mental model

If you know TCP, here's the mapping:

TCP concept	RDMA equivalent
Socket	Queue Pair (QP)
`connect()`	`ibv_create_qp()` + CM setup
`send()` / `recv()`	`ibv_post_send()` / `ibv_post_recv()` (SEND/RECV)
—	`ibv_post_send(WRITE)` — remote CPU never wakes up
—	`ibv_post_send(READ)` — pull from remote memory silently
TCP sequence numbers	PSN (in the BTH header)
TCP ACK / retransmit	AETH header, all in RNIC hardware
Socket receive buffer	Registered memory region (`ibv_reg_mr`)
TCP port pair	QP number (24-bit)
`epoll` / completion	Completion Queue (CQ) — poll or event

The shift to internalise: TCP is stream-oriented, and the OS delivers bytes into your receive buffer. RDMA is memory-oriented: you give the NIC addresses, and it moves data between those addresses, bypassing everything else.