Understanding RDMA
Remote Direct Memory Access moves data between machines at close to wire speed, keeping the OS kernel out of the data path and leaving the CPU almost idle during transfers. If you know TCP, you're 80% of the way there. This page builds the mental model.
TCP vs RDMA, at a glance
| | TCP / sockets | RDMA |
|---|---|---|
| Data path | App → kernel → NIC | App → NIC directly |
| CPU cost | High — every packet involves the OS | Near-zero during transfer |
| Memory copies | 2–3 per send | Zero-copy (DMA only) |
| Typical latency | 50–200 µs | 1–5 µs |
| Who handles ACKs | OS / TCP stack | NIC hardware (RNIC) |
Why TCP has overhead
TCP was designed when networks were slow relative to CPUs, so spending CPU time on every packet was affordable. Every send() goes through a long chain (socket layer, TCP/IP stack, driver, NIC), and the OS touches every byte at least twice.
Each call costs a system call (a kernel mode switch), 2–3 memory copies, and multiple CPU interrupts on receive. At 10 Gbps and above, this stack is the bottleneck — not the wire.
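To see why the per-packet budget gets tight, here is a rough back-of-the-envelope calculation. It assumes 1500-byte packets and ignores Ethernet framing overhead; both are simplifying assumptions, not figures from this page.

```python
def per_packet_budget_ns(link_gbps: float, packet_bytes: int = 1500) -> float:
    """Time available to process one packet at line rate, in nanoseconds."""
    bits_per_packet = packet_bytes * 8
    packets_per_second = link_gbps * 1e9 / bits_per_packet
    return 1e9 / packets_per_second


for gbps in (1, 10, 100):
    print(f"{gbps:>3} Gbps: {per_packet_budget_ns(gbps):8.0f} ns per packet")
```

Under these assumptions, 10 Gbps leaves roughly 1.2 µs per packet, which is little room for a system call, several copies, and interrupt handling.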
Kernel bypass — RDMA's shortcut
RDMA cuts the OS out of the data path entirely. The application posts a Work Request straight to the NIC from user space. No system call. No kernel. No copies.
The key piece is the RNIC — an RDMA-capable smart NIC with its own processor that understands RDMA verbs. You talk to it through the libibverbs user-space library; no kernel calls on the data path.
The three pillars
| Pillar | What it means |
|---|---|
| Kernel bypass | App talks directly to the RNIC via user-space verbs — no syscalls during data transfer. |
| Zero-copy | Data flows App memory → wire → remote App memory via DMA. The OS never touches it. |
| CPU offload | Reliability, ACKs, and ordering are handled entirely by RNIC hardware. |
Memory registration — "here's where you can write"
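Before any RDMA operation, every buffer the NIC will touch must be registered as a memory region; for a one-sided WRITE or READ, that includes the target buffer on the remote side. Registration pins physical pages and gives the NIC the right to DMA them. The result is an R_Key (remote key) which, together with the buffer's virtual address, works like a shared secret: whoever holds both can access the region with the permissions granted at registration.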
The R_Key is like a door key. Once Host B hands Host A an R_Key and a virtual address, A's RNIC can write directly into B's memory. B's CPU never wakes up — the data just appears.
What ibv_reg_mr() actually does
- Pin physical pages. The OS locks the malloc'd buffer so it can't be swapped out. RDMA needs stable physical addresses.
- Build the VA → PA mapping. The RNIC stores the translation table so it can DMA correctly when a write arrives.
- Issue R_Key + L_Key. R_Key is shared with the remote sender. L_Key is used locally when the NIC reads your own send buffers.
- Exchange over any channel. You ship VA + R_Key to Host A however you like — TCP socket, MPI init, config file. This setup happens once.
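The last step, shipping the VA and R_Key to the peer, could look something like the sketch below. The 16-byte wire layout (64-bit address, 32-bit R_Key, 32-bit length) and the helper names are assumptions chosen for illustration; verbs does not define an exchange format, and the sample values are made up.

```python
import struct

# Hypothetical wire format: network byte order, 64-bit VA, 32-bit R_Key, 32-bit length.
MR_DESCRIPTOR = struct.Struct("!QII")


def pack_mr_descriptor(virtual_addr: int, rkey: int, length: int) -> bytes:
    """Serialize what Host B must tell Host A about its registered region."""
    return MR_DESCRIPTOR.pack(virtual_addr, rkey, length)


def unpack_mr_descriptor(blob: bytes) -> tuple[int, int, int]:
    """Recover (virtual_addr, rkey, length) on Host A."""
    return MR_DESCRIPTOR.unpack(blob)


# Made-up values standing in for the address and rkey that ibv_reg_mr() would yield.
blob = pack_mr_descriptor(0x7F3A_1000_0000, 0x1234_ABCD, 4096)
addr, rkey, length = unpack_mr_descriptor(blob)
print(len(blob), hex(addr), hex(rkey), length)
```

The resulting bytes can travel over whatever out-of-band channel is already in place, such as a TCP socket opened during startup.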
RDMA operations
Three operation types. Pick by direction of data flow and which CPU you want to involve.
A WRITE is a one-sided operation. Only Host A's CPU is involved. Host B's CPU is never interrupted — data appears in its buffer silently. Walk through the seven steps:
Step 1 — App posts a Work Request
The application calls ibv_post_send() with a Work Request specifying: opcode=WRITE, local buffer address + L_Key, length, remote VA + R_Key. This is a user-space call; no kernel involved.
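What the remote RNIC does with such a request can be modelled in a few lines. This is a toy simulation, not verbs code: the class and function names are hypothetical, and the "memory" is just a Python bytearray. It only illustrates the idea that the R_Key and address range are checked before any bytes land.

```python
from dataclasses import dataclass


@dataclass
class ToyMemoryRegion:
    """Stand-in for a registered region on Host B (not a real verbs object)."""
    base_addr: int
    rkey: int
    data: bytearray


@dataclass
class ToyWriteRequest:
    """Fields mirror the Work Request described above."""
    payload: bytes      # contents of the local buffer
    remote_addr: int    # remote VA
    rkey: int           # R_Key received from Host B


def toy_rdma_write(mr: ToyMemoryRegion, wr: ToyWriteRequest) -> bool:
    """Apply the write the way B's RNIC would: check key and bounds, then copy."""
    offset = wr.remote_addr - mr.base_addr
    in_bounds = 0 <= offset and offset + len(wr.payload) <= len(mr.data)
    if wr.rkey != mr.rkey or not in_bounds:
        return False  # a real RNIC would reject the request with a NAK
    mr.data[offset:offset + len(wr.payload)] = wr.payload
    return True


region = ToyMemoryRegion(base_addr=0x1000, rkey=0xBEEF, data=bytearray(16))
ok = toy_rdma_write(region, ToyWriteRequest(b"hello", 0x1004, 0xBEEF))
bad = toy_rdma_write(region, ToyWriteRequest(b"hello", 0x1004, 0xDEAD))
print(ok, bad, bytes(region.data))
```

Note that nothing in the model runs "on Host B": the buffer simply changes, which is the one-sided behaviour described above.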
RDMA READ — "fetch from remote memory". Host A tells its RNIC: "go get the data at address X on Host B and drop it in my local buffer". Two phases — request goes out, data comes back. Still one-sided; B's CPU never wakes up.
Step 1 — App posts a READ Work Request
The application calls ibv_post_send() with opcode=READ, the remote VA + R_Key (where to read from on Host B), the local buffer address + L_Key (where to deposit the data), and the length. The WR drops into the Send Queue from user space — no kernel involved.
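READ vs WRITE: WRITE is push (A pushes into B). READ is pull (A pulls from B). Both are one-sided. The difference is latency: WRITE data lands after a single one-way trip, while READ needs a full round trip before data arrives, because the request goes out and then the data comes back.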
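The latency difference can be put into a small timing model. It assumes a fixed one-way delay and ignores serialization and RNIC processing time; the 2 µs figure is an arbitrary example value.

```python
def write_data_arrival_us(one_way_us: float) -> float:
    """WRITE: data travels A -> B, so it lands after one one-way delay."""
    return one_way_us


def read_data_arrival_us(one_way_us: float) -> float:
    """READ: the request travels A -> B, then the data travels B -> A."""
    return 2 * one_way_us


one_way = 2.0  # microseconds, example value only
print("WRITE data lands at B after", write_data_arrival_us(one_way), "us")
print("READ data lands at A after", read_data_arrival_us(one_way), "us")
```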
SEND / RECV — "two-sided, like UDP". Unlike WRITE/READ, SEND requires the remote side to have posted a RECV buffer first. It's two-sided — both CPUs are involved on every operation, but there's no R_Key dance.
Step 1 — Host B posts a RECV WR first
Receive-side setup happens before the SEND ever arrives. Application B calls ibv_post_recv() with a buffer + L_Key. The WR goes into B's Receive Queue, waiting to be matched against an incoming SEND.
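The matching rule can be sketched as a toy queue. The class, method names, and status strings are hypothetical; the model only shows that an incoming SEND consumes the oldest posted RECV buffer, and that a SEND arriving with nothing posted finds the receiver not ready.

```python
from collections import deque


class ToyReceiveQueue:
    """Toy model of Host B's Receive Queue; not a real verbs object."""

    def __init__(self) -> None:
        self._posted = deque()

    def post_recv(self, buffer: bytearray) -> None:
        """Application B hands a buffer to the queue (the ibv_post_recv step)."""
        self._posted.append(buffer)

    def deliver_send(self, payload: bytes) -> str:
        """An incoming SEND consumes the oldest posted buffer, if there is one."""
        if not self._posted:
            return "RNR"  # receiver not ready: nothing was posted
        buffer = self._posted.popleft()
        if len(payload) > len(buffer):
            return "TOO_LONG"  # posted buffer cannot hold the message
        buffer[:len(payload)] = payload
        return "OK"


rq = ToyReceiveQueue()
print(rq.deliver_send(b"ping"))  # no RECV posted yet
buf = bytearray(8)
rq.post_recv(buf)
print(rq.deliver_send(b"ping"), bytes(buf[:4]))
```

In practice this is why receivers usually keep several RECV buffers posted ahead of time and replenish them as completions arrive.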
When to use SEND vs WRITE: SEND for small control messages or when you'd rather skip exchanging remote addresses and R_Keys (the buffers on both sides still have to be registered). WRITE/READ for bulk data, which is faster because B's CPU never wakes up.
Queue Pairs and reliability
Every RDMA connection uses a Queue Pair (QP): a Send Queue (SQ) and a Receive Queue (RQ). Instead of sockets, you post Work Requests into the SQ. The NIC processes them asynchronously, and the results show up in a Completion Queue (CQ).
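The post-then-poll pattern can be modelled as below. This is a toy, with hypothetical names: in real verbs the CQ is a separate object that several QPs may share, whereas here it is folded into one class for brevity, and the "NIC" is advanced by hand.

```python
from collections import deque


class ToyQueuePair:
    """Toy Send Queue plus Completion Queue; names are illustrative only."""

    def __init__(self) -> None:
        self.send_queue = deque()
        self.completion_queue = deque()

    def post_send(self, wr_id: int, opcode: str) -> None:
        """Returns immediately; the work has only been queued."""
        self.send_queue.append((wr_id, opcode))

    def nic_process_one(self) -> None:
        """Stand-in for the RNIC finishing the oldest Work Request."""
        if self.send_queue:
            wr_id, _opcode = self.send_queue.popleft()
            self.completion_queue.append((wr_id, "SUCCESS"))

    def poll_cq(self, max_entries: int) -> list:
        """Non-blocking: return up to max_entries completions, possibly none."""
        done = []
        while self.completion_queue and len(done) < max_entries:
            done.append(self.completion_queue.popleft())
        return done


qp = ToyQueuePair()
qp.post_send(1, "WRITE")
qp.post_send(2, "SEND")
print(qp.poll_cq(8))  # nothing has completed yet
qp.nic_process_one()
qp.nic_process_one()
print(qp.poll_cq(8))
```

The wr_id is what lets the application tie a completion back to the request it posted earlier.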
QP types — pick your reliability level
Like TCP vs UDP, RDMA lets you choose how much reliability you need. You set this when you create the QP.
RC — Reliable Connection
The workhorse. Ordered, reliable, ACKed — think TCP for RDMA. Almost every production workload (MPI, storage, ML training) runs on RC. Supports WRITE, READ, and SEND/RECV. Trade-off: N² QPs for full N-to-N mesh.
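The scaling trade-off is easy to quantify. The sketch below assumes one RC QP per peer and one process per endpoint, which is a simplification; the function names are made up for illustration.

```python
def rc_qps_per_process(n_processes: int) -> int:
    """Each process needs one RC QP for every other process."""
    return n_processes - 1


def rc_qps_total(n_processes: int) -> int:
    """QP endpoints across the whole mesh: N * (N - 1), which grows as N squared."""
    return n_processes * rc_qps_per_process(n_processes)


for n in (8, 64, 1024):
    print(n, rc_qps_per_process(n), rc_qps_total(n))
```

Each QP consumes RNIC and host memory for its queues and connection state, so the quadratic growth is what matters at scale.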
How RDMA handles ACKs
In RC mode, every packet carries a PSN (Packet Sequence Number), tracked entirely in hardware. No kernel timers, no TCP stack involvement. The receiving RNIC sends back an ACK packet carrying an AETH (ACK Extended Transport Header). On a NACK or an ACK timeout, the sender RNIC retransmits.
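The receiver-side PSN check can be illustrated with a toy model. It assumes 24-bit PSNs that wrap around and treats anything in the half of the PSN space behind the expected value as a duplicate; the class name and return strings are hypothetical, and a real RNIC does all of this in hardware with more cases than shown here.

```python
PSN_MODULUS = 1 << 24  # PSNs are 24-bit and wrap around


class ToyRcResponder:
    """Toy receiver-side PSN check; real RNICs do this in hardware."""

    def __init__(self, first_psn: int) -> None:
        self.expected_psn = first_psn % PSN_MODULUS

    def on_packet(self, psn: int) -> str:
        if psn == self.expected_psn:
            self.expected_psn = (self.expected_psn + 1) % PSN_MODULUS
            return "ACK"
        # How far ahead of the expected PSN is this packet, modulo 2**24?
        ahead = (psn - self.expected_psn) % PSN_MODULUS
        if ahead < PSN_MODULUS // 2:
            return "NAK"  # gap: an earlier packet was lost
        return "DUPLICATE"  # already received; do not advance


rx = ToyRcResponder(first_psn=PSN_MODULUS - 1)
print([rx.on_packet(p) for p in (PSN_MODULUS - 1, 0, 2, 1, 0)])
```

The example starts just before the wrap point to show that the comparison has to be done modulo 2^24 rather than with a plain less-than.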
Key headers in an RDMA packet
| Header | What it carries |
|---|---|
| LRH | Local Routing Header: used by switches for forwarding within an IB subnet. Has credit-based flow control built in. |
| BTH | Base Transport Header: PSN, QP number, opcode (WRITE/READ/ACK etc.). |
| RETH | RDMA Extended Transport Header: remote VA and R_Key (WRITE/READ packets only). |
| AETH | ACK Extended Transport Header: in ACK/NACK packets. Carries the syndrome (success/NAK code) and MSN. |
| ICRC | Invariant CRC: detects bit errors. Corrupted packets are dropped, and the sender retransmits after a NACK or a timeout. |
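To make the BTH and RETH rows concrete, the sketch below packs the two headers for a single-packet RC RDMA WRITE. The field widths follow the standard InfiniBand layout (12-byte BTH, 16-byte RETH); the helper names and sample values are made up for illustration, and the SE, M, PadCnt and TVer bits are simply left at zero.

```python
import struct

RC_RDMA_WRITE_ONLY = 0x0A  # BTH opcode for a single-packet RC RDMA WRITE


def pack_bth(opcode: int, dest_qp: int, psn: int, ack_req: bool = False,
             pkey: int = 0xFFFF) -> bytes:
    """Pack a 12-byte Base Transport Header (flag bits left at zero)."""
    flags = 0  # SE, M, PadCnt, TVer all zero in this sketch
    word_qp = dest_qp & 0xFFFFFF  # top 8 bits of this word are reserved
    word_psn = (int(ack_req) << 31) | (psn & 0xFFFFFF)  # AckReq bit + 24-bit PSN
    return struct.pack("!BBHII", opcode, flags, pkey, word_qp, word_psn)


def pack_reth(remote_va: int, rkey: int, dma_length: int) -> bytes:
    """Pack a 16-byte RETH: 64-bit VA, 32-bit R_Key, 32-bit DMA length."""
    return struct.pack("!QII", remote_va, rkey, dma_length)


headers = pack_bth(RC_RDMA_WRITE_ONLY, dest_qp=0x000123, psn=42, ack_req=True)
headers += pack_reth(0x7F3A_1000_0000, 0x1234_ABCD, 4096)
print(len(headers), headers.hex())
```

A real packet would carry routing headers in front of these and the payload plus ICRC after them; those parts are omitted here.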
RoCE vs InfiniBand: RoCE v2 runs the same InfiniBand transport layer (BTH, RETH, AETH) over UDP/IP/Ethernet. The reliability semantics are identical — only the physical network and routing headers differ.
RDMA in one mental model
If you know TCP, here's the mapping:
| TCP concept | RDMA equivalent |
|---|---|
| Socket | Queue Pair (QP) |
| connect() | ibv_create_qp() + CM setup |
| send() / recv() | ibv_post_send() / ibv_post_recv() (SEND/RECV) |
| — | ibv_post_send(WRITE) — remote CPU never wakes up |
| — | ibv_post_send(READ) — pull from remote memory silently |
| TCP sequence numbers | PSN (in the BTH header) |
| TCP ACK / retransmit | AETH header, all in RNIC hardware |
| Socket receive buffer | Registered memory region (ibv_reg_mr) |
| TCP port pair | QP number (24-bit) |
| epoll / completion | Completion Queue (CQ): poll or event |
The shift to internalise: TCP is stream-oriented. You hand bytes to the OS, and it delivers them into the peer's receive buffer. RDMA is memory-oriented. You give the NIC addresses, and it moves data between those addresses, bypassing everything else.