Verbs, Queue Pairs, Memory Regions
You know sockets. socket(), bind(), listen(), connect(), send(), recv(). Connection-oriented, byte-stream, synchronous(-ish), kernel-mediated.
Verbs is the RDMA equivalent. It's the API every RDMA app programs against — same on InfiniBand, RoCE v1, RoCE v2, and iWARP. The vocabulary is different, the mental model is different, and the kernel mostly isn't in the path. Here's the translation.
- Translate sockets into verbs: file descriptor → Queue Pair, send()/recv() → Work Request, socket buffer → Memory Region, returned bytes → CQE polled off a CQ.
- Walk one message through six steps (post WR → NIC reads WR → DMA-read MR → transmit → remote DMA-write → CQE), and see how WRITE skips step 6 while READ reverses direction.
- Address an RDMA endpoint: (GID, QP num) as the L3/L4 analog, pick the right QP type (RC dominates AI training), and trace QP bring-up RESET → INIT → RTR → RTS.
- Explain memory registration: pin pages, IOMMU access, the lkey/rkey split, why it's microseconds-slow so you register once and reuse, and why the rkey is the access control.
The translation table
| Sockets concept | Verbs equivalent |
|---|---|
| File descriptor | Queue Pair (QP) — represents a "connection" |
| send() / recv() | Work Request (WR) posted to a queue, processed by the NIC asynchronously |
| Socket buffer (in kernel) | Memory Region (MR) — registered in user space, NIC has direct DMA access |
| recv() returning bytes | Completion Queue Entry (CQE) posted to a Completion Queue (CQ) |
| TCP connection (3-way handshake, sequence numbers) | RC (Reliable Connection) mode QP, established via out-of-band exchange |
| UDP socket | UD (Unreliable Datagram) mode QP |
A sockets app calls send() and waits for the kernel to do the work. A verbs app posts a work request to a queue and polls a completion queue later for the result. The NIC does the actual work in between. Asynchronous by design.
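The post-then-poll pattern can be sketched with a toy model. This is not the libibverbs API: every name below (toy_qp, toy_post_send, toy_nic_process, toy_poll_cq, QDEPTH) is hypothetical, and the "NIC" is just a function the test harness calls. The only point is that posting returns before any completion exists, and the result shows up later on a separate queue.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QDEPTH 8  /* toy queue depth (assumption, power of two) */

/* Toy work request and completion entry; real ones carry far more. */
struct toy_wr  { uint64_t wr_id; uint32_t length; };
struct toy_cqe { uint64_t wr_id; uint32_t byte_len; };

struct toy_qp {
    struct toy_wr  sq[QDEPTH]; size_t sq_head, sq_tail;  /* send queue */
    struct toy_cqe cq[QDEPTH]; size_t cq_head, cq_tail;  /* completion queue */
};

/* App side: enqueue a WR and return immediately. 0 on success, -1 if full. */
int toy_post_send(struct toy_qp *qp, struct toy_wr wr) {
    assert(qp != NULL);
    if (qp->sq_tail - qp->sq_head == QDEPTH) return -1;
    qp->sq[qp->sq_tail++ % QDEPTH] = wr;
    return 0;
}

/* "NIC" side: drain the send queue, emit one CQE per WR. Returns the count. */
int toy_nic_process(struct toy_qp *qp) {
    int done = 0;
    while (qp->sq_head != qp->sq_tail && qp->cq_tail - qp->cq_head < QDEPTH) {
        struct toy_wr wr = qp->sq[qp->sq_head++ % QDEPTH];
        qp->cq[qp->cq_tail++ % QDEPTH] = (struct toy_cqe){ wr.wr_id, wr.length };
        done++;
    }
    return done;
}

/* App side: non-blocking poll. Returns 1 and fills *out, or 0 if the CQ is empty. */
int toy_poll_cq(struct toy_qp *qp, struct toy_cqe *out) {
    if (qp->cq_head == qp->cq_tail) return 0;
    *out = qp->cq[qp->cq_head++ % QDEPTH];
    return 1;
}
```

Polling right after toy_post_send finds nothing; the CQE appears only once the NIC side has run. That gap is what "asynchronous by design" means in practice.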
How one message moves
The full sequence, shown here for a two-sided SEND:
1. App posts a Work Request to a Send Queue, then rings the NIC's doorbell with one MMIO write (the doorbell is a memory-mapped register the NIC watches). The WR includes: which Memory Region holds the payload, how big it is, what operation (SEND / READ / WRITE), and, for READ and WRITE only, which remote address + rkey to target.
2. NIC reads the WR from the send queue. It's a small structure (~64 bytes).
3. NIC DMA-reads the payload from the Memory Region. The CPU never copies the data.
4. NIC segments and transmits on the wire (RDMA-over-IB or RoCE v2 over UDP/IP, depending on transport).
5. Remote NIC receives the packets, reassembles, DMA-writes directly into the receiver's Memory Region. For a SEND, the target is a buffer the receiver named in a receive WR it posted beforehand.
6. Remote NIC posts a CQE to the Completion Queue. The receiving app polls the CQ and discovers the message arrived.
The CPU's involvement: post the WR (step 1), and optionally poll the CQ (after step 6). Everything in between is the NIC and PCIe.
How one-sided ops differ from the above:
- RDMA WRITE — same steps 1–5, but step 6 is skipped. The remote NIC writes the payload to memory and returns a hardware ACK to the sender; no CQE is posted on the receiver. The receiving app doesn't know the WRITE happened (that's the "Remote CPU not involved" property). Only the sender's CQE fires.
- RDMA READ — direction reverses. The local NIC sends a small READ request; the remote NIC DMA-reads its own Memory Region and sends the bytes back. The local NIC DMA-writes into local memory and posts a local CQE. Remote side never posts anything.
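The differences between the three operations can be condensed into a small lookup, sketched here as a toy model. The enum and struct names are hypothetical (they are not the libibverbs opcodes), and the sketch assumes signaled work requests on an RC QP, so the side that posted the WR always gets a CQE.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy opcodes; illustrative names, not the libibverbs enum. */
enum op_model { OP_SEND, OP_RDMA_WRITE, OP_RDMA_READ };

struct op_effect {
    bool requester_cqe;         /* CQE on the side that posted the WR */
    bool responder_cqe;         /* CQE on the remote side */
    bool payload_to_responder;  /* true: payload flows requester -> responder */
};

struct op_effect op_effect_of(enum op_model op) {
    assert(op == OP_SEND || op == OP_RDMA_WRITE || op == OP_RDMA_READ);
    switch (op) {
    case OP_SEND:       return (struct op_effect){ true, true,  true  };
    case OP_RDMA_WRITE: return (struct op_effect){ true, false, true  };
    default:            return (struct op_effect){ true, false, false }; /* OP_RDMA_READ */
    }
}
```

Only SEND produces a completion on the remote side. WRITE and READ differ from each other only in which way the payload travels.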
Queue Pairs — the "connection"
A QP is a pair of queues: Send Queue (SQ) and Receive Queue (RQ). The "pair" part is because every QP has both — even if you never use one. A QP on Host A pairs with a QP on Host B via an out-of-band exchange (typically over TCP) of their QP numbers, PSNs, and GIDs.
What's a GID? The Global Identifier is the RDMA L3 address: a 128-bit value (same shape as an IPv6 address). On RoCE v2 it's derived from the interface's IPv4 or IPv6 address (RoCE v2 carries a UDP/IP header on the wire, so the GID is the IP). On InfiniBand it combines a subnet prefix assigned by the Subnet Manager with the port's GUID. The QP number is the L4 analog, closest to a TCP/UDP port. Together, (GID, QP num) uniquely identifies one endpoint of an RDMA connection.
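For an IPv4 interface on RoCE v2, the GID normally takes the IPv4-mapped IPv6 form, ::ffff:a.b.c.d. Here is a minimal sketch of that mapping; gid_from_ipv4 is a hypothetical helper written for this page, not a library call, and it works on raw bytes rather than any verbs type.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Build the 16-byte GID for an IPv4 address given as four octets.
 * Layout: 80 zero bits, 16 one bits, then the 32-bit IPv4 address
 * (the IPv4-mapped IPv6 form, ::ffff:a.b.c.d). */
void gid_from_ipv4(const uint8_t ip[4], uint8_t gid[16]) {
    assert(ip != NULL && gid != NULL);
    memset(gid, 0, 10);       /* bytes 0..9: zero */
    gid[10] = 0xff;           /* bytes 10..11: 0xffff marker */
    gid[11] = 0xff;
    memcpy(&gid[12], ip, 4);  /* bytes 12..15: the IPv4 address */
}
```

So 192.168.1.10 becomes a GID whose last four bytes are c0 a8 01 0a, which is why a RoCE v2 GID table reads like a list of the host's IP addresses.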
QPs come in flavors:
| QP type | Reliable? | Connection-oriented? | Used for |
|---|---|---|---|
| RC (Reliable Connection) | Yes | Yes (1:1) | Dominant in AI training. Most NCCL traffic. |
| UC (Unreliable Connection) | No | Yes (1:1) | Rare. Mostly research. |
| UD (Unreliable Datagram) | No | No (1:N) | Multicast, some HPC patterns. |
| RD (Reliable Datagram) | Yes | No (1:N) | InfiniBand only. Niche. |
| XRC (eXtended RC) | Yes | Shared receive | Memory-efficient at scale. Some HPC. |
For AI training fabrics, RC is what you'll see. NCCL opens an RC QP between every pair of GPUs that need to talk. With 8 GPUs per server and 1,000 servers, that's a lot of QPs — but ConnectX-7 supports millions per NIC, so it scales.
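Each of those RC QPs has to be walked through the bring-up sequence RESET → INIT → RTR → RTS before it carries traffic. RTR stands for Ready To Receive and RTS for Ready To Send. The state machine can be sketched as a toy model; the names below are hypothetical, and the real state machine has additional states (such as send-queue drain) that this sketch leaves out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum qp_state_model { QPS_RESET, QPS_INIT, QPS_RTR, QPS_RTS, QPS_ERR };

struct qp_model { enum qp_state_model state; };

/* Allow only the forward bring-up steps, plus a drop to ERR or back to
 * RESET from any state. Returns false and leaves the state unchanged
 * if the transition is not allowed. */
bool qp_model_modify(struct qp_model *qp, enum qp_state_model next) {
    assert(qp != NULL);
    bool ok = next == QPS_RESET || next == QPS_ERR ||
              (qp->state == QPS_RESET && next == QPS_INIT) ||
              (qp->state == QPS_INIT  && next == QPS_RTR)  ||
              (qp->state == QPS_RTR   && next == QPS_RTS);
    if (ok) qp->state = next;
    return ok;
}

/* Incoming packets are processed from RTR onward. */
bool qp_model_can_receive(const struct qp_model *qp) {
    return qp->state == QPS_RTR || qp->state == QPS_RTS;
}

/* Send WRs are processed only in RTS. */
bool qp_model_can_send(const struct qp_model *qp) {
    return qp->state == QPS_RTS;
}
```

The order matters: a QP can't jump from RESET straight to RTS, and a QP sitting in RTR can accept traffic but can't transmit yet.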
Memory Regions — the data path
Before a NIC can DMA into your memory, it has to know the memory is there and is allowed to be touched. This is memory registration.
When you register a memory region, three things happen:
- The kernel pins the pages (they can't be swapped out)
- The IOMMU is told the NIC can DMA-access these physical addresses
- The NIC returns two keys: lkey (local: used in your own send and receive WRs to reference local buffers) and rkey (remote: given to the other side so they can RDMA-READ or RDMA-WRITE into your buffer)
Registration is slow (microseconds to milliseconds, depending on size). The app does it once at setup time, then reuses the MR for millions of operations. Pre-registering everything you'll ever touch is a standard pattern. NCCL pre-registers GPU HBM at job start; you never see a registration on the hot path.
The rkey is the access control. Whoever has your rkey can read/write that buffer remotely. Lose control of the rkey and you've handed someone direct memory access. (In practice, rkeys are short-lived and exchanged inside the trusted job.)
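What "the rkey is the access control" means on the responder side can be sketched as a toy check: the key has to match, the operation has to be one the owner allowed at registration, and the requested range has to stay inside the registered region. Everything here is hypothetical (mr_model, the flag names, the key_seed parameter); in real hardware the NIC picks the keys and performs this check itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { MR_REMOTE_READ = 1, MR_REMOTE_WRITE = 2 };  /* toy access flags */

struct mr_model {
    uintptr_t base;    /* start of the registered range */
    size_t    length;  /* size of the registered range in bytes */
    uint32_t  lkey;
    uint32_t  rkey;
    int       access;  /* which remote operations the owner allowed */
};

/* Toy "registration": record the range and hand out keys.
 * Real registration also pins the pages and sets up DMA access. */
struct mr_model mr_model_register(void *buf, size_t len, int access,
                                  uint32_t key_seed) {
    assert(buf != NULL && len > 0);
    return (struct mr_model){ (uintptr_t)buf, len, key_seed, key_seed + 1, access };
}

/* Responder-side check before touching memory on a remote request. */
bool mr_model_remote_ok(const struct mr_model *mr, uint32_t rkey,
                        uintptr_t addr, size_t len, int needed) {
    if (rkey != mr->rkey) return false;                 /* wrong key */
    if ((mr->access & needed) != needed) return false;  /* op not permitted */
    if (addr < mr->base) return false;                  /* below the range */
    size_t off = addr - mr->base;
    return off <= mr->length && len <= mr->length - off; /* stays inside */
}
```

A request with the right rkey, an allowed operation, and an in-bounds range goes straight to memory with no software on the remote host in the loop, which is why the key has to stay inside the trusted job.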
A complete worked example — RDMA WRITE
To send "hello world" from Host A to Host B with an RDMA WRITE:
- Both sides at setup:
  - Register an MR covering the buffer to write from (Host A) and the buffer to write into (Host B). Host B's registration must allow remote write.
  - Exchange QP numbers, GIDs, and PSNs; Host B also sends Host A its remote address + rkey, typically via librdmacm (covered on the next page) or a hand-rolled TCP socket
  - Transition both QPs through INIT → RTR → RTS states
- Host A posts the WR: ibv_post_send(qp, &wr, &bad_wr); The WR specifies: opcode = IBV_WR_RDMA_WRITE, local addr + lkey, remote addr + rkey, length.
- Host A NIC fires. Reads the WR, DMA-reads the payload from the local MR, sends RDMA WRITE packets on the wire. Posts a CQE on Host A's CQ once Host B's NIC has acknowledged the write.
- Host B NIC receives. Validates rkey. DMA-writes the payload into Host B's MR at the specified address. No CQE on Host B's CQ: it's a one-sided operation, so Host B's app doesn't know it happened.
- Host A polls its CQ. Sees the completion. Considers the WRITE successful.
That's the entire thing. No send() on Host A. No recv() on Host B. The wire was talking to memory.
💡 What you should remember
| # | | Concept | Why it matters |
|---|---|---|---|
| 1 | 🔗 | QP = the connection (file descriptor analog) | RC is the dominant type for AI training. |
| 2 | 📍 | MR = registered memory the NIC has direct DMA access to | Register once at setup; reuse forever. |
| 3 | 📨 | WR = the unit of work | App posts to a queue; NIC processes asynchronously. |
| 4 | ✅ | CQE = the completion event | App polls the CQ to find out a WR finished. |
| 5 | 🌐 | GID = the RDMA L3 address (IPv6-shaped) | QP number = the L4 port analog. (GID, QP num) identifies an endpoint. |
| 6 | 🔑 | One-sided RDMA (READ / WRITE) requires an rkey the remote side gave you | Whoever has the rkey has direct memory access. |
| 7 | 🚪 | The hot path has no kernel involvement | The doorbell ring is an MMIO write issued directly from user space, with no system call, and it's a single 64-bit write. |
| 8 | 🔄 | QP setup goes through states — RESET → INIT → RTR → RTS | Stalls at INIT usually mean wrong peer GID/QPN. |
Next: RDMA in Production — Reliability, Setup, Errors → — how RC reliability works in hardware, what librdmacm events you'll see in logs, the WR features that show up in NCCL traces, and the completion error codes operators triage.