
Verbs, Queue Pairs, Memory Regions

You know sockets. socket(), bind(), listen(), connect(), send(), recv(). Connection-oriented, byte-stream, synchronous(-ish), kernel-mediated.

Verbs is the RDMA equivalent. It's the API every RDMA app programs against — same on InfiniBand, RoCE v1, RoCE v2, and iWARP. The vocabulary is different, the mental model is different, and the kernel mostly isn't in the path. Here's the translation.

After this page, you'll be able to
  1. Translate sockets into verbs: file descriptor → Queue Pair, send()/recv() → Work Request, socket buffer → Memory Region, returned bytes → CQE polled off a CQ.
  2. Walk one message through six steps (post WR → NIC reads WR → DMA-read MR → transmit → remote DMA-write → CQE) and explain how WRITE skips step 6 while READ reverses direction.
  3. Address an RDMA endpoint with (GID, QP num) as the L3/L4 analog, pick the right QP type (RC dominates AI training), and trace QP bring-up RESET → INIT → RTR → RTS.
  4. Explain memory registration: pinned pages, IOMMU access, the lkey/rkey split, why it's slow enough that you register once and reuse, and why the rkey is the access control.

The translation table

| Sockets concept | Verbs equivalent |
| --- | --- |
| File descriptor | Queue Pair (QP): represents a "connection" |
| send() / recv() | Work Request (WR) posted to a queue, processed by the NIC asynchronously |
| Socket buffer (in kernel) | Memory Region (MR): registered in user space, NIC has direct DMA access |
| recv() returning bytes | Completion Queue Entry (CQE) posted to a Completion Queue (CQ) |
| TCP connection (3-way handshake, sequence numbers) | RC (Reliable Connection) mode QP, established via out-of-band exchange |
| UDP socket | UD (Unreliable Datagram) mode QP |

A sockets app calls send() and waits for the kernel to do the work. A verbs app posts a work request to a queue and polls a completion queue later for the result. The NIC does the actual work in between. Asynchronous by design.
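The post-then-poll pattern can be sketched with two small ring buffers. This is a toy model in plain C, not libibverbs: every `toy_` name is hypothetical, and `toy_nic_process_one` stands in for work the real NIC does in hardware. In a real verbs program the two app-side calls correspond roughly to `ibv_post_send` and `ibv_poll_cq`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QUEUE_DEPTH 8

/* Hypothetical stand-ins for illustration; these are not libibverbs types. */
typedef struct { uint64_t wr_id; uint32_t length; } toy_wr;
typedef struct { uint64_t wr_id; int status; } toy_cqe;

typedef struct {
    toy_wr  sq[QUEUE_DEPTH];   /* send queue ring */
    size_t  sq_head, sq_tail;
    toy_cqe cq[QUEUE_DEPTH];   /* completion queue ring */
    size_t  cq_head, cq_tail;
} toy_qp;

/* App side: enqueue a work request and return immediately. */
int toy_post_send(toy_qp *qp, toy_wr wr) {
    if (qp->sq_tail - qp->sq_head == QUEUE_DEPTH) return -1;  /* SQ full */
    qp->sq[qp->sq_tail % QUEUE_DEPTH] = wr;
    qp->sq_tail++;
    return 0;
}

/* "NIC" side: consume one WR and report it as completed.
 * Returns 1 if a WR was processed, 0 if the send queue was empty. */
int toy_nic_process_one(toy_qp *qp) {
    if (qp->sq_head == qp->sq_tail) return 0;
    /* A completion queue overrun is treated as fatal in this toy. */
    assert(qp->cq_tail - qp->cq_head < QUEUE_DEPTH);
    toy_wr wr = qp->sq[qp->sq_head % QUEUE_DEPTH];
    qp->sq_head++;
    qp->cq[qp->cq_tail % QUEUE_DEPTH] = (toy_cqe){ .wr_id = wr.wr_id, .status = 0 };
    qp->cq_tail++;
    return 1;
}

/* App side: non-blocking poll. Returns 1 and fills *out if a CQE is ready. */
int toy_poll_cq(toy_qp *qp, toy_cqe *out) {
    if (qp->cq_head == qp->cq_tail) return 0;
    *out = qp->cq[qp->cq_head % QUEUE_DEPTH];
    qp->cq_head++;
    return 1;
}
```

Note that `toy_post_send` returns before any work happens, and a poll issued right after it finds nothing. The completion only appears once the "NIC" has drained the send queue, which is the asynchronous gap the paragraph above describes.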


How one message moves

RDMA work request flow showing six steps. Host A sender: (1) app posts work request to the send queue, (2) NIC reads the WR, (3) NIC DMA-reads the payload from the registered memory region, (4) NIC sends across the wire. Host B receiver: (5) NIC DMA-writes the payload directly into the receiver's registered memory region, (6) a completion queue entry is posted for the app to poll.
One RDMA message, six steps. The CPU does only step 1 — post the work request. Everything else is the NIC.

The full sequence — shown here for a two-sided SEND (the diagram's default case):

  1. App posts a Work Request to a Send Queue (one MMIO write to the NIC's doorbell, a memory-mapped register the NIC watches). The WR includes: which Memory Region holds the payload, how big it is, what operation (SEND / READ / WRITE), and, for one-sided READ / WRITE only, which remote address + rkey to target.
  2. NIC reads the WR from the send queue. It's a small structure (~64 bytes).
  3. NIC DMA-reads the payload from the Memory Region. The data never touches CPU caches.
  4. NIC segments and transmits on the wire (RDMA-over-IB or RoCE v2 over UDP/IP, depending on transport).
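  5. Remote NIC receives the packets, reassembles, and DMA-writes directly into the receiver's Memory Region. For a SEND, the destination is a buffer the receiver named in a receive WR it posted to its Receive Queue beforehand.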
  6. Remote NIC posts a CQE to the Completion Queue. The receiving app polls the CQ and discovers the message arrived.

The CPU's involvement: post the WR (step 1), and optionally poll the CQ (after step 6). Everything in between is the NIC and PCIe.

How one-sided ops differ from the above:

  • RDMA WRITE — same steps 1–5, but step 6 is skipped. The remote NIC writes the payload to memory and returns a hardware ACK to the sender; no CQE is posted on the receiver. The receiving app doesn't know the WRITE happened (that's the "Remote CPU not involved" property). Only the sender's CQE fires.
  • RDMA READ — direction reverses. The local NIC sends a small READ request; the remote NIC DMA-reads its own Memory Region and sends the bytes back. The local NIC DMA-writes into local memory and posts a local CQE. Remote side never posts anything.
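The differences between the three operations can be summarized as a small lookup, sketched below. The types and names are hypothetical (not libibverbs), and the requester-side completion for SEND assumes the send WR was posted as signaled, which the text above does not spell out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical opcodes for illustration; not the libibverbs enum. */
typedef enum { OP_SEND, OP_RDMA_WRITE, OP_RDMA_READ } toy_opcode;

typedef struct {
    bool requester_cqe;         /* CQE on the side that posted the WR */
    bool responder_cqe;         /* CQE on the remote side */
    bool payload_to_responder;  /* true: requester to responder; false: reverse */
} toy_outcome;

/* Who learns about the operation, and which way the payload flows. */
toy_outcome toy_describe(toy_opcode op) {
    toy_outcome o = { false, false, true };
    switch (op) {
    case OP_SEND:        /* two-sided: receiver gets the step-6 CQE */
        o.requester_cqe = true;   /* assumes a signaled send WR */
        o.responder_cqe = true;
        o.payload_to_responder = true;
        break;
    case OP_RDMA_WRITE:  /* step 6 skipped: remote app is not notified */
        o.requester_cqe = true;
        o.responder_cqe = false;
        o.payload_to_responder = true;
        break;
    case OP_RDMA_READ:   /* direction reverses: bytes come back to the requester */
        o.requester_cqe = true;
        o.responder_cqe = false;
        o.payload_to_responder = false;
        break;
    default:
        assert(0 && "unknown opcode");
    }
    return o;
}
```

For example, `toy_describe(OP_RDMA_WRITE).responder_cqe` is `false`, which is the "Remote CPU not involved" property in table form.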

Queue Pairs — the "connection"

A QP is a pair of queues: Send Queue (SQ) and Receive Queue (RQ). The "pair" part is because every QP has both — even if you never use one. A QP on Host A pairs with a QP on Host B via an out-of-band exchange (typically over TCP) of their QP numbers, PSNs, and GIDs.

What's a GID? The Global Identifier is the RDMA L3 address: a 128-bit value with the same shape as an IPv6 address. On RoCE v2 it's derived from the interface's IPv4 or IPv6 address (RoCE v2 carries a UDP/IP header on the wire, so the GID is the IP; an IPv4 address appears in IPv4-mapped form, ::ffff:a.b.c.d). On InfiniBand it's assigned by the Subnet Manager. The QP number is the L4 analog, closest to a TCP/UDP port. Together, (GID, QP num) uniquely identifies one endpoint of an RDMA connection.

Side-by-side comparison of TCP/IP and RDMA endpoint addressing. Left panel: TCP/IP endpoint = 32-bit IP address (e.g. 10.0.0.5) plus 16-bit port (e.g. 8080). Right panel: RDMA endpoint = 128-bit GID (IPv6-shaped, e.g. fe80::200:0a:fe5b:0c01) plus 24-bit QP number. On RoCE v2 the GID is the interface IP. On InfiniBand the GID is assigned by the Subnet Manager. QP number is per-NIC, not global.
Same shape as TCP — the GID gets the packet to the NIC, the QP num picks the right Queue Pair on that NIC. No kernel involved in routing.
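As a rough illustration of the addressing above, the sketch below builds a RoCE v2 style GID from an IPv4 address using the IPv4-mapped IPv6 layout (::ffff:a.b.c.d) and pairs it with a 24-bit QP number. The `toy_rdma_endpoint` struct and the `toy_` helpers are hypothetical names for this example; a real application would read GIDs from the device's GID table rather than compute them.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical endpoint record for illustration; not a libibverbs type. */
typedef struct {
    uint8_t  gid[16];  /* 128-bit GID, network byte order */
    uint32_t qpn;      /* only the low 24 bits are meaningful */
} toy_rdma_endpoint;

/* Build the IPv4-mapped form ::ffff:a.b.c.d from dotted-quad text.
 * Returns 0 on success, -1 if the text is not a valid IPv4 address. */
int toy_gid_from_ipv4(const char *dotted, uint8_t gid[16]) {
    struct in_addr v4;
    if (inet_pton(AF_INET, dotted, &v4) != 1) return -1;
    memset(gid, 0, 16);
    gid[10] = 0xff;
    gid[11] = 0xff;
    memcpy(&gid[12], &v4.s_addr, 4);  /* s_addr is already in network order */
    return 0;
}

/* Combine the L3 part (GID) with the L4 part (QP number). */
int toy_make_endpoint(const char *dotted, uint32_t qpn, toy_rdma_endpoint *ep) {
    if (qpn > 0xFFFFFF) return -1;    /* QP numbers are 24 bits wide */
    ep->qpn = qpn;
    return toy_gid_from_ipv4(dotted, ep->gid);
}

/* Render a GID as IPv6 text; buf must hold at least INET6_ADDRSTRLEN bytes. */
const char *toy_gid_to_text(const uint8_t gid[16], char *buf, size_t len) {
    assert(len >= INET6_ADDRSTRLEN);
    return inet_ntop(AF_INET6, gid, buf, (socklen_t)len);
}
```

Calling `toy_make_endpoint("10.0.0.5", 0x1A2B3C, &ep)` fills `ep.gid` with ten zero bytes, two `0xff` bytes, and then the four address bytes, while a QP number wider than 24 bits is rejected.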

QPs come in flavors:

| QP type | Reliable? | Connection-oriented? | Used for |
| --- | --- | --- | --- |
| RC (Reliable Connection) | Yes | Yes (1:1) | Dominant in AI training. Most NCCL traffic. |
| UC (Unreliable Connection) | No | Yes (1:1) | Rare. Mostly research. |
| UD (Unreliable Datagram) | No | No (1:N) | Multicast, some HPC patterns. |
| RD (Reliable Datagram) | Yes | No (1:N) | InfiniBand only. Niche. |
| XRC (eXtended RC) | Yes | Shared receive | Memory-efficient at scale. Some HPC. |

For AI training fabrics, RC is what you'll see. NCCL opens an RC QP between every pair of GPUs that need to talk. With 8 GPUs per server and 1,000 servers, that's a lot of QPs — but ConnectX-7 supports millions per NIC, so it scales.
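To put a number on "a lot of QPs", here is a back-of-the-envelope sketch. It assumes the worst case, a full mesh where every GPU holds one RC QP to every other GPU. That is an assumption for sizing only: the text says QPs are opened between GPUs "that need to talk", which is usually a much smaller set. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Upper bound: RC QPs one GPU needs if it connects to every other GPU. */
uint64_t toy_full_mesh_qps_per_gpu(uint64_t gpus_per_server, uint64_t servers) {
    uint64_t total_gpus = gpus_per_server * servers;
    assert(total_gpus > 0);
    return total_gpus - 1;
}

/* Same bound summed over all GPUs in one server. It also counts peers
 * inside the same server, so it overstates what the NICs would carry. */
uint64_t toy_full_mesh_qps_per_server(uint64_t gpus_per_server, uint64_t servers) {
    return gpus_per_server * toy_full_mesh_qps_per_gpu(gpus_per_server, servers);
}
```

With 8 GPUs per server and 1,000 servers, this gives 7,999 QPs per GPU and 63,992 per server, so even the full-mesh bound sits well below the "millions per NIC" figure quoted above.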


Memory Regions — the data path

Before a NIC can DMA into your memory, it has to know the memory is there and is allowed to be touched. This is memory registration.

When you register a memory region, three things happen:

  1. The kernel pins the pages (they can't be swapped out)
  2. The IOMMU is told the NIC can DMA-access these physical addresses
  3. The NIC returns two keys: lkey (local — used in send WRs to reference local buffers) and rkey (remote — given to the other side so they can RDMA-READ or RDMA-WRITE into your buffer)

Registration is slow (microseconds to milliseconds, depending on size). The app does it once at setup time, then reuses the MR for millions of operations. Pre-registering everything you'll ever touch is a standard pattern. NCCL pre-registers GPU HBM at job start; you never see a registration on the hot path.

The rkey is the access control. Whoever has your rkey can read/write that buffer remotely. Lose control of the rkey and you've handed someone direct memory access. (In practice, rkeys are short-lived and exchanged inside the trusted job.)
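The check a NIC performs on an incoming one-sided request can be modeled roughly as follows. This is a simplified sketch with hypothetical names, not how any particular NIC implements it. The `TOY_REMOTE_*` flags play the role of the real `IBV_ACCESS_REMOTE_READ` and `IBV_ACCESS_REMOTE_WRITE` permissions chosen at registration time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical permission bits for illustration. */
enum { TOY_REMOTE_READ = 1, TOY_REMOTE_WRITE = 2 };

/* Hypothetical record of one registered region as the NIC might track it. */
typedef struct {
    uintptr_t base;    /* start address of the registered buffer */
    size_t    length;  /* size of the registered buffer in bytes */
    uint32_t  rkey;    /* key handed to the remote peer */
    unsigned  access;  /* TOY_REMOTE_* bits granted at registration */
} toy_mr;

/* Accept a remote request only if the key matches, the permission was
 * granted, and [addr, addr + len) lies entirely inside the region. */
bool toy_remote_access_ok(const toy_mr *mr, uint32_t rkey,
                          uintptr_t addr, size_t len, unsigned wanted) {
    assert(mr != NULL);
    if (rkey != mr->rkey) return false;
    if ((mr->access & wanted) != wanted) return false;
    if (addr < mr->base) return false;
    size_t offset = (size_t)(addr - mr->base);
    /* Written this way to avoid overflow in offset + len. */
    if (offset > mr->length || len > mr->length - offset) return false;
    return true;
}
```

The point of the sketch is that nothing else stands between a remote peer and the buffer: there is no per-request check by the host CPU or kernel, so a leaked rkey with write permission is enough to modify the region.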


A complete worked example — RDMA WRITE

To send "hello world" from Host A to Host B with an RDMA WRITE:

  1. Both sides at setup:
    • Register an MR covering the buffer to write from (Host A) and the buffer to write into (Host B)
    • Exchange QP numbers, GIDs, PSNs, and Host B sends Host A its remote address + rkey — typically via librdmacm (covered on the next page) or a hand-rolled TCP socket
    • Transition both QPs through INIT → RTR → RTS states
QP state machine. A Queue Pair moves through RESET → INIT (local port and pkey set) → RTR Ready To Receive (peer QP number, GID, PSN now known) → RTS Ready To Send (local retry and timeout configured). ERR is reachable from any of the active states when something goes wrong. Both sides must reach RTS before data flows.
What `ibv_modify_qp` is doing under the hood. Setup gets a QP from RESET all the way to RTS. INIT → RTR is where the peer info plugs in; get the GID or QPN wrong and the QP stalls there.
  2. Host A posts the WR:

    ibv_post_send(qp, &wr, &bad_wr);

    The WR specifies: opcode = IBV_WR_RDMA_WRITE, local addr + lkey, remote addr + rkey, length.

  3. Host A NIC fires. Reads the WR, DMA-reads the payload from the local MR, sends RDMA WRITE packets on the wire. It posts a CQE on Host A's CQ once Host B's NIC acknowledges the WRITE.

  4. Host B NIC receives. Validates the rkey. DMA-writes the payload into Host B's MR at the specified address. No CQE appears on Host B's CQ: it's a one-sided operation, so Host B's app doesn't know it happened.

  5. Host A polls its CQ. Sees the completion. Considers the WRITE successful.

That's the entire thing. No send() on Host A. No recv() on Host B. The wire was talking to memory.
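The setup half of the example, QP bring-up, can be modeled as a small state machine. The sketch below encodes only the arrows described above (RESET → INIT → RTR → RTS, with ERR reachable from INIT, RTR, and RTS). The real `enum ibv_qp_state` has additional states and the specification allows more transitions than this toy does. All `toy_` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified mirror of the states named in the text; not enum ibv_qp_state. */
typedef enum { QPS_RESET, QPS_INIT, QPS_RTR, QPS_RTS, QPS_ERR } toy_qp_state;

/* True if the text's state diagram has an arrow from `from` to `to`. */
bool toy_transition_ok(toy_qp_state from, toy_qp_state to) {
    if (to == QPS_ERR)
        return from == QPS_INIT || from == QPS_RTR || from == QPS_RTS;
    switch (from) {
    case QPS_RESET: return to == QPS_INIT;  /* local port and pkey set */
    case QPS_INIT:  return to == QPS_RTR;   /* peer QPN, GID, PSN plugged in */
    case QPS_RTR:   return to == QPS_RTS;   /* retry and timeout configured */
    default:        return false;
    }
}

/* Data can be sent only once the QP has reached RTS. */
bool toy_can_post_send(toy_qp_state state) {
    return state == QPS_RTS;
}

/* Walk a QP from its current state to RTS, one legal step at a time.
 * Returns 0 on success, -1 if a step is not allowed from the current state. */
int toy_bring_up(toy_qp_state *state) {
    static const toy_qp_state path[] = { QPS_INIT, QPS_RTR, QPS_RTS };
    assert(state != NULL);
    for (size_t i = 0; i < sizeof path / sizeof path[0]; i++) {
        if (!toy_transition_ok(*state, path[i])) return -1;
        *state = path[i];
    }
    return 0;
}
```

In a real program each step in `toy_bring_up` would be one `ibv_modify_qp` call carrying the attributes for that transition. The toy only captures the ordering constraint, for instance that INIT cannot jump straight to RTS.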


💡 What you should remember

| # | Concept | Why it matters |
| --- | --- | --- |
| 1 | 🔗 QP = the connection (file descriptor analog) | RC is the dominant type for AI training. |
| 2 | 📍 MR = registered memory the NIC has direct DMA access to | Register once at setup; reuse forever. |
| 3 | 📨 WR = the unit of work | App posts to a queue; NIC processes asynchronously. |
| 4 | CQE = the completion event | App polls the CQ to find out a WR finished. |
| 5 | 🌐 GID = the RDMA L3 address (IPv6-shaped) | QP number = the L4 port analog. (GID, QP num) identifies an endpoint. |
| 6 | 🔑 One-sided RDMA (READ / WRITE) requires an rkey the remote side gave you | Whoever has the rkey has direct memory access. |
| 7 | 🚪 The hot path has no kernel involvement | The doorbell ring (MMIO) is the only kernel-adjacent operation, and it's a single 64-bit write. |
| 8 | 🔄 QP setup goes through states: RESET → INIT → RTR → RTS | Stalls at INIT usually mean wrong peer GID/QPN. |

Next: RDMA in Production — Reliability, Setup, Errors → — how RC reliability works in hardware, what librdmacm events you'll see in logs, the WR features that show up in NCCL traces, and the completion error codes operators triage.