Verbs, Queue Pairs, Memory Regions

You know sockets. socket(), bind(), listen(), connect(), send(), recv(). Connection-oriented, byte-stream, synchronous(-ish), kernel-mediated.

Verbs is the RDMA equivalent. It's the API every RDMA app programs against — same on InfiniBand, RoCE v1, RoCE v2, and iWARP. The vocabulary is different, the mental model is different, and the kernel mostly isn't in the path. Here's the translation.

The translation table

Sockets concept	Verbs equivalent
File descriptor	Queue Pair (QP) — represents a "connection"
`send()` / `recv()`	Work Request (WR) posted to a queue, processed by the NIC asynchronously
Socket buffer (in kernel)	Memory Region (MR) — registered in user space, NIC has direct DMA access
`recv()` returning bytes	Completion Queue Entry (CQE) posted to a Completion Queue (CQ)
TCP connection (3-way handshake, sequence numbers)	RC (Reliable Connection) mode QP, established via out-of-band exchange
UDP socket	UD (Unreliable Datagram) mode QP

A sockets app calls send() and waits for the kernel to do the work. A verbs app posts a work request to a queue and polls a completion queue later for the result. The NIC does the actual work in between. Asynchronous by design.

How one message moves

Figure: RDMA work request flow in six steps. Host A (sender): (1) app posts a work request to the send queue, (2) NIC reads the WR, (3) NIC DMA-reads the payload from the registered memory region, (4) NIC sends across the wire. Host B (receiver): (5) NIC DMA-writes the payload directly into the receiver's registered memory region, (6) a completion queue entry is posted for the app to poll.
Caption: One RDMA message, six steps. The CPU does only step 1, posting the work request. Everything else is the NIC.

The full sequence:

1. App posts a Work Request to a Send Queue, then rings the NIC's doorbell register with one MMIO write. The WR names the Memory Region that holds the payload, the payload length, and the operation (SEND / READ / WRITE). For READ and WRITE it also carries the remote address and rkey to target.
2. NIC reads the WR from the send queue. It's a small structure (typically ~64 bytes).
3. NIC DMA-reads the payload from the Memory Region. The CPU never copies the data.
4. NIC segments and transmits on the wire (native InfiniBand, or RoCE v2 over UDP/IP, depending on transport).
5. Remote NIC receives the packets, reassembles, and DMA-writes directly into the receiver's Memory Region.
6. For a SEND, the remote NIC consumes a receive WR the app posted earlier and posts a CQE to the receiver's Completion Queue. The receiving app polls the CQ and discovers the message arrived. A plain RDMA WRITE is one-sided and produces no CQE on the receiver. In both cases the sender gets its own CQE (for signaled WRs) once the operation completes.

The CPU's involvement: post the WR (step 1), and optionally poll the CQ (after step 6). Everything in between is the NIC and PCIe.
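
The post-then-poll pattern from the steps above can be sketched as a toy model in plain C. This is a simulation under stated assumptions, not libibverbs: every name here (`toy_wr`, `toy_cqe`, `toy_qp`, `toy_post_send`, `toy_nic_process`, `toy_poll_cq`) is hypothetical, a `memcpy` stands in for DMA plus the wire, and a single struct plays the role of both hosts.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_QUEUE_DEPTH 8   /* power of two, so unsigned index wraparound stays consistent */

/* Hypothetical stand-ins for a work request and a completion queue entry. */
struct toy_wr  { uint64_t wr_id; const void *src; void *dst; size_t len; };
struct toy_cqe { uint64_t wr_id; int status; };

struct toy_qp {
    struct toy_wr  sq[TOY_QUEUE_DEPTH];  /* send queue */
    struct toy_cqe cq[TOY_QUEUE_DEPTH];  /* completion queue */
    unsigned sq_head, sq_tail;           /* "NIC" consumes at head, app produces at tail */
    unsigned cq_head, cq_tail;           /* app consumes at head, "NIC" produces at tail */
};

/* App side: enqueue a WR and return immediately. Returns -1 if the queue is full. */
static int toy_post_send(struct toy_qp *qp, struct toy_wr wr)
{
    if (qp->sq_tail - qp->sq_head == TOY_QUEUE_DEPTH)
        return -1;
    qp->sq[qp->sq_tail++ % TOY_QUEUE_DEPTH] = wr;
    return 0;
}

/* "NIC" side: drain the send queue, move the bytes, and report completions. */
static void toy_nic_process(struct toy_qp *qp)
{
    while (qp->sq_head != qp->sq_tail &&
           qp->cq_tail - qp->cq_head < TOY_QUEUE_DEPTH) {
        struct toy_wr *wr = &qp->sq[qp->sq_head++ % TOY_QUEUE_DEPTH];
        memcpy(wr->dst, wr->src, wr->len);   /* stands in for DMA read, wire, DMA write */
        qp->cq[qp->cq_tail++ % TOY_QUEUE_DEPTH] = (struct toy_cqe){ wr->wr_id, 0 };
    }
}

/* App side: non-blocking poll. Returns 1 and fills *out if a completion is ready, else 0. */
static int toy_poll_cq(struct toy_qp *qp, struct toy_cqe *out)
{
    if (qp->cq_head == qp->cq_tail)
        return 0;
    *out = qp->cq[qp->cq_head++ % TOY_QUEUE_DEPTH];
    return 1;
}
```

The point of the model is the shape of the calls: `toy_post_send` returns before any data has moved, and the app only learns the outcome when `toy_poll_cq` later returns an entry carrying the same `wr_id`.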

Queue Pairs — the "connection"

A QP is a pair of queues: Send Queue (SQ) and Receive Queue (RQ). The "pair" part is because every QP has both — even if you never use one. A QP on Host A pairs with a QP on Host B via an out-of-band exchange (typically over TCP) of their QP numbers, PSNs, and GIDs.

QPs come in flavors:

QP type	Reliable?	Connection-oriented?	Used for
RC (Reliable Connection)	Yes	Yes (1:1)	Dominant in AI training. Most NCCL traffic.
UC (Unreliable Connection)	No	Yes (1:1)	Rare. Mostly research.
UD (Unreliable Datagram)	No	No (1:N)	Multicast, some HPC patterns.
RD (Reliable Datagram)	Yes	No (1:N)	InfiniBand only. Niche.
XRC (eXtended RC)	Yes	Shared receive	Memory-efficient at scale. Some HPC.

For AI training fabrics, RC is what you'll see. NCCL opens an RC QP between every pair of GPUs that need to talk. With 8 GPUs per server and 1,000 servers, that's a lot of QPs — but ConnectX-7 supports millions per NIC, so it scales.
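
To put a number on "a lot of QPs", here is a rough upper bound, assuming a full mesh in which every GPU holds one RC QP to every other GPU. That assumption is ours, for illustration only: NCCL connects only the GPUs that need to talk, so real counts are usually lower. The function names are hypothetical.

```c
#include <stdint.h>

/* Upper bound on RC QPs one GPU would hold in a full mesh: one per peer. */
static uint64_t full_mesh_qps_per_gpu(uint64_t gpus_per_server, uint64_t servers)
{
    uint64_t total = gpus_per_server * servers;
    return total > 0 ? total - 1 : 0;
}

/* Number of distinct GPU pairs in that full mesh: n * (n - 1) / 2. */
static uint64_t full_mesh_pairs(uint64_t gpus_per_server, uint64_t servers)
{
    uint64_t n = gpus_per_server * servers;
    return n > 0 ? n * (n - 1) / 2 : 0;
}
```

With 8 GPUs per server and 1,000 servers, the full-mesh assumption gives 7,999 QPs per GPU and 31,996,000 GPU pairs.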

Memory Regions — the data path

Before a NIC can DMA into your memory, it has to know the memory is there and is allowed to be touched. This is memory registration.

When you register a memory region, three things happen:

The kernel pins the pages (they can't be swapped out)
The NIC's address translation tables (and the IOMMU, if one is enabled) are programmed so the NIC can DMA to the underlying physical pages
The NIC returns two keys: lkey (local, used in the scatter/gather entries of your own send and receive WRs to reference local buffers) and rkey (remote, given to the other side so it can RDMA-READ or RDMA-WRITE into your buffer)

Registration is slow (microseconds to milliseconds, depending on size). The app does it once at setup time, then reuses the MR for millions of operations. Pre-registering everything you'll ever touch is a standard pattern. NCCL pre-registers GPU HBM at job start; you never see a registration on the hot path.

The rkey is the access control. Whoever has your rkey, and a QP connected to yours, can read or write that buffer remotely, within the remote-access permissions you set at registration. Lose control of the rkey and you've handed someone direct memory access. (In practice, rkeys are short-lived and exchanged inside the trusted job.)
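
The check a NIC applies to an incoming one-sided request can be sketched as a toy model. This is an illustration, not how any particular NIC implements it: `toy_mr`, `toy_access`, and `toy_check_remote_access` are hypothetical names, and the model assumes the request must match the rkey, stay inside the registered range, and ask only for permissions the MR was registered with.

```c
#include <stdint.h>

/* Hypothetical permission bits, loosely mirroring remote read/write access flags. */
enum toy_access { TOY_REMOTE_READ = 1, TOY_REMOTE_WRITE = 2 };

struct toy_mr {
    uint64_t addr;     /* start of the registered range */
    uint64_t length;   /* size of the registered range in bytes */
    uint32_t rkey;     /* key handed to the remote side */
    unsigned access;   /* bitmask of toy_access granted at registration */
};

/* Returns 1 if the remote request may proceed, 0 if it must be rejected. */
static int toy_check_remote_access(const struct toy_mr *mr, uint32_t rkey,
                                   uint64_t addr, uint64_t len, unsigned needed)
{
    if (rkey != mr->rkey)
        return 0;                              /* wrong key */
    if ((mr->access & needed) != needed)
        return 0;                              /* permission not granted */
    if (addr < mr->addr || len > mr->length)
        return 0;                              /* starts before the MR, or too long */
    if (addr - mr->addr > mr->length - len)
        return 0;                              /* runs past the end (overflow-safe) */
    return 1;
}
```

For example, an MR of 64 bytes at `0x1000` registered with `TOY_REMOTE_WRITE` accepts a 64-byte write at `0x1000` with the right rkey, and rejects the same request with a wrong rkey, a 32-byte write at `0x1030`, or any read.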

A complete worked example — RDMA WRITE

To send "hello world" from Host A to Host B with an RDMA WRITE:

Both sides at setup:
- Register an MR covering the buffer to write from (Host A) and the buffer to write into (Host B). Host B's MR must be registered with remote-write access, or the incoming WRITE is rejected
- Out-of-band exchange: QP numbers, GIDs, PSNs, and Host B sends Host A its remote address + rkey
- Transition both QPs through INIT → RTR → RTS states
Host A posts the WR:
```
ibv_post_send(qp, &wr, &bad_wr);
```
The WR specifies: opcode = IBV_WR_RDMA_WRITE, local addr + lkey, remote addr + rkey, length.
Host A NIC fires. Reads the WR, DMA-reads the payload from the local MR, sends RDMA WRITE packets on the wire. Posts a CQE on Host A's CQ once Host B's NIC has acknowledged the write.
Host B NIC receives. Validates rkey. DMA-writes the payload into Host B's MR at the specified address. No CQE on Host B's CQ — it's a one-sided operation; Host B's app doesn't know it happened.
Host A polls its CQ. Sees the completion. Considers the WRITE successful.

That's the entire thing. No send() on Host A. No recv() on Host B. The wire was talking to memory.

What you should remember

QP = the connection (file descriptor analog). RC is the dominant type for AI training.
MR = registered memory the NIC has direct DMA access to. Register once at setup; reuse forever.
WR = the unit of work. App posts to a queue; NIC processes asynchronously.
CQE = the completion event. App polls the CQ to find out a WR finished.
One-sided RDMA (READ / WRITE) requires an rkey the remote side gave you. Whoever has the rkey has direct memory access.
The hot path has no kernel involvement. Ringing the doorbell is a user-space MMIO write to a device page mapped into the process, typically a single small write.

Next section: InfiniBand, the native RDMA fabric. Then RoCE v2: the same IB transport on commodity Ethernet, and the fabric this curriculum picks.
