What RDMA Actually Does
RDMA stands for Remote Direct Memory Access. Three words, three ideas:
- Memory Access — reading and writing memory. Every program does this on every line. `*ptr = 5` is a memory write; `int x = *ptr` is a memory read. Normal stuff.
- Direct — the NIC does it directly, hardware-to-hardware. The CPU still initiates each operation (it tells the NIC "go" with a single 64-bit doorbell write), but it never touches the data that's being moved.
- Remote — the memory being read or written lives on another machine, across the network.
Put it together: the NIC on Host A reads and writes the memory of Host B directly, over the wire, without either CPU shoveling the data. That is the whole idea.
Important: RDMA is a technique, not a protocol or transport. It describes an approach — let the NIC read and write remote memory directly — that's implemented by two different wire protocols: InfiniBand (a separate fabric, not Ethernet) and RoCE v2 (RDMA wrapped in standard Ethernet). Applications see the same verbs API on both; only the bytes-on-the-wire differ. Each fabric gets its own section in this curriculum: InfiniBand and RoCE v2.
If this sounds unusual, it is. For 40 years of network engineering, the rule was "data goes through the kernel; the kernel mediates everything." RDMA breaks that rule. The reason it exists is simple — at modern AI-fabric speeds (400 Gbps and climbing toward 800 and 1.6 Tbps), the CPU literally cannot keep up. So we bypass it.
This page explains what RDMA does mechanically. The next page covers the API the application programs against (verbs, queue pairs, memory regions).
The TCP path (left side)
Sending one packet from Host A's app to Host B's app, the way you've done it for 20 years: send() triggers a syscall, the kernel copies the data into a socket buffer, the TCP/IP stack runs (checksum, sequence numbers, retransmit tracking), the driver writes a DMA descriptor, the NIC finally pushes bytes on the wire. Mirror that whole thing on the receive side.
Every step is a CPU tax. Context switch, memory copy, protocol processing, driver overhead. At 1 Gbps nobody cared. At 400 Gbps with 50+ million packets per second per NIC, the CPU runs out before the wire does. Doesn't matter how many cores you have — software can't keep up.
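To make that concrete, here's the back-of-envelope arithmetic behind the "50+ million packets per second" figure. The 1000-byte average frame size is an assumption chosen for round numbers, not a measured value:

```python
# Back-of-envelope: packet rate at 400 Gbps, and the CPU time budget per
# packet. Assumes an average frame of 1000 bytes (illustrative, not measured).
line_rate_bps = 400e9
frame_bytes = 1000

pps = line_rate_bps / (frame_bytes * 8)   # packets per second on the wire
ns_per_packet = 1e9 / pps                 # CPU budget if the CPU touched each one

print(f"{pps / 1e6:.0f} Mpps, {ns_per_packet:.0f} ns budget per packet")
```

Twenty nanoseconds is roughly the cost of a single L3 cache miss. A syscall plus a memcpy plus TCP processing is orders of magnitude more than that, which is why no number of cores closes the gap.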
That's the problem. RDMA is the answer.
The RDMA path (right side)
The app pre-registers a memory region with the NIC (one-time setup at job start). Then for every message: the app posts a Work Request to a Send Queue (one MMIO write — a single 64-bit store), the NIC reads the WR, DMA-reads the payload from the registered memory region, and sends it on the wire. The receiving NIC writes directly into the remote app's registered memory region. A Completion Queue Entry is posted on both sides so the apps know it happened.
The CPU does two things: register memory once (setup) and post work requests (one MMIO write per message). The NIC does everything else — reading the payload, segmenting it, sending it on the wire, receiving on the other side, writing it to remote memory, and posting the CQE.
No kernel. No context switch. No memcpy. No syscall per packet. The NIC and the app's memory are in a direct conversation.
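The control flow above can be sketched as a toy model. This is not the verbs API (that's the next page); the class and method names here (`MemoryRegion`, `Nic.post_send`) are illustrative only, but the division of labor matches the description: the app does one registration and one post, the NIC does everything else.

```python
# Toy model of the RDMA send path. Illustrative names, not the real verbs API.
class MemoryRegion:
    """App memory registered ("pinned") with the NIC, one-time setup."""
    def __init__(self, buf):
        self.buf = buf
        self.lkey = 0x1234  # made-up local key for illustration

class Nic:
    def __init__(self):
        self.send_queue = []        # Work Requests posted by the app
        self.completion_queue = []  # CQEs posted by the NIC

    def post_send(self, mr, offset, length):
        # The CPU's only per-message work: enqueue a descriptor (in real
        # hardware, followed by one doorbell MMIO write).
        self.send_queue.append((mr, offset, length))

    def process(self, remote_mem):
        # Everything in this method is NIC hardware in real life:
        # DMA-read the payload, move it over the wire, write it into the
        # remote registered region, post a CQE.
        while self.send_queue:
            mr, off, ln = self.send_queue.pop(0)
            payload = mr.buf[off:off + ln]   # DMA read from registered MR
            remote_mem[0:ln] = payload       # delivery into remote MR
            self.completion_queue.append(("SEND_OK", ln))

app_buf = bytearray(b"gradient-fragment")
mr = MemoryRegion(app_buf)            # one-time registration at job start
nic, remote = Nic(), bytearray(32)
nic.post_send(mr, 0, len(app_buf))    # per-message: a single post
nic.process(remote)                   # the NIC does the rest
print(nic.completion_queue[0], bytes(remote[:17]))
```

Note what never appears in the app's per-message path: no copy of the payload, no kernel call, no protocol code. The app touches a descriptor; the NIC touches the data.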
The three operations
RDMA has three main operations. The difference between them is who's involved on each side. Here's the at-a-glance comparison — with TCP shown first as the baseline you already know:
Pick the tab for the one you want to understand:
- 1. SEND / RECV — two-sided
- 2. READ — one-sided pull
- 3. WRITE — one-sided push
The classic message-passing model. The receiver has to post a receive buffer ahead of time, telling the NIC "when something comes in, put it here." The sender then does a SEND. The NIC delivers into the pre-posted buffer and signals completion on both sides.
Who's involved: Both CPUs — sender posts a SEND, receiver posts a RECV. Closest to traditional send() / recv().
Where you see it: Control messages, handshake exchanges, the "I'm done" signal at the end of a collective. Not the bulk gradient traffic during training.
The local NIC reads remote memory directly. The remote CPU is not involved at all — no interrupt, no completion event. The local NIC pulls the data over and delivers it to the local app via the CQ.
"Hey remote NIC, give me 64 KB starting at address X, key Y" — that's the entire RDMA READ.
Who's involved: Only the local CPU posts. Remote CPU never knows it happened.
Where you see it: Pulling data the remote side doesn't need to know about. Common in storage, less so in AI training.
The mirror of READ. The local NIC writes to remote memory directly. Remote CPU not involved. Local app posts the WR, the NIC handles the rest, the data appears in remote memory.
"Hey remote NIC, here's 64 KB, put it at address X, key Y" — that's the entire RDMA WRITE.
Who's involved: Only the local CPU posts. Remote CPU never knows it happened.
Where you see it: Everywhere in AI training. Every GPU writes gradient fragments directly into other GPUs' memory during AllReduce. No remote CPU spinning, no notification overhead. The receiving GPU's CPU literally doesn't know it received anything until it checks the memory.
The two one-sided operations (READ, WRITE) are why AI training works. No CPU on the receive side means no remote CPU budget consumed for the bulk data path — which is what makes 400 Gbps × thousands of NICs possible.
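One-sided semantics can be reduced to a sketch like the following. The "remote side" here is just a dict mapping rkeys to registered regions; the point of the model is that no remote code runs per operation, only a key check that the real NIC does in hardware. All names and key values are made up for illustration:

```python
# Toy sketch of one-sided READ/WRITE semantics. The remote CPU runs no
# code per operation; only the remote NIC validates the rkey and moves data.
remote_regions = {0xBEEF: bytearray(64)}   # rkey -> registered remote memory

def rdma_write(rkey, addr, data):
    """'Here's N bytes, put them at address addr, key rkey.'"""
    region = remote_regions[rkey]          # NIC-side rkey validation
    region[addr:addr + len(data)] = data   # direct write into remote memory

def rdma_read(rkey, addr, length):
    """'Give me N bytes starting at address addr, key rkey.'"""
    return bytes(remote_regions[rkey][addr:addr + length])

rdma_write(0xBEEF, 8, b"fragment")
print(rdma_read(0xBEEF, 8, 8))
```

A bad rkey raises an error in this model; on a real fabric it triggers a protection fault on the responder NIC and the operation is rejected, which is how memory stays safe despite the remote CPU never being consulted.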
What the CPU actually does
When network engineers first hear "RDMA bypasses the CPU," they reasonably ask: if so, why is the CPU shown active in the diagram above? The answer is in the granularity. The CPU is involved at three specific moments per operation — and that's it:
| When | What | How often | How much CPU |
|---|---|---|---|
| Setup | Register memory region with the NIC (pin pages, get lkey / rkey) | Once per region, at job start | Slow (μs–ms) — but one-time |
| Initiate | Post a Work Request — a single 64-bit MMIO write to the NIC's doorbell | Once per message | ~10–50 ns |
| Complete (optional) | Poll the Completion Queue, or wait for an interrupt | Once per message | ~10 ns per CQE |
What the CPU does NOT do:
- Segment / reassemble packets — the NIC does it
- Generate Ethernet / IP / UDP / BTH headers — the NIC does it
- DMA the payload — the NIC does it, directly from registered memory
- Handle ACKs, retransmits, sequence numbers — the NIC does it
So CPU involvement is per message, not per packet. That's the key asymmetry: at 400 Gbps with ~50 million packets per second, a per-packet CPU cost would crush even a 64-core machine. A per-message cost — where one message can carry many megabytes — is bounded and tiny. Even during an AllReduce with thousands of WRs per training step, the CPU spends only a few microseconds total on RDMA initiation. The rest of the time it's running the training code.
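Putting assumed numbers on that asymmetry (the ~30 ns post cost is taken from the table above; the 4000-WR step and the 1 µs per-packet kernel cost are illustrative assumptions):

```python
# Per-message vs per-packet CPU cost, with assumed illustrative numbers.
wr_per_step = 4000                       # assumed WRs in one AllReduce step
cpu_us_rdma = wr_per_step * 30 / 1000    # ~30 ns per doorbell write (table above)

packets = 50_000_000 * 2                 # 50 Mpps sustained over a 2 s phase
cpu_s_kernel = packets * 1e-6            # if each packet cost even 1 us of CPU

print(f"RDMA: {cpu_us_rdma:.0f} us of CPU; kernel path: {cpu_s_kernel:.0f} s of CPU")
```

A hundred seconds of CPU work cannot fit inside a two-second communication window no matter how it's parallelized; a hundred-odd microseconds fits anywhere.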
This asymmetry is the entire reason RDMA exists.
Why this matters at AI scale
A modern AI training step looks like this:
8,000 GPUs, all doing AllReduce, all writing fragments of gradients into each other's HBM at 400 Gbps per NIC, simultaneously, for ~2 seconds, every ~5 seconds, for weeks.
There is no CPU budget for that. None. The CPU is busy running the training code (Python, CUDA dispatch, kernel launches). The network has to happen without the CPU. RDMA is the only way that math works.
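The aggregate numbers implied by that scenario, using the figures quoted above:

```python
# Aggregate fabric load implied by the training-step numbers above.
gpus = 8000
gbps_per_nic = 400

total_tbps = gpus * gbps_per_nic / 1000           # aggregate injection bandwidth
tb_moved = total_tbps * 1e12 * 2 / 8 / 1e12       # bytes moved in one 2 s burst, in TB

print(f"{total_tbps:.0f} Tbps aggregate, {tb_moved:.0f} TB per burst")
```

That's 3.2 petabits per second of injection bandwidth, repeating every few seconds for weeks. Per-packet CPU processing at that scale is not a performance problem, it's a physical impossibility.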
This is also why GPUDirect RDMA matters: it lets the NIC read/write GPU HBM directly (not just host memory), so the data never even hops through CPU DRAM. The NIC and the GPU are in a direct conversation across PCIe.
What you should remember
- RDMA is a technique, not a protocol — implemented by InfiniBand and RoCE v2 on the wire.
- The NIC reads/writes remote memory directly. No kernel in the path; no syscall per packet.
- Two-sided (SEND/RECV) = both CPUs post buffers. Closest to traditional sockets.
- One-sided (READ/WRITE) = remote CPU not involved. The wire is talking to memory.
- CPU involvement is per-message, not per-packet — that's the asymmetry that makes RDMA scale.
- GPUDirect RDMA lets the NIC talk directly to GPU memory (HBM), skipping the CPU's DRAM entirely.
Next: Verbs, Queue Pairs, Memory Regions → — the API the app actually programs against, and how queue pairs and memory regions fit together.