RDMA in Production — Reliability, Setup, Errors
You know the API from the previous page. Now the operational layer — what the NIC is actually doing under the hood, what you'll see in logs, and what breaks.
This page is for the network engineer who has to debug an RDMA cluster, not just understand the concepts.
How RC makes things reliable
"Reliable Connection" gets used as if it means the same thing as TCP. The mechanisms are similar but the implementation lives in hardware, not the kernel:
- PSN (Packet Sequence Number) — every packet within a QP carries a 24-bit PSN. The receiver tracks the next expected PSN.
- ACK / NAK — the receiving NIC sends ACKs for received packets. If the PSN jumps (gap detected), it returns a NAK so the sender retransmits from there.
- Retransmit count + timeout — if the sender's NIC doesn't hear an ACK within the configured timeout, it retransmits. After retry_count retries (typically 7), it gives up and reports IBV_WC_RETRY_EXC_ERR to the app.
- RNR (Receiver Not Ready) — if a SEND arrives but no RECV WR is pre-posted, the receiver returns RNR. The sender waits rnr_timer and retries. After rnr_retry failures, it reports IBV_WC_RNR_RETRY_EXC_ERR.
Why this matters for RoCE v2: in InfiniBand, the fabric is essentially lossless and these retransmit paths almost never fire. In RoCE v2 over Ethernet, the fabric can drop a packet (congestion, bit error, buffer overrun) and RC's PSN-gap retransmit is the safety net. PFC + ECN exist precisely so this safety net is rarely tested — when the fabric drops packets, RoCE v2 performance falls off a cliff because retransmits are coarse-grained (go-back-N, not selective).
The on-call takeaway: if you start seeing IBV_WC_RETRY_EXC_ERR in app logs, the first place to look is switch counters (drops, PFC pauses, ECN marks) — not the NIC, not the app. The fabric leaked a packet that PFC was supposed to backpressure.
Connection setup with librdmacm — what operators see
Before any data flows, RC QPs have to be wired up. There's a chicken-and-egg problem — you need the peer's GID and QP number to set up the QP, but you need a working connection to exchange them. The solution: bootstrap over a separate channel.
The de-facto standard is librdmacm (the RDMA Connection Manager). It looks like sockets — rdma_create_id, rdma_resolve_addr, rdma_resolve_route, rdma_connect, rdma_accept — but operates over the RDMA fabric and produces the GID/QP-num exchange as a byproduct.
The actual event sequence:
| Event | When | Common failure |
|---|---|---|
RDMA_CM_EVENT_ADDR_RESOLVED | After rdma_resolve_addr() finds the local NIC for the destination IP | DNS / routing problem; no IB device for this IP |
RDMA_CM_EVENT_ROUTE_RESOLVED | After rdma_resolve_route() finds a path to the peer | Subnet/GID problem; ARP-equivalent neighbor lookup failed |
RDMA_CM_EVENT_CONNECT_REQUEST | Server side, when a client calls rdma_connect() | (informational) |
RDMA_CM_EVENT_ESTABLISHED | Both sides, after the QP is wired up and in RTS state | (happy path) |
RDMA_CM_EVENT_REJECTED | Server rejected the connection | Wrong QP params, version mismatch, app-level reject |
RDMA_CM_EVENT_UNREACHABLE | Path to peer can't be established | Fabric down, peer down, MTU mismatch |
RDMA_CM_EVENT_DISCONNECTED | Connection went away after being established | Peer crashed, link flap exceeded retry window |
RDMA_CM_EVENT_DEVICE_REMOVAL | The local RDMA device went away (driver unload, hot remove) | Maintenance, driver crash |
Operator reflex:
- UNREACHABLE at connect time → almost always a fabric problem (BGP, routing, MTU)
- REJECTED → application-level — version mismatch, capability mismatch, intentional reject
- DISCONNECTED mid-run → start with link counters, then app logs. Could be either side.
NCCL skips librdmacm and uses its own TCP-based bootstrap socket. But every other RDMA app — ib_write_bw, mpirun, storage clients — uses librdmacm and you'll see these events in their logs.
WR features you'll see in production traces
Once past the basics, a handful of WR-level features show up constantly in NCCL traces, dmesg, and tuning guides. Worth knowing what they mean without having to write the code.
Scatter-Gather Entries (SGE)
A single WR can reference multiple non-contiguous buffers via an SGE list — up to a device-dependent limit (max_sge, often around 30). The NIC gathers (on send) or scatters (on receive) into all of them as one logical message.
AI workloads lean on this constantly — a gradient tensor may be split across multiple HBM regions (different layers, different micro-batches), and one WR gathers them into one wire message. The receive side scatters into the matching layout.
Inline data
For small payloads (typically ≤256 bytes, bounded by the device's max_inline_data), the bytes can be embedded directly in the WR itself instead of referenced via an SGE. No DMA-read step to fetch the payload → lower latency.
Used for: control messages, NCCL handshakes, the "I'm done" signals at the end of a collective. Anything where the message is small enough that the extra DMA round-trip to fetch the payload dominates wire time.
Signaled vs unsignaled WRs
By default every WR generates a CQE on completion. With the IBV_SEND_SIGNALED flag off (unsignaled mode), the NIC skips the CQE entirely.
Apps typically signal every Nth WR — say, every 16th. The signaled WR's completion implies all earlier unsignaled ones in the same QP also completed (because the QP is ordered). That cuts CQE generation and polling work by roughly N× on the hot path, which matters at message rates in the hundreds of millions per second.
IBV_WR_RDMA_WRITE_WITH_IMM
An RDMA WRITE that does generate a CQE on the receiver, carrying a 32-bit "immediate" value. Gives you one-sided WRITE performance with two-sided notification.
NCCL uses this to combine "data delivered" + "go" signal into one wire op. The 32-bit immediate carries a small piece of metadata (chunk ID, step number, etc.) that the receiver needs without an extra round trip.
IBV_WR_SEND_WITH_IMM
Same idea for two-sided SEND. The 32 bits land in the receiver's CQE alongside the normal completion.
The takeaway: you don't have to program these. You do have to recognize them when you see them in dmesg, in NCCL debug output, or in perfquery counters — and understand what behavior they imply.
Completion errors you'll actually see
When something goes wrong, ibv_poll_cq returns a Work Completion with a non-success status. Most production incidents land on one of these:
| Status code | What it means | Common cause |
|---|---|---|
IBV_WC_SUCCESS | The op finished cleanly | (the happy path) |
IBV_WC_RETRY_EXC_ERR | Sender retried retry_count times, never got an ACK | Fabric dropping packets, remote NIC unresponsive, link down, severe congestion that PFC isn't masking |
IBV_WC_RNR_RETRY_EXC_ERR | Sender retried rnr_retry times; receiver never had a RECV WR posted | Application bug — receiver fell behind on posting RECVs |
IBV_WC_LOC_PROT_ERR | Local protection error | SGE points outside a registered MR, or wrong lkey |
IBV_WC_REM_ACCESS_ERR | Remote rejected the WRITE/READ | rkey wrong, or permissions don't allow that op on the remote MR |
IBV_WC_WR_FLUSH_ERR | This WR was flushed because an earlier WR failed | Cascade — find the first error in the CQ, that's the real one |
IBV_WC_LOC_LEN_ERR | Local length error | SGE length wrong, or message exceeded MTU × max segments |
Operator triage decision tree
- RETRY_EXC_ERR → fabric problem. Check switch drops, PFC pauses, ECN marks, link error counters. Check both directions.
- RNR_RETRY_EXC_ERR → application problem. The remote app isn't posting RECVs fast enough. Network is innocent.
- REM_ACCESS_ERR → rkey or permissions. Usually a setup/orchestration bug (wrong rkey shared, MR registered with insufficient permissions).
- LOC_PROT_ERR / LOC_LEN_ERR → local application bug. Bad SGE pointer or wrong length.
- WR_FLUSH_ERR → noise. Find the real error earlier in the same CQ; everything after it gets flushed.
What you should remember
- RC reliability is hardware — PSN, ACK/NAK, retransmit, RNR. Mirrors TCP, runs in the NIC. Drops in RoCE v2 → RETRY_EXC_ERR.
- librdmacm events tell you where setup broke — UNREACHABLE ≈ fabric, REJECTED ≈ app config, DISCONNECTED ≈ peer/link gone.
- SGE, inline data, signaled/unsignaled, WRITE_WITH_IMM — features you'll see in NCCL traces and tuning docs. Recognize them; you don't have to write them.
- RETRY_EXC_ERR = fabric problem. Start with switch counters before blaming the NIC or the app.
- RNR_RETRY_EXC_ERR = app problem. The network is innocent; the remote app fell behind on posting RECVs.
- WR_FLUSH_ERR is noise — look upstream in the same CQ for the real error.
Next section: InfiniBand → — the native RDMA fabric. Then RoCE v2 — the same IB transport on commodity Ethernet, and the fabric this curriculum picks.