RDMA in Production — Reliability, Setup, Errors

You know the API from the previous page. Now the operational layer — what the NIC is actually doing under the hood, what you'll see in logs, and what breaks.

This page is for the network engineer who has to debug an RDMA cluster, not just understand the concepts.


How RC makes things reliable

"Reliable Connection" gets used as if it means the same thing as TCP. The mechanisms are similar but the implementation lives in hardware, not the kernel:

  • PSN (Packet Sequence Number) — every packet within a QP carries a 24-bit PSN. The receiver tracks the next expected PSN.
  • ACK / NAK — the receiving NIC sends ACKs for received packets. If the PSN jumps (gap detected), it returns a NAK so the sender retransmits from there.
  • Retransmit count + timeout — if the sender's NIC doesn't hear an ACK within the configured timeout, it retransmits. After N retries (retry_count, typically 7), it gives up and reports IBV_WC_RETRY_EXC_ERR to the app.
  • RNR (Receiver Not Ready) — if a SEND arrives but no RECV WR is pre-posted, the receiver returns RNR. The sender waits rnr_timer and retries. After rnr_retry failures, it reports IBV_WC_RNR_RETRY_EXC_ERR. (Both retry knobs are set when the QP transitions to RTS — see the sketch after the figure.)
[Figure: sequence diagram of RC reliability under packet loss. The sender's NIC transmits packets with PSN 1, 2, 3, 4; PSN 2 is dropped on the wire. The receiver gets PSN 3, detects the PSN-2 gap, and returns a NAK referencing PSN 2. The sender's NIC retransmits PSN 2 and 3; the receiver returns a cumulative ACK through PSN 3, then a regular ACK for PSN 4. The whole exchange happens between the two NICs without CPU involvement.]
PSN tracking, NAK on gap, retransmit — all in NIC hardware. After `retry_count` retries with no ACK, the sender's WR completes with `IBV_WC_RETRY_EXC_ERR`.
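
Where these knobs live: a minimal sketch of the transition to RTS, assuming an RC QP that has already been moved through INIT and RTR. The values shown are common defaults, not recommendations for any particular fabric.

```c
#include <infiniband/verbs.h>
#include <stdio.h>

/* Move an RC QP from RTR to RTS, setting the reliability knobs. */
static int move_to_rts(struct ibv_qp *qp)
{
    struct ibv_qp_attr attr = {
        .qp_state      = IBV_QPS_RTS,
        .timeout       = 14,  /* ACK timeout = 4.096 us * 2^14 ≈ 67 ms */
        .retry_cnt     = 7,   /* transport retries before IBV_WC_RETRY_EXC_ERR */
        .rnr_retry     = 7,   /* 7 = retry forever; lower it to surface
                                 IBV_WC_RNR_RETRY_EXC_ERR instead of hanging */
        .sq_psn        = 0,   /* starting PSN for this send queue */
        .max_rd_atomic = 1,   /* outstanding RDMA READs/atomics as initiator */
    };
    int ret = ibv_modify_qp(qp, &attr,
                            IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT |
                            IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN |
                            IBV_QP_MAX_QP_RD_ATOMIC);
    if (ret)
        fprintf(stderr, "modify_qp to RTS failed: %d\n", ret);
    return ret;
}
```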

Why this matters for RoCE v2: in InfiniBand, the fabric is essentially lossless and these retransmit paths almost never fire. In RoCE v2 over Ethernet, the fabric can drop a packet (congestion, bit error, buffer overrun) and RC's PSN-gap retransmit is the safety net. PFC + ECN exist precisely so this safety net is rarely tested — when the fabric drops packets, RoCE v2 performance falls off a cliff because retransmits are coarse-grained (go-back-N, not selective).

The on-call takeaway: if you start seeing IBV_WC_RETRY_EXC_ERR in app logs, the first place to look is switch counters (drops, PFC pauses, ECN marks) — not the NIC, not the app. The fabric leaked a packet that PFC was supposed to backpressure.


Connection setup with librdmacm — what operators see

Before any data flows, RC QPs have to be wired up. There's a chicken-and-egg problem — you need the peer's GID and QP number to set up the QP, but you need a working connection to exchange them. The solution: bootstrap over a separate channel.

The de-facto standard is librdmacm (the RDMA Connection Manager). It looks like sockets — rdma_create_id, rdma_resolve_addr, rdma_resolve_route, rdma_connect, rdma_accept — but it runs over the RDMA fabric and performs the GID/QP-number exchange as part of establishing the connection.
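
In code, the socket-like shape is easy to see. A condensed client-side sketch, assuming dst_addr already holds the server's address; wait_for() is a local helper written here for brevity (not a librdmacm call), and error handling is trimmed:

```c
#include <rdma/rdma_cma.h>
#include <stdio.h>
#include <stdlib.h>

/* Block for one expected CM event; anything else (UNREACHABLE, REJECTED,
 * ...) is treated as fatal. Every CM event must be acked. */
static void wait_for(struct rdma_event_channel *ec,
                     enum rdma_cm_event_type want)
{
    struct rdma_cm_event *ev;
    if (rdma_get_cm_event(ec, &ev)) { perror("rdma_get_cm_event"); exit(1); }
    if (ev->event != want) {
        fprintf(stderr, "expected %s, got %s\n",
                rdma_event_str(want), rdma_event_str(ev->event));
        exit(1);
    }
    rdma_ack_cm_event(ev);
}

static void client_connect(struct sockaddr *dst_addr)
{
    struct rdma_event_channel *ec = rdma_create_event_channel();
    struct rdma_cm_id *id;
    rdma_create_id(ec, &id, NULL, RDMA_PS_TCP);

    rdma_resolve_addr(id, NULL, dst_addr, 2000 /* ms */);
    wait_for(ec, RDMA_CM_EVENT_ADDR_RESOLVED);   /* local NIC found */

    rdma_resolve_route(id, 2000 /* ms */);
    wait_for(ec, RDMA_CM_EVENT_ROUTE_RESOLVED);  /* path to peer found */

    /* ... allocate PD/CQ and call rdma_create_qp(id, pd, &init_attr) ... */

    struct rdma_conn_param param = { .retry_count = 7, .rnr_retry_count = 7 };
    rdma_connect(id, &param);                    /* CM REQ goes on the wire */
    wait_for(ec, RDMA_CM_EVENT_ESTABLISHED);     /* QP is in RTS; data flows */
}
```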

[Figure: swim-lane sequence diagram of librdmacm connection setup. The server calls rdma_listen() first. The client calls rdma_resolve_addr() and receives ADDR_RESOLVED, then rdma_resolve_route() and receives ROUTE_RESOLVED, then rdma_connect(), which sends a CM REQ over the wire. The server gets CONNECT_REQUEST, calls rdma_accept(), and a CM REP goes back. Both sides see ESTABLISHED — the QPs are in RTS and data can flow. Failure events shown at the bottom: UNREACHABLE (fabric/routing), REJECTED (server said no), DISCONNECTED (peer or link gone).]
Each of these events lands on your CM event channel — and usually in app logs too. Knowing which event fired tells you where setup broke.

The actual event sequence:

| Event | When | Common failure |
| --- | --- | --- |
| RDMA_CM_EVENT_ADDR_RESOLVED | After rdma_resolve_addr() finds the local NIC for the destination IP | DNS / routing problem; no IB device for this IP |
| RDMA_CM_EVENT_ROUTE_RESOLVED | After rdma_resolve_route() finds a path to the peer | Subnet/GID problem; ARP-equivalent neighbor lookup failed |
| RDMA_CM_EVENT_CONNECT_REQUEST | Server side, when a client calls rdma_connect() | (informational) |
| RDMA_CM_EVENT_ESTABLISHED | Both sides, after the QP is wired up and in RTS state | (happy path) |
| RDMA_CM_EVENT_REJECTED | Server rejected the connection | Wrong QP params, version mismatch, app-level reject |
| RDMA_CM_EVENT_UNREACHABLE | Path to peer can't be established | Fabric down, peer down, MTU mismatch |
| RDMA_CM_EVENT_DISCONNECTED | Connection went away after being established | Peer crashed, link flap exceeded retry window |
| RDMA_CM_EVENT_DEVICE_REMOVAL | The local RDMA device went away (driver unload, hot remove) | Maintenance, driver crash |

Operator reflex:

  • UNREACHABLE at connect time → almost always a fabric problem (BGP, routing, MTU)
  • REJECTED → application-level — version mismatch, capability mismatch, intentional reject
  • DISCONNECTED mid-run → start with link counters, then app logs. Could be either side.

NCCL skips librdmacm and uses its own TCP-based bootstrap socket. But many other RDMA applications — the perftest tools like ib_write_bw (when run with -R/--rdma_cm), MPI runtimes, storage clients — use librdmacm, and you'll see these events in their logs.


WR features you'll see in production traces

Once past the basics, a handful of WR-level features show up constantly in NCCL traces, dmesg, and tuning guides. Worth knowing what they mean without having to write the code.

Scatter-Gather Entries (SGE)

A single WR can reference multiple non-contiguous buffers via an SGE list — up to the device's max_sge, commonly around 30. The NIC gathers (on send) or scatters (on receive) across all of them as one logical message.

AI workloads lean on this constantly — a gradient tensor may be split across multiple HBM regions (different layers, different micro-batches), and one WR gathers them into one wire message. The receive side scatters into the matching layout.
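
At the verbs level, a gather is just num_sge > 1. A minimal sketch, assuming buf_a/buf_b with lengths len_a/len_b and their MRs mr_a/mr_b (registered earlier with ibv_reg_mr) already exist:

```c
/* One send WR referencing two non-contiguous buffers; the receiver
 * sees a single message. */
struct ibv_sge sge[2] = {
    { .addr = (uintptr_t)buf_a, .length = len_a, .lkey = mr_a->lkey },
    { .addr = (uintptr_t)buf_b, .length = len_b, .lkey = mr_b->lkey },
};
struct ibv_send_wr wr = {
    .wr_id      = 42,
    .sg_list    = sge,
    .num_sge    = 2,              /* the gather: both SGEs, one wire message */
    .opcode     = IBV_WR_SEND,
    .send_flags = IBV_SEND_SIGNALED,
};
struct ibv_send_wr *bad = NULL;
if (ibv_post_send(qp, &wr, &bad))
    perror("ibv_post_send");
```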

Inline data

For small payloads (typically ≤256 bytes, bounded by the QP's max_inline_data), the bytes can be embedded directly in the WR itself instead of pointed to via an SGE. No DMA-read step → lower latency.

Used for: control messages, NCCL handshakes, the "I'm done" signals at the end of a collective. Anything where the message is small enough that the extra DMA round-trip to fetch the payload dominates wire time.
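
The same SEND posted inline looks like this sketch — note the SGE carries no lkey, because with IBV_SEND_INLINE the NIC copies the payload out of the WR at post time:

```c
/* Inline send: payload is copied at post time, so `msg` can be reused as
 * soon as ibv_post_send() returns. Length must fit the QP's max_inline_data. */
char msg[16] = "step-7 done";
struct ibv_sge sge = { .addr = (uintptr_t)msg, .length = sizeof(msg) };
struct ibv_send_wr wr = {
    .sg_list    = &sge,
    .num_sge    = 1,
    .opcode     = IBV_WR_SEND,
    .send_flags = IBV_SEND_INLINE | IBV_SEND_SIGNALED,
};
struct ibv_send_wr *bad = NULL;
if (ibv_post_send(qp, &wr, &bad))
    perror("ibv_post_send");
```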

Signaled vs unsignaled WRs

By default (sq_sig_all = 1 at QP creation) every WR generates a CQE on completion. Create the QP with sq_sig_all = 0 and only WRs carrying the IBV_SEND_SIGNALED flag produce a CQE; the rest complete silently.

Apps typically signal every Nth WR — say, every 16th. The signaled WR's completion implies all earlier unsignaled ones in the same QP also completed (because the send queue completes in order). Signaling every 16th WR cuts CQE generation and polling roughly 16×, which matters at 200M-pps message rates.
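
A sketch of the every-16th pattern, assuming the QP was created with sq_sig_all = 0 and wr is a prepared send WR as in the gather sketch above:

```c
/* Signal every 16th WR. The CQE for WR #16 implies WRs #1-#15 completed
 * too, because the send queue completes in order. Caveat: don't recycle an
 * unsignaled WR's buffers until a later signaled completion covers them. */
for (uint64_t i = 0; i < num_wrs; i++) {
    wr.wr_id      = i;
    wr.send_flags = ((i + 1) % 16 == 0) ? IBV_SEND_SIGNALED : 0;
    if (ibv_post_send(qp, &wr, &bad))
        break;  /* send queue full or QP in error — poll the CQ, then retry */
}
```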

IBV_WR_RDMA_WRITE_WITH_IMM

An RDMA WRITE that does generate a CQE on the receiver, carrying a 32-bit "immediate" value. The receiver must have a RECV WR posted — the immediate consumes one. Gives you one-sided WRITE performance with two-sided notification.

NCCL uses this to combine "data delivered" + "go" signal into one wire op. The 32-bit immediate carries a small piece of metadata (chunk ID, step number, etc.) that the receiver needs without an extra round trip.
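
At the verbs level it's the same send WR with a different opcode plus the remote-buffer fields. In this sketch, chunk_id, remote_addr, and remote_rkey are illustrative names for values exchanged during setup:

```c
/* WRITE with immediate: the responder must have a RECV posted (the
 * immediate consumes one); its CQE shows opcode IBV_WC_RECV_RDMA_WITH_IMM
 * with the 32 bits in wc.imm_data, in network byte order. */
wr.opcode              = IBV_WR_RDMA_WRITE_WITH_IMM;
wr.imm_data            = htonl(chunk_id);   /* needs <arpa/inet.h> */
wr.wr.rdma.remote_addr = remote_addr;       /* from the peer's MR advertisement */
wr.wr.rdma.rkey        = remote_rkey;
if (ibv_post_send(qp, &wr, &bad))
    perror("ibv_post_send");
```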

IBV_WR_SEND_WITH_IMM

Same idea for two-sided SEND. The 32 bits land in the receiver's CQE alongside the normal completion.

The takeaway: you don't have to program these. You do have to recognize them when they show up in dmesg, in NCCL debug output, or in vendor tuning guides and traces — and understand what behavior they imply.


Completion errors you'll actually see

When something goes wrong, ibv_poll_cq returns a Work Completion with a non-success status. Most production incidents land on one of these:

| Status code | What it means | Common cause |
| --- | --- | --- |
| IBV_WC_SUCCESS | The op finished cleanly | (the happy path) |
| IBV_WC_RETRY_EXC_ERR | Sender retried retry_count times, never got an ACK | Fabric dropping packets, remote NIC unresponsive, link down, severe congestion that PFC isn't masking |
| IBV_WC_RNR_RETRY_EXC_ERR | Sender retried rnr_retry times; receiver never had a RECV WR posted | Application bug — receiver fell behind on posting RECVs |
| IBV_WC_LOC_PROT_ERR | Local protection error | SGE points outside a registered MR, or wrong lkey |
| IBV_WC_REM_ACCESS_ERR | Remote peer rejected the WRITE/READ | rkey wrong, or the remote MR's permissions don't allow that op |
| IBV_WC_WR_FLUSH_ERR | This WR was flushed because an earlier WR failed | Cascade — find the first error in the CQ; that's the real one |
| IBV_WC_LOC_LEN_ERR | Local length error | SGE length wrong, or an inbound message larger than the posted receive buffer |
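
A sketch of the polling loop this triage assumes — surface the first real error, skip the flush cascade:

```c
/* Drain up to 16 completions; after one WR fails, the QP enters the error
 * state and everything behind it completes with IBV_WC_WR_FLUSH_ERR. */
struct ibv_wc wc[16];
int n = ibv_poll_cq(cq, 16, wc);
for (int i = 0; i < n; i++) {
    if (wc[i].status == IBV_WC_SUCCESS) continue;
    if (wc[i].status == IBV_WC_WR_FLUSH_ERR) continue;  /* cascade noise */
    fprintf(stderr, "wr_id=%llu: %s (vendor_err=0x%x)\n",
            (unsigned long long)wc[i].wr_id,
            ibv_wc_status_str(wc[i].status), wc[i].vendor_err);
}
```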

Operator triage decision tree

  • RETRY_EXC_ERR → fabric problem. Check switch drops, PFC pauses, ECN marks, link error counters. Check both directions.
  • RNR_RETRY_EXC_ERR → application problem. The remote app isn't posting RECVs fast enough. Network is innocent.
  • REM_ACCESS_ERR → rkey or permissions. Usually a setup/orchestration bug (wrong rkey shared, MR registered with insufficient permissions).
  • LOC_PROT_ERR / LOC_LEN_ERR → local application bug. Bad SGE pointer or wrong length.
  • WR_FLUSH_ERR → noise. Find the real error earlier in the same CQ; everything after it gets flushed.

What you should remember

  • RC reliability is hardware — PSN, ACK/NAK, retransmit, RNR. Mirrors TCP, runs in the NIC. Drops in RoCE v2 → RETRY_EXC_ERR.
  • librdmacm events tell you where setup broke — UNREACHABLE ≈ fabric, REJECTED ≈ app config, DISCONNECTED ≈ peer/link gone.
  • SGE, inline data, signaled/unsignaled, WRITE_WITH_IMM — features you'll see in NCCL traces and tuning docs. Recognize them; you don't have to write them.
  • RETRY_EXC_ERR = fabric problem. Start with switch counters before blaming the NIC or the app.
  • RNR_RETRY_EXC_ERR = app problem. The network is innocent; the remote app fell behind on posting RECVs.
  • WR_FLUSH_ERR is noise — look upstream in the same CQ for the real error.

Next section: InfiniBand — the native RDMA fabric. Then RoCE v2 — the same IB transport on commodity Ethernet, and the fabric this curriculum picks.