RDMA in Production — Reliability, Setup, Errors
You know the API from the previous page. Now the operational layer — what the NIC is actually doing under the hood, what you'll see in logs, and what breaks.
This page is for the network engineer who has to debug an RDMA cluster, not just understand the concepts.
How RC makes things reliable
"Reliable Connection" gets used as if it means the same thing as TCP. The mechanisms are similar but the implementation lives in hardware, not the kernel:
- PSN (Packet Sequence Number) — every packet within a QP carries a 24-bit PSN. The receiver tracks the next expected PSN.
- ACK / NAK — the receiving NIC sends ACKs for received packets. If the PSN jumps (gap detected), it returns a NAK so the sender retransmits from there.
- Retransmit count + timeout — if the sender's NIC doesn't hear an ACK within the configured timeout, it retransmits. After retry_count retries (typically 7), it gives up and reports IBV_WC_RETRY_EXC_ERR to the app.
- RNR (Receiver Not Ready) — if a SEND arrives but no RECV WR is pre-posted, the receiver returns RNR. The sender waits rnr_timer and retries. After rnr_retry failures, it reports IBV_WC_RNR_RETRY_EXC_ERR.
Why this matters for RoCE v2: in InfiniBand, the fabric is essentially lossless and these retransmit paths almost never fire. In RoCE v2 over Ethernet, the fabric can drop a packet (congestion, bit error, buffer overrun) and RC's PSN-gap retransmit is the safety net. PFC + ECN exist precisely so this safety net is rarely tested — when the fabric drops packets, RoCE v2 performance falls off a cliff because retransmits are coarse-grained (go-back-N, not selective).
The on-call takeaway: if you start seeing IBV_WC_RETRY_EXC_ERR in app logs, the first place to look is switch counters (drops, PFC pauses, ECN marks) — not the NIC, not the app. The fabric leaked a packet that PFC was supposed to backpressure.
Connection setup with librdmacm — what operators see
Before any data flows, RC QPs have to be wired up. There's a chicken-and-egg problem — you need the peer's GID and QP number to set up the QP, but you need a working connection to exchange them. The solution: bootstrap over a separate channel.
The de-facto standard is librdmacm (the RDMA Connection Manager). It looks like sockets — rdma_create_id, rdma_resolve_addr, rdma_resolve_route, rdma_connect, rdma_accept — but operates over the RDMA fabric and produces the GID/QP-num exchange as a byproduct.
The actual event sequence:
| Event | When | Common failure |
|---|---|---|
RDMA_CM_EVENT_ADDR_RESOLVED | After rdma_resolve_addr() finds the local NIC for the destination IP | DNS / routing problem; no IB device for this IP |
RDMA_CM_EVENT_ROUTE_RESOLVED | After rdma_resolve_route() finds a path to the peer | Subnet/GID problem; ARP-equivalent neighbor lookup failed |
RDMA_CM_EVENT_CONNECT_REQUEST | Server side, when a client calls rdma_connect() | (informational) |
RDMA_CM_EVENT_ESTABLISHED | Both sides, after the QP is wired up and in RTS state | (happy path) |
RDMA_CM_EVENT_REJECTED | Server rejected the connection | Wrong QP params, version mismatch, app-level reject |
RDMA_CM_EVENT_UNREACHABLE | Path to peer can't be established | Fabric down, peer down, MTU mismatch |
RDMA_CM_EVENT_DISCONNECTED | Connection went away after being established | Peer crashed, link flap exceeded retry window |
RDMA_CM_EVENT_DEVICE_REMOVAL | The local RDMA device went away (driver unload, hot remove) | Maintenance, driver crash |
Operator reflex:
- UNREACHABLE at connect time → almost always a fabric problem (BGP, routing, MTU)
- REJECTED → application-level — version mismatch, capability mismatch, intentional reject
- DISCONNECTED mid-run → start with link counters, then app logs. Could be either side.
NCCL skips librdmacm and uses its own TCP-based bootstrap socket. But every other RDMA app — ib_write_bw, mpirun, storage clients — uses librdmacm and you'll see these events in their logs.
WR features you'll see in production traces
Once past the basics, a handful of WR-level features show up constantly in NCCL traces, dmesg, and tuning guides. Worth knowing what they mean without having to write the code.
Scatter-Gather Entries (SGE)
A single WR can reference multiple non-contiguous buffers via an SGE list — up to a device-dependent limit (max_sge, often around 30). The NIC gathers (on send) or scatters (on receive) into all of them as one logical message.
AI workloads lean on this constantly — a gradient tensor may be split across multiple HBM regions (different layers, different micro-batches), and one WR gathers them into one wire message. The receive side scatters into the matching layout.
Inline data
For small payloads (typically ≤256 bytes, bounded by the device's max_inline_data), the bytes can be embedded directly in the WR itself instead of referenced via an SGE. No DMA-read step to fetch the payload → lower latency.
Used for: control messages, NCCL handshakes, the "I'm done" signals at the end of a collective. Anything where the message is small enough that the extra DMA round-trip to fetch the payload dominates wire time.
Signaled vs unsignaled WRs
By default every WR generates a CQE on completion. With the IBV_SEND_SIGNALED flag off (unsignaled mode), the NIC skips the CQE entirely.
Apps typically signal every Nth WR — say, every 16th. The signaled WR's completion implies all earlier unsignaled ones in the same QP also completed (because the QP is ordered). That cuts CQE generation and polling work by roughly N× on the hot path, which matters at message rates in the hundreds of millions per second.
IBV_WR_RDMA_WRITE_WITH_IMM
An RDMA WRITE that does generate a CQE on the receiver, carrying a 32-bit "immediate" value. Gives you one-sided WRITE performance with two-sided notification.
NCCL uses this to combine "data delivered" + "go" signal into one wire op. The 32-bit immediate carries a small piece of metadata (chunk ID, step number, etc.) that the receiver needs without an extra round trip.
IBV_WR_SEND_WITH_IMM
Same idea for two-sided SEND. The 32 bits land in the receiver's CQE alongside the normal completion.
The takeaway: you don't have to program these. You do have to recognize them when you see them in dmesg, in NCCL debug output, or in perfquery counters — and understand what behavior they imply.
Completion errors you'll actually see
When something goes wrong, ibv_poll_cq returns a Work Completion with a non-success status. Most production incidents land on one of these:
| Status code | What it means | Common cause |
|---|---|---|
IBV_WC_SUCCESS | The op finished cleanly | (the happy path) |
IBV_WC_RETRY_EXC_ERR | Sender retried retry_count times, never got an ACK | Fabric dropping packets, remote NIC unresponsive, link down, severe congestion that PFC isn't masking |
IBV_WC_RNR_RETRY_EXC_ERR | Sender retried rnr_retry times; receiver never had a RECV WR posted | Application bug — receiver fell behind on posting RECVs |
IBV_WC_LOC_PROT_ERR | Local protection error | SGE points outside a registered MR, or wrong lkey |
IBV_WC_REM_ACCESS_ERR | Remote rejected the WRITE/READ | rkey wrong, or permissions don't allow that op on the remote MR |
IBV_WC_WR_FLUSH_ERR | This WR was flushed because an earlier WR failed | Cascade — find the first error in the CQ, that's the real one |
IBV_WC_LOC_LEN_ERR | Local length error | SGE length wrong, or message exceeded MTU × max segments |
Operator triage decision tree
- RETRY_EXC_ERR → fabric problem. Check switch drops, PFC pauses, ECN marks, link error counters. Check both directions.
- RNR_RETRY_EXC_ERR → application problem. The remote app isn't posting RECVs fast enough. Network is innocent.
- REM_ACCESS_ERR → rkey or permissions. Usually a setup/orchestration bug (wrong rkey shared, MR registered with insufficient permissions).
- LOC_PROT_ERR / LOC_LEN_ERR → local application bug. Bad SGE pointer or wrong length.
- WR_FLUSH_ERR → noise. Find the real error earlier in the same CQ; everything after it gets flushed.
What you should remember
- RC reliability is hardware — PSN, ACK/NAK, retransmit, RNR. Mirrors TCP, runs in the NIC. Drops in RoCE v2 → RETRY_EXC_ERR.
- librdmacm events tell you where setup broke — UNREACHABLE ≈ fabric, REJECTED ≈ app config, DISCONNECTED ≈ peer/link gone.
- SGE, inline data, signaled/unsignaled, WRITE_WITH_IMM — features you'll see in NCCL traces and tuning docs. Recognize them; you don't have to write them.
- RETRY_EXC_ERR = fabric problem. Start with switch counters before blaming the NIC or the app.
- RNR_RETRY_EXC_ERR = app problem. The network is innocent; the remote app fell behind on posting RECVs.
- WR_FLUSH_ERR is noise — look upstream in the same CQ for the real error.
Next section: InfiniBand → — the native RDMA fabric. Then RoCE v2 — the same IB transport on commodity Ethernet, and the fabric this curriculum picks.