
How a RoCE v2 Transaction Actually Flows

The previous two pages explained the pieces — that RoCE v2 is IB's transport on Ethernet, and that PFC + ECN + DCQCN exist to make Ethernet lossless enough.

But what does it actually look like when one ibv_post_send() becomes 256 packets on the wire and lands as bytes in remote memory? This page connects the dots.


What's borrowed from where

The single most important thing to internalize: RoCE v2 didn't invent a new RDMA protocol. It copy-pasted InfiniBand's top two layers onto Ethernet's bottom four layers. Same verbs API. Same transport. Different wire underneath.

Three-column stack comparison. Left column native InfiniBand: verbs API, IB transport (BTH + RETH + AETH), IB link layer (LRH + credit-based flow control), IB physical. Right column native TCP/IP: sockets API, TCP, IP, Ethernet MAC, Ethernet PHY. Middle column RoCE v2: verbs API (FROM IB), transport BTH+RETH+AETH (FROM IB, same bytes as native IB), UDP port 4791 (FROM IP), IP (FROM IP), Ethernet MAC (FROM Ethernet), Ethernet PHY + PFC + ECN (FROM Ethernet). Dashed arrows show the borrows — IB contributes top two layers, IP contributes UDP + IP, Ethernet contributes the bottom two layers.
The middle column is just the left column's top two layers grafted onto the right column's bottom four, with UDP standing in for TCP. App code is identical to IB. Switching is identical to Ethernet.

Read this diagram once and the rest of the page falls into place. The middle column isn't a new design — it's a recombination.

What each piece contributes:

  • From InfiniBand: the API (ibv_post_send and friends), the asynchronous work-request model, and the transport headers — BTH (Base Transport Header), RETH (RDMA Extended Transport Header), AETH (ACK Extended Transport Header). These headers are byte-for-byte identical between IB and RoCE v2 (their layouts are sketched below) — that's why applications move between the two without changing a line of code.

  • From Ethernet / IP: physical layer, MAC, IP routing (BGP, ECMP, the whole leaf-spine you already know), and UDP demultiplexing (port 4791 identifies RoCE v2 to the NIC).

  • The patch: PFC + ECN bolted onto Ethernet to substitute for what IB's link-layer credit-based flow control did for free.
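
Those transport headers are small, fixed-size structures. A sketch of their field layout in C — sizes are the ones the IBTA spec defines, the 24-bit fields are shown as byte arrays, and everything is big-endian on the wire; this is an illustration, not a parser:

    #include <stdint.h>

    struct bth {               /* Base Transport Header, 12 B, in every packet   */
        uint8_t  opcode;       /* e.g. RDMA WRITE First / Middle / Last          */
        uint8_t  flags;        /* solicited event, migration, pad count, version */
        uint16_t pkey;         /* partition key                                  */
        uint8_t  resv;
        uint8_t  dest_qp[3];   /* destination QP number (24 bits)                */
        uint8_t  ack_psn[4];   /* AckReq bit + 24-bit Packet Sequence Number     */
    };

    struct reth {              /* RDMA Extended TH, 16 B, first packet of a WRITE */
        uint64_t vaddr;        /* remote virtual address to write to             */
        uint32_t rkey;         /* remote key authorizing the access              */
        uint32_t dma_len;      /* total transfer length                          */
    };

    struct aeth {              /* ACK Extended TH, 4 B, carried in ACK / NAK     */
        uint8_t  syndrome;     /* ACK, NAK (sequence error), or RNR NAK          */
        uint8_t  msn[3];       /* message sequence number (24 bits)              */
    };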


The reliability split

This is the part that confuses people: if Ethernet drops packets, and RoCE v2 runs on Ethernet, what makes RDMA reliable?

Answer: InfiniBand's transport layer carries the reliability — even when it rides on Ethernet.

  • PSN (24-bit Packet Sequence Number) is in every BTH header. The receiver tracks the next expected PSN per QP.
  • ACK (in AETH) tells the sender "everything up to PSN N arrived." Cumulative, like TCP.
  • NAK (in AETH) tells the sender "I expected PSN X, got PSN Y — gap, retransmit." The sender retransmits from X onward (go-back-N).
  • Retransmit timer in NIC hardware fires if no ACK arrives within local_ack_timeout. After retry_count retransmits, the WR completes with IBV_WC_RETRY_EXC_ERR.

All of this lives in the IB transport bytes — the same BTH/AETH that native IB uses. The Ethernet layer doesn't know any of this is happening; it's just shipping packets.

The deal RoCE v2 makes with Ethernet: "I'll handle reliability in my transport (IB's job). You don't lose too many packets (PFC + ECN's job). If you do lose one, my retransmit safety net catches it — but it's coarse-grained and slow, so we both prefer you don't drop."
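
The local_ack_timeout and retry_count mentioned above are ordinary verbs QP attributes, set when the QP is moved to the RTS state. A minimal sketch, assuming a connected RC QP and omitting error handling — the values shown are common defaults, not tuning advice:

    #include <infiniband/verbs.h>

    /* Sketch: the RTS transition is where the retransmit knobs live. */
    static int move_to_rts(struct ibv_qp *qp)
    {
        struct ibv_qp_attr attr = {
            .qp_state      = IBV_QPS_RTS,
            .timeout       = 14,  /* local ACK timeout: 4.096 µs × 2^14 ≈ 67 ms */
            .retry_cnt     = 7,   /* retransmits before IBV_WC_RETRY_EXC_ERR    */
            .rnr_retry     = 7,   /* receiver-not-ready retries (7 = infinite)  */
            .sq_psn        = 0,   /* starting PSN for this send queue           */
            .max_rd_atomic = 1,   /* outstanding RDMA READ / atomic operations  */
        };

        return ibv_modify_qp(qp, &attr,
                             IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT |
                             IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN |
                             IBV_QP_MAX_QP_RD_ATOMIC);
    }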


Anatomy of a transaction — ibv_post_send to bytes in memory

Now the full picture. App calls ibv_post_send with one WR — opcode RDMA_WRITE, length 1 MB, remote address R, rkey K. What happens between that call and the receiver having 1 MB of new data?

Four-phase end-to-end RoCE v2 WRITE walk-through. Phase A: sender app calls ibv_post_send with RDMA_WRITE, local addr L, remote addr R, rkey K, length 1MB; NIC reads the WR via doorbell, DMA-reads payload from local memory region (pinned, IOMMU-mapped). Phase B: NIC segments 1 MB into 256 × 4 KB chunks (MTU = 4096), each with a sequential PSN 0-255. First packet has opcode RDMA_WRITE_FIRST with RETH carrying raddr+rkey+length; middle packets RDMA_WRITE_MIDDLE with BTH only; last RDMA_WRITE_LAST with AckReq flag. Each packet wrapped in UDP port 4791 over IP over Ethernet, with ICRC trailer. Phase C: 256 packets stream on the wire, switches forward by IP and Ethernet headers (using UDP src-port for ECMP entropy), PFC + ECN keep the priority lossless. Phase D: receiver NIC validates rkey + bounds per packet, DMAs each payload into remote memory at R + PSN×4096; last packet triggers a single ACK back; sender posts a CQE, app polls and sees IBV_WC_SUCCESS. One WR in, one CQE out, 256 packets done in ~25 µs.
App makes one library call. NIC + IB transport handles chunking, ordering, reliability, ACK. No kernel. No copies. ~25 µs for 1 MB at 400 Gbps.

The four phases in detail:

Phase A: One WR, one DMA, one doorbell

The app's call to ibv_post_send does exactly two things at the hardware level:

  1. Writes the WR structure into the QP's send queue (in user-space memory shared with the NIC — no kernel call). The WR is ~64 bytes: opcode, addresses, lengths, keys.
  2. Rings the doorbell — a single 64-bit MMIO write to a NIC register telling the NIC "new work in this QP."

That's it. The CPU's part is done. From here, the NIC owns the transaction.
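
In code, all of Phase A is one call. A minimal sketch, assuming the RC QP is already connected, the local buffer is registered (lkey), and the remote address + rkey were exchanged out of band — the function name is just for illustration:

    #include <stdint.h>
    #include <infiniband/verbs.h>

    /* Post one 1 MB RDMA WRITE work request and ring the doorbell. */
    static int post_write_1mb(struct ibv_qp *qp, void *local_buf, uint32_t lkey,
                              uint64_t remote_addr, uint32_t rkey)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)local_buf,  /* pinned, registered memory   */
            .length = 1 << 20,               /* 1 MB in a single SGE        */
            .lkey   = lkey,
        };
        struct ibv_send_wr wr = {
            .wr_id      = 0x42,              /* echoed back in the CQE      */
            .sg_list    = &sge,
            .num_sge    = 1,
            .opcode     = IBV_WR_RDMA_WRITE,
            .send_flags = IBV_SEND_SIGNALED, /* ask for a CQE on completion */
            .wr.rdma    = { .remote_addr = remote_addr, .rkey = rkey },
        };
        struct ibv_send_wr *bad_wr = NULL;

        /* Writes the WR into the send queue and does the MMIO doorbell write. */
        return ibv_post_send(qp, &wr, &bad_wr);
    }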

Phase B: Chunking is the NIC's job

The NIC reads the WR, sees length = 1 MB, sees the QP's negotiated path MTU = 4096, and decides: 256 packets. Each gets a sequential PSN. Each gets a BTH header with the destination QP's number on the remote NIC. The first one also gets a RETH header carrying the remote address + rkey — so the receiver knows where to write — and the total length.

Position              Opcode              What's in the headers              Payload
First (PSN 0)         RDMA_WRITE_FIRST    BTH + RETH (raddr, rkey, length)   4096 B
Middle (PSN 1..254)   RDMA_WRITE_MIDDLE   BTH only                           4096 B
Last (PSN 255)        RDMA_WRITE_LAST     BTH with AckReq flag set           4096 B

Why does only the first packet carry RETH? Bandwidth. RETH is 16 bytes — including it on every chunk would burn ~4 GB of header overhead per TB transferred. The receiver's NIC remembers (raddr, rkey, base_PSN) from the first packet and computes the write address for each subsequent chunk as raddr + (PSN - base_PSN) × MTU.
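
That segmentation arithmetic is simple enough to model in a few lines. Purely illustrative — on real hardware this runs in the NIC's transmit and receive pipelines, not in software, and the address and PSN values below are made up:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint32_t mtu      = 4096;            /* negotiated path MTU             */
        uint32_t length   = 1 << 20;         /* 1 MB work request               */
        uint64_t raddr    = 0x100000;        /* remote address (example value)  */
        uint32_t base_psn = 1000;            /* QP's current send PSN (example) */

        uint32_t npkts = (length + mtu - 1) / mtu;        /* 256 packets        */

        for (uint32_t i = 0; i < npkts; i++) {
            uint32_t psn  = (base_psn + i) & 0xFFFFFF;    /* PSN is 24 bits     */
            uint64_t dest = raddr + (uint64_t)i * mtu;    /* receiver offset    */
            const char *op = (i == 0)         ? "RDMA_WRITE_FIRST  (BTH + RETH)"
                           : (i == npkts - 1) ? "RDMA_WRITE_LAST   (BTH, AckReq)"
                           :                    "RDMA_WRITE_MIDDLE (BTH only)";
            if (i < 2 || i == npkts - 1)      /* print a few representative rows */
                printf("pkt %3u  PSN %7u  -> 0x%llx  %s\n",
                       i, psn, (unsigned long long)dest, op);
        }
        return 0;
    }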

Phase C: 256 packets on the wire

Each chunk is wrapped in UDP/IP/Ethernet and shot onto the wire. The wire format:

[ Ethernet 14 B | IP 20 B | UDP 8 B (dst=4791) | BTH 12 B | (RETH 16 B, first only) | payload | ICRC 4 B ]

Things to notice on the wire:

  • Switches don't know it's RDMA. From the switch's view this is normal UDP traffic with dst port 4791. ECMP hashing uses the UDP src port (which NCCL/the NIC varies for entropy) to spread chunks across spine paths.
  • The packets are marked with the priority class your QoS configuration maps them to — typically DSCP 26 → traffic class 3 — and PFC + ECN are active on that class. That's the "no-drop" priority.
  • At line rate, 1 MB at 400 Gbps takes ~21 µs to serialize; add header overhead and a couple of switch hops and the transfer lands near the ~25 µs figure, plus the ACK round-trip — the arithmetic is sketched below.
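
That timing claim is easy to sanity-check from the header sizes above. A back-of-the-envelope sketch (it ignores the Ethernet preamble/IFG and the first packet's extra 16 B of RETH):

    #include <stdio.h>

    int main(void)
    {
        double payload  = 4096.0;                /* bytes per packet            */
        double overhead = 14 + 20 + 8 + 12 + 4;  /* Eth + IP + UDP + BTH + ICRC */
        double npkts    = 256;                   /* 1 MB / 4 KB                 */
        double bits     = npkts * (payload + overhead) * 8.0;
        double line     = 400e9;                 /* 400 Gbps                    */

        printf("serialization: %.1f us\n", bits / line * 1e6);   /* ~21 us */
        /* Add a couple of switch hops and the ACK round-trip and the
           end-to-end figure lands in the ~25 us ballpark. */
        return 0;
    }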

Phase D: Receiver reassembles, ACK closes the loop

The receiver NIC processes each packet in hardware:

  1. Validates the rkey — does this remote key allow WRITE access to this MR?
  2. Checks the PSN is the next expected one. If yes, proceed. If gap, fire NAK.
  3. Computes the destination address: raddr + (PSN - base_PSN) × MTU.
  4. DMAs the 4 KB payload into the right offset in the receiver's memory region.
  5. On the last packet (AckReq flag set), sends one ACK packet back.

Crucially: the receiver's CPU is never told this happened. No CQE on the receiver's CQ. The bytes simply appear at address R. (For applications that need notification, there's RDMA_WRITE_WITH_IMM — see the RDMA section.)

The ACK on the way back is tiny — just BTH + AETH, ~50 bytes total. The sender NIC sees it, marks the WR as complete, and posts a CQE to the local CQ. The app's next ibv_poll_cq sees IBV_WC_SUCCESS and considers the WRITE done.
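
On the sender, "polls and sees IBV_WC_SUCCESS" is a short loop. A sketch, assuming the WR was posted with IBV_SEND_SIGNALED so a CQE is generated:

    #include <stdio.h>
    #include <infiniband/verbs.h>

    /* Busy-poll the send CQ until the WRITE's completion shows up. */
    static int wait_for_write(struct ibv_cq *cq)
    {
        struct ibv_wc wc;
        int n;

        do {
            n = ibv_poll_cq(cq, 1, &wc);     /* non-blocking; 0 = nothing yet */
        } while (n == 0);

        if (n < 0)
            return -1;                       /* CQ in error                   */
        if (wc.status != IBV_WC_SUCCESS) {
            /* e.g. IBV_WC_RETRY_EXC_ERR if retry_count was exhausted */
            fprintf(stderr, "WR %llu failed: %s\n",
                    (unsigned long long)wc.wr_id, ibv_wc_status_str(wc.status));
            return -1;
        }
        return 0;                            /* the 1 MB is in remote memory  */
    }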

One WR in. One CQE out. 256 packets in flight in between. ~25 µs. That's the whole transaction.


What if a packet is lost?

This is where the IB transport's reliability machinery earns its keep. PFC and ECN exist precisely so this doesn't happen — but when it does:

Sequence diagram of RC reliability under packet loss. Sender NIC transmits packets PSN 1, 2, 3, 4. PSN 2 is dropped on the wire. Receiver gets PSN 3 and detects the PSN-2 gap, returns a NAK referencing PSN 2. Sender's NIC retransmits PSN 2 then PSN 3 (orange — go-back-N). Receiver returns a cumulative ACK for everything up to PSN 3. PSN 4 follows normally. Whole exchange happens between the two NICs without CPU involvement; after retry_count retransmits with no ACK, sender's WR completes with IBV_WC_RETRY_EXC_ERR.
PSN tracking, NAK on gap, retransmit — all in NIC hardware. Go-back-N: if PSN 2 was lost, *both* PSN 2 and PSN 3 get retransmitted. Coarse, which is why we tune PFC/ECN to make this almost never fire.

This is the InfiniBand transport doing what it was designed to do — running over wire that happens to be UDP/IP/Ethernet instead of native IB. Same PSN. Same AETH syndromes. Same NIC-side retransmit logic. Ethernet has no idea any of this is happening; it just shipped a NAK packet like any other UDP packet.
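
If it helps to see the receiver's decision as logic rather than prose, here is a toy model of the per-packet PSN check — not application code and not how any NIC is actually implemented, just the shape of the decision:

    #include <stdint.h>

    #define PSN_MASK 0xFFFFFF                 /* PSN is a 24-bit counter */

    enum rx_action { RX_ACCEPT, RX_NAK, RX_DUPLICATE };

    /* expected is the per-QP "next expected PSN" the receiver tracks. */
    static enum rx_action check_psn(uint32_t *expected, uint32_t psn)
    {
        if (psn == *expected) {                     /* in order: accept, advance */
            *expected = (*expected + 1) & PSN_MASK;
            return RX_ACCEPT;
        }
        if (((psn - *expected) & PSN_MASK) < (1u << 23))
            return RX_NAK;      /* gap ahead: NAK(*expected); sender goes back-N */
        return RX_DUPLICATE;    /* old PSN (already seen): drop or re-ACK        */
    }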


See it on the wire

Theory is one thing. Here's what it actually looks like when you tcpdump a real RoCE v2 host:

MODULE roce-v2 · LAB 2 — watch the recording or run the real environment in your browser.

Notice the udp port 4791 filter catches every RoCE v2 packet. The source-port field varies — that's the NIC's entropy for ECMP hashing across spine paths. ethtool -S | grep cnp shows DCQCN actively rate-limiting; adp_retrans = 0 confirms no drops triggered the IB retransmit safety net.


The mental model

If you remember nothing else from this section:

  1. RoCE v2 = IB's top two layers + Ethernet's bottom four. It isn't a new protocol; it's a reuse.
  2. The IB transport (BTH + RETH + AETH) is doing the smart work. Reliability, ordering, sequencing — all of that lives in the same bytes whether you're on IB or RoCE.
  3. The Ethernet underneath provides commodity plumbing. PFC + ECN exist because Ethernet lacks IB's link-layer credit-based flow control.
  4. One WR → many packets. NIC chunks based on MTU, each chunk gets a sequential PSN, first one carries the remote address. Receiver reassembles by offset.
  5. The CPU's involvement is one MMIO doorbell + (optionally) one CQ poll. The kernel isn't in the path. The receiver's app isn't even informed.

Next section: Linux for Network Engineers → — what the host has to look like for the NIC to do all this safely (IOMMU, hugepages, pinning, the kernel cmdline that has to be right).