How a RoCE v2 Transaction Actually Flows
The previous two pages explained the pieces — that RoCE v2 is IB's transport on Ethernet, and that PFC + ECN + DCQCN exist to make Ethernet lossless enough.
But what does it actually look like when one ibv_post_send() becomes 256 packets on the wire and lands as bytes in remote memory? This page connects the dots.
What's borrowed from where
The single most important thing to internalize: RoCE v2 didn't invent a new RDMA protocol. It copy-pasted InfiniBand's top two layers onto Ethernet's bottom four layers. Same verbs API. Same transport. Different wire underneath.
Read this diagram once and the rest of the page falls into place. The middle column isn't a new design — it's a recombination.
What each piece contributes:
- From InfiniBand: the API (`ibv_post_send` and friends), the asynchronous work-request model, and the transport headers — BTH (Base Transport Header), RETH (RDMA Extended Transport Header), AETH (ACK Extended Transport Header). These bytes are byte-for-byte identical between IB and RoCE v2 — that's why an app moves between them without changing a line of code.
- From Ethernet / IP: physical layer, MAC, IP routing (BGP, ECMP, the whole leaf-spine you already know), and UDP demultiplexing (destination port `4791` identifies RoCE v2 to the NIC).
- The patch: PFC + ECN bolted onto Ethernet to substitute for what IB's link-layer credit-based flow control did for free.
The reliability split
This is the part that confuses people: if Ethernet drops packets, and RoCE v2 runs on Ethernet, what makes RDMA reliable?
Answer: InfiniBand's transport layer carries the reliability — even when it rides on Ethernet.
- PSN (24-bit Packet Sequence Number) is in every BTH header. The receiver tracks the next expected PSN per QP.
- ACK (in AETH) tells the sender "everything up to PSN N arrived." Cumulative, like TCP.
- NAK (in AETH) tells the sender "I expected PSN X, got PSN Y — gap, retransmit." The sender retransmits from X onward (go-back-N).
- Retransmit timer in NIC hardware fires if no ACK arrives within `local_ack_timeout`. After `retry_count` retransmits, the WR completes with `IBV_WC_RETRY_EXC_ERR`.
All of this lives in the IB transport bytes — the same BTH/AETH that native IB uses. The Ethernet layer doesn't know any of this is happening; it's just shipping packets.
The deal RoCE v2 makes with Ethernet: "I'll handle reliability in my transport (IB's job). You don't lose too many packets (PFC + ECN's job). If you do lose one, my retransmit safety net catches it — but it's coarse-grained and slow, so we both prefer you don't drop."
Anatomy of a transaction — ibv_post_send to bytes in memory
Now the full picture. App calls ibv_post_send with one WR — opcode RDMA_WRITE, length 1 MB, remote address R, rkey K. What happens between that call and the receiver having 1 MB of new data?
The four phases in detail:
Phase A: One WR, one DMA, one doorbell
The app's call to ibv_post_send does exactly two things at the hardware level:
- Writes the WR structure into the QP's send queue (in user-space memory shared with the NIC — no kernel call). The WR is ~64 bytes: opcode, addresses, lengths, keys.
- Rings the doorbell — a single 64-bit MMIO write to a NIC register telling the NIC "new work in this QP."
That's it. The CPU's part is done. From here, the NIC owns the transaction.
Phase B: Chunking is the NIC's job
The NIC reads the WR, sees length = 1 MB, sees the QP's negotiated path MTU = 4096, and decides: 256 packets. Each gets a sequential PSN. Each gets a BTH header with the destination QP's number on the remote NIC. The first one also gets a RETH header carrying the remote address + rkey — so the receiver knows where to write — and the total length.
| Position | Opcode | What's in the headers | Payload |
|---|---|---|---|
| First (PSN 0) | RDMA_WRITE_FIRST | BTH + RETH (raddr, rkey, length) | 4096 B |
| Middle (PSN 1..254) | RDMA_WRITE_MIDDLE | BTH only | 4096 B |
| Last (PSN 255) | RDMA_WRITE_LAST | BTH with AckReq flag set | 4096 B |
Why only the first packet carries RETH? Bandwidth. RETH is 16 bytes — including it on every chunk would burn ~4 GB of header overhead per TB transferred. The receiver's NIC remembers `(raddr, rkey, base_PSN)` from the first packet and computes the write address for each subsequent chunk as `raddr + (PSN − base_PSN) × MTU`.
Phase C: 256 packets on the wire
Each chunk is wrapped in UDP/IP/Ethernet and shot onto the wire. The wire format:
[ Ethernet 14 B | IP 20 B | UDP 8 B (dst=4791) | BTH 12 B | (RETH 16 B, first only) | payload | ICRC 4 B ]
Things to notice on the wire:
- Switches don't know it's RDMA. From the switch's view this is normal UDP traffic with dst port 4791. ECMP hashing uses the UDP src port (which NCCL/the NIC varies for entropy) to spread chunks across spine paths.
- The priority class comes from the QoS configuration: typically DSCP 26 maps to traffic class 3, on which PFC + ECN are active. That's the "no-drop" priority.
- Wire time: ~21 µs of serialization for 1 MB at 400 Gbps, plus a couple of switch hops each way and the ACK round-trip.
Phase D: Receiver reassembles, ACK closes the loop
The receiver NIC processes each packet in hardware:
- Validates the rkey — does this remote key allow WRITE access to this MR?
- Checks the PSN is the next expected one. If yes, proceed. If gap, fire NAK.
- Computes the destination address: `raddr + (PSN − base_PSN) × MTU`.
- DMAs the 4 KB payload into the right offset in the receiver's memory region.
- On the last packet (`AckReq` flag set), sends one ACK packet back.
Crucially: the receiver's CPU is never told this happened. No CQE on the receiver's CQ. The bytes simply appear at address R. (For applications that need notification, there's RDMA_WRITE_WITH_IMM — see the RDMA section.)
The ACK on the way back is tiny — just BTH + AETH, ~50 bytes total. The sender NIC sees it, marks the WR as complete, and posts a CQE to the local CQ. The app's next ibv_poll_cq sees IBV_WC_SUCCESS and considers the WRITE done.
One WR in. One CQE out. 256 packets in flight in between. ~25 µs. That's the whole transaction.
What if a packet is lost?
This is where the IB transport's reliability machinery earns its keep. PFC and ECN exist precisely so this doesn't happen — but when it does: the receiver's next-expected-PSN check detects the gap and returns a NAK carrying the PSN it expected; the sender's NIC rewinds and retransmits from that PSN onward (go-back-N); and if the loss swallowed the NAK or the final ACK, the hardware retransmit timer fires after `local_ack_timeout`.
This is the InfiniBand transport doing what it was designed to do — running over wire that happens to be UDP/IP/Ethernet instead of native IB. Same PSN. Same AETH syndromes. Same NIC-side retransmit logic. Ethernet has no idea any of this is happening; it just shipped a NAK packet like any other UDP packet.
See it on the wire
Theory is one thing. Here's what it actually looks like when you tcpdump a real RoCE v2 host:
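A minimal recipe (commands only; `eth0` is a placeholder interface, and counter names like the `cnp` and `adp_retrans` families are NIC-vendor specific):

```shell
# Capture RoCE v2 traffic: every packet is UDP with destination port 4791.
tcpdump -i eth0 -nn -c 20 udp port 4791

# DCQCN activity: CNPs sent/handled mean ECN marks are being acted on.
ethtool -S eth0 | grep cnp

# IB-transport retransmits: non-zero means the safety net actually fired.
ethtool -S eth0 | grep retrans
```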
Notice that the `udp port 4791` filter catches every RoCE v2 packet. The source-port field varies — that's the NIC's entropy for ECMP hashing across spine paths. `ethtool -S | grep cnp` shows DCQCN actively rate-limiting; `adp_retrans = 0` confirms no drops triggered the IB retransmit safety net.
The mental model
If you remember nothing else from this section:
- RoCE v2 = IB's top two layers + Ethernet's bottom four. It isn't a new protocol; it's a reuse.
- The IB transport (BTH + RETH + AETH) is doing the smart work. Reliability, ordering, sequencing — all of that lives in the same bytes whether you're on IB or RoCE.
- The Ethernet underneath provides commodity plumbing. PFC + ECN exist because Ethernet lacks IB's link-layer credit-based flow control.
- One WR → many packets. NIC chunks based on MTU, each chunk gets a sequential PSN, first one carries the remote address. Receiver reassembles by offset.
- The CPU's involvement is one MMIO doorbell + (optionally) one CQ poll. The kernel isn't in the path. The receiver's app isn't even informed.
Next section: Linux for Network Engineers → — what the host has to look like for the NIC to do all this safely (IOMMU, hugepages, pinning, the kernel cmdline that has to be right).