Understanding RoCEv2
Every header, every bit, every protocol decision — explained. RoCEv2 takes the InfiniBand transport layer and runs it on UDP/IP/Ethernet so you get RDMA speeds on your existing datacenter fabric.
Why RoCEv2 exists
RDMA predates RoCEv2 — first on InfiniBand, a proprietary, expensive, dedicated fabric, then on Ethernet as RoCEv1, which was confined to a single L2 domain. RoCEv2 brought RDMA to standard, routed Ethernet networks by solving that routing problem.
| | InfiniBand | RoCEv1 | RoCEv2 |
|---|---|---|---|
| Network | Proprietary IB fabric | Ethernet | Ethernet |
| Routing | LRH / GRH headers | Ethernet only (L2) | UDP/IP — L3 routable! |
| Layer | L2 (subnet) + L3 (inter-subnet) | L2 only — not IP-routable | L3 — crosses subnets |
| Cost | High — dedicated HCAs, switches | Low — uses existing switches | Low — standard switches |
| Latency | ~1 µs | ~2–5 µs | ~2–5 µs |
| Congestion | Credit-based flow control | PFC (Priority Flow Control) | PFC + DCQCN (ECN-based) |
| Used by | HPC clusters, supercomputers | Storage, some cloud | AI/ML, cloud, storage, HPC |
The key RoCEv1 → RoCEv2 change: RoCEv1 used a GRH (Global Routing Header) from InfiniBand — not IP. So it couldn't cross IP router boundaries. RoCEv2 replaced GRH with a real UDP/IP header, making it fully routable across datacenter networks. Same InfiniBand transport layer on top, standard IP underneath.
Protocol stack — every layer
RoCEv2 is not a new protocol from scratch — it's the InfiniBand transport layer encapsulated inside UDP, inside IP, inside Ethernet. Standard, commodity hardware handles every layer below the transport.
Packet anatomy — every byte
Every RoCEv2 packet nests its headers in a fixed order: Ethernet (14 B) → IP (20 B for IPv4) → UDP (8 B) → BTH (12 B) → optional extended headers → payload → ICRC (4 B). Which extended headers follow the BTH depends on the packet type — a WRITE carries a RETH, an ACK an AETH, a UD SEND a DETH, and a CNP a 16-byte zeroed payload.
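As a concrete sketch, here is how the two most important headers look as packed C structs — the struct and field names are illustrative, and every multi-byte field is big-endian on the wire:

```c
#include <stdint.h>

/* Base Transport Header (BTH) — 12 bytes, present in every RoCEv2 packet. */
struct bth {
    uint8_t  opcode;   /* service type + operation (see the opcode table) */
    uint8_t  flags;    /* SE, M (migration), 2-bit pad count, 4-bit TVer */
    uint16_t pkey;     /* partition key */
    uint32_t dest_qp;  /* 8 reserved bits + 24-bit destination QPN */
    uint32_t psn;      /* ack-request bit + 7 reserved bits + 24-bit PSN */
} __attribute__((packed));

/* RDMA Extended Transport Header (RETH) — 16 bytes, WRITE/READ only. */
struct reth {
    uint64_t vaddr;    /* remote virtual address to read or write */
    uint32_t rkey;     /* remote key authorizing the access */
    uint32_t dma_len;  /* total transfer length in bytes */
} __attribute__((packed));
```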
BTH opcodes — the essential list
The BTH opcode byte encodes both the service type (top 3 bits: RC=000, UC=001, RD=010, UD=011) and the operation (bottom 5 bits). The combination tells the RNIC which extra headers follow the BTH and what to do with the packet.
| Opcode | Service | Operation name | Extra headers after BTH | Notes |
|---|---|---|---|---|
| 0x04 | RC | SEND ONLY | none (+ ImmDt if immediate) | Single-packet SEND |
| 0x00 | RC | SEND FIRST | none | First packet of multi-packet SEND |
| 0x01 | RC | SEND MIDDLE | none | Middle packets |
| 0x02 | RC | SEND LAST | none | Last packet of SEND |
| 0x0A | RC | RDMA WRITE ONLY | RETH (16B) | Single-packet WRITE — most common |
| 0x06 | RC | RDMA WRITE FIRST | RETH (16B) | First packet of multi-packet WRITE; VA+R_Key+len here only |
| 0x07 | RC | RDMA WRITE MIDDLE | none | Continuation; address auto-increments |
| 0x08 | RC | RDMA WRITE LAST | none | No RETH; remote knows from context |
| 0x0C | RC | RDMA READ REQUEST | RETH (16B) | Sent by initiator; remote RNIC sends data back |
| 0x0D | RC | RDMA READ RESPONSE FIRST | AETH (4B) | First packet of read response |
| 0x10 | RC | RDMA READ RESPONSE ONLY | AETH (4B) | Single-packet read response |
| 0x11 | RC | ACKNOWLEDGE | AETH (4B) | ACK or NAK. Syndrome byte tells which. |
| 0x12 | RC | ATOMIC ACK | AETH (4B) + AtomicAckETH (8B) | Response to CAS/FETCH_ADD |
| 0x13 | RC | CMP AND SWAP | AtomicETH (28B) | Compare-and-swap — hardware atomic |
| 0x14 | RC | FETCH AND ADD | AtomicETH (28B) | Atomic add — hardware atomic |
| 0x64 | UD | SEND ONLY | DETH (8B) | UD only has SEND — no WRITE/READ |
| 0x65 | UD | SEND ONLY WITH IMMEDIATE | DETH (8B) + ImmDt (4B) | Carries extra 4B immediate value |
| 0x81 | CNP | CONGESTION NOTIFICATION | CNP payload (16B zeroed) | RoCEv2 specific. Receiver sends to sender on ECN mark. |
| 0x2A | UC | RDMA WRITE ONLY | RETH (16B) | Like RC WRITE but no ACK |
Opcode anatomy: take 0x0A = 0b00001010. Top 3 bits = 000 = RC service. Bottom 5 bits = 01010 = RDMA WRITE ONLY operation. The RNIC decodes these to know: "RC connection, single-packet WRITE, expect RETH header next, then payload, then ICRC."
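A minimal decoder for this split might look like the following C sketch (the service/operation bit split is from the table above; the helper name is mine):

```c
#include <stdint.h>
#include <stdio.h>

/* Top 3 bits of the BTH opcode select the service type;
   the bottom 5 bits select the operation within that service. */
static const char *service_name(uint8_t opcode)
{
    switch (opcode >> 5) {
    case 0x0: return "RC";
    case 0x1: return "UC";
    case 0x2: return "RD";
    case 0x3: return "UD";
    case 0x4: return "CNP";   /* RoCEv2-specific: 0x81 has top bits 100 */
    default:  return "reserved";
    }
}

int main(void)
{
    uint8_t op = 0x0A;        /* RC RDMA WRITE ONLY */
    printf("service=%s operation=0x%02X\n", service_name(op), op & 0x1F);
    return 0;                 /* prints: service=RC operation=0x0A */
}
```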
Lossless Ethernet and PFC
RDMA's RC mode handles packet loss with retransmit — but retransmit in RDMA is expensive: RC uses go-back-N, so one loss forces the sender to replay everything from the lost PSN onward, wasting bandwidth and stalling the pipeline. RoCEv2 is therefore typically run over lossless Ethernet, achieved with PFC.
Why "lossless"? Standard Ethernet drops packets when a switch buffer overflows (tail-drop or RED). For bulk TCP, that's fine — TCP handles it. For RDMA RC, a single dropped packet triggers a retransmit of potentially the entire message window. At 100Gbps this costs milliseconds — killing the latency advantage.
PFC — Priority Flow Control (802.1Qbb)
PFC adds backpressure to Ethernet. When a switch's ingress buffer for a specific priority queue exceeds a threshold, it sends a PAUSE frame upstream on that link, telling the sender to stop. The upstream NIC pauses within ~one frame time.
PFC PAUSE frame format
| Destination MAC | 01:80:C2:00:00:01 (multicast — link-local) |
| EtherType | 0x8808 (MAC Control) |
| Opcode | 0x0101 (PFC PAUSE) |
| Priority enable vector | 8 bits — which priorities to pause (e.g., bit 3 = priority 3) |
| Quanta per priority | 512-bit time units to pause; 0xFFFF = maximum. A follow-up frame with quanta = 0 acts as the resume. |
PFC operates per-priority. RoCEv2 traffic is typically placed on one 802.1p priority (often 3 or 5). PFC only pauses that priority — other traffic (management, storage) keeps flowing. This avoids the head-of-line blocking of classic Ethernet PAUSE which stops all traffic.
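To make the frame layout concrete, here is a hedged C sketch that builds the PAUSE frame described above — the struct and helper names are mine, and real frames are padded to the 64-byte Ethernet minimum:

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/* 802.1Qbb PFC PAUSE frame, per the table above. Big-endian on the wire. */
struct pfc_pause {
    uint8_t  dst_mac[6];    /* 01:80:C2:00:00:01 — link-local multicast */
    uint8_t  src_mac[6];
    uint16_t ethertype;     /* 0x8808 — MAC Control */
    uint16_t opcode;        /* 0x0101 — PFC PAUSE */
    uint16_t prio_enable;   /* low 8 bits: one enable bit per priority */
    uint16_t quanta[8];     /* pause time per priority, 512-bit units */
} __attribute__((packed));

/* Pause priority 3 for the maximum time; other priorities keep flowing. */
static void make_pause_prio3(struct pfc_pause *f, const uint8_t src[6])
{
    static const uint8_t dst[6] = {0x01, 0x80, 0xC2, 0x00, 0x00, 0x01};

    memset(f, 0, sizeof(*f));
    memcpy(f->dst_mac, dst, 6);
    memcpy(f->src_mac, src, 6);
    f->ethertype   = htons(0x8808);
    f->opcode      = htons(0x0101);
    f->prio_enable = htons(1u << 3);  /* pause priority 3 only */
    f->quanta[3]   = htons(0xFFFF);   /* resend with quanta=0 to resume */
}
```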
DCQCN — congestion control
PFC prevents loss but creates "pause storms" if overused. DCQCN (Data Center Quantized Congestion Notification) is a rate-based algorithm that reacts to congestion before queues fill — avoiding PFC entirely in steady state.
With PFC alone: queue fills → PAUSE sent → upstream pauses → congestion propagates backward through the fabric ("pause storm"). Simple, but it creates head-of-line blocking and kills throughput for unrelated flows.
With DCQCN: the switch marks ECN bits before the queue fills → receiver generates a CNP → sender reduces its rate → congestion resolves before a PAUSE is ever needed. Flows self-regulate, with much lower latency impact.
The CNP packet — Congestion Notification Packet
| BTH Opcode | 0x81 — special CNP opcode. Uses the same QP as the data flow. |
| Direction | Receiver → Sender (reverse of data flow) |
| Payload | 16 bytes, all zero (source QPN in BTH tells the sender which flow) |
| Rate reduction | Sender reduces current rate by a multiplicative factor (typ. ×0.5 or configured) |
| Rate recovery | Additive increase after a quiet period. Uses byte counter + timer. |
| CNP generation rate | At most once per 50µs per flow — coalesced to prevent CNP storms |
ECN bits in IP header: the 2 ECN bits (the low bits of the IP DSCP+ECN byte) have 4 states: 00 = not ECN-capable, 01 = ECT(1), 10 = ECT(0), 11 = CE (Congestion Experienced). RoCEv2 senders mark packets as ECT(0) or ECT(1). When a switch's queue hits the ECN threshold, it changes ECT → CE. The receiver sees CE and generates a CNP.
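The sender-side reaction can be sketched roughly as below — this follows the DCQCN structure described above (multiplicative decrease on CNP, additive recovery after a quiet period), but the constants and names are illustrative, not vendor defaults:

```c
/* Per-flow DCQCN state: current rate, remembered target, congestion estimate. */
struct dcqcn_flow {
    double rate;    /* current sending rate, Gbps */
    double target;  /* rate remembered at the last cut */
    double alpha;   /* congestion estimate in [0, 1] */
};

#define DCQCN_G       0.0625  /* alpha gain (illustrative) */
#define DCQCN_AI_GBPS 0.5     /* additive-increase step (illustrative) */

/* A CNP arrived (receiver saw a CE mark): cut multiplicatively. */
void on_cnp(struct dcqcn_flow *f)
{
    f->alpha  = (1.0 - DCQCN_G) * f->alpha + DCQCN_G;
    f->target = f->rate;
    f->rate  *= 1.0 - f->alpha / 2.0;   /* approaches x0.5 as alpha -> 1 */
}

/* Quiet period elapsed (byte counter or timer): recover toward target. */
void on_recovery_tick(struct dcqcn_flow *f)
{
    f->alpha  *= 1.0 - DCQCN_G;                /* decay congestion estimate */
    f->target += DCQCN_AI_GBPS;                /* additive increase */
    f->rate    = (f->rate + f->target) / 2.0;  /* converge toward target */
}
```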
Connection setup — CM handshake
Before any RDMA operation, the two sides must exchange Queue Pair Numbers (QPNs) and create a connected QP. This uses the RDMA CM (Connection Manager) — a user-space library that runs its signalling over UD QPs (or IP sockets for RoCEv2).
Step 1 — Server binds and listens
Server application calls rdma_bind_addr() and rdma_listen() on the RDMA CM. This creates a CM ID and registers the server's address. Under the hood, CM uses a UD QP (or IP socket for RoCEv2) to receive connection requests.
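A minimal sketch of this step with librdmacm (error handling omitted; the port number is an arbitrary application choice):

```c
#include <arpa/inet.h>
#include <rdma/rdma_cma.h>
#include <string.h>

int cm_listen(void)
{
    struct rdma_event_channel *ec = rdma_create_event_channel();
    struct rdma_cm_id *listener;
    struct sockaddr_in addr;

    /* Create the CM ID — the connection-manager analogue of a socket. */
    rdma_create_id(ec, &listener, NULL, RDMA_PS_TCP);

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port   = htons(7471);   /* example port chosen by the app */
    rdma_bind_addr(listener, (struct sockaddr *)&addr);
    rdma_listen(listener, 8);

    /* Connection requests now arrive on ec as RDMA_CM_EVENT_CONNECT_REQUEST
       events, retrieved with rdma_get_cm_event(). */
    return 0;
}
```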
Unlike TCP, where connections are cheap and ephemeral, RDMA connection setup is rare: QPs are expensive to create (a kernel call plus hardware resources), so they're kept alive for the session. Most RDMA apps create QPs at startup and reuse them for the entire job lifetime — hours or days.
ECMP and multipath routing
RoCEv2's use of UDP enables ECMP (Equal-Cost Multi-Path) load balancing. Switches hash the 5-tuple (src IP, dst IP, src port, dst port, protocol) to pick a path. Since RoCEv2's dst port is always 4791, the UDP source port is the key differentiator.
How the RNIC sets the UDP source port
| Goal | Spread flows across all available ECMP paths |
| Input | Hash of: src IP, dst IP, src QPN, dst QPN (Mellanox ConnectX default) |
| Output range | 0xC000–0xFFFF (top 16K ephemeral range, per vendor) |
| Packet ordering | RoCEv2 spec: packets with same src+dst+UDP src port MUST NOT reorder |
| Multi-path caution | Reordering across paths = NACK storm. ECMP relies on consistent hashing. |
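An illustrative sketch of such a hash — the mixing function is mine and real RNICs use vendor-specific hashes, but the contract is the same: deterministic per QP pair (so a flow never reorders) and well spread across the 0xC000–0xFFFF range:

```c
#include <stdint.h>

/* Derive a UDP source port in 0xC000-0xFFFF from the QP pair. The same
   pair always yields the same port, keeping one flow on one ECMP path. */
uint16_t roce_udp_src_port(uint32_t src_qpn, uint32_t dst_qpn)
{
    uint32_t h = src_qpn ^ (dst_qpn * 2654435761u);  /* multiplicative mix */
    h ^= h >> 16;                                    /* fold in high bits */
    return (uint16_t)(0xC000u | (h & 0x3FFFu));      /* top-16K range */
}
```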
Jumbo frames: RoCEv2 works best with MTU 9000 (jumbo frames). Larger MTU = fewer packets per message = lower overhead. RoCE quantizes its path MTU to powers of two, so a 1500 B Ethernet MTU yields 1024 B of payload per packet while 9000 B yields 4096 B — a 4 MB RDMA WRITE becomes 4096 packets instead of 1024, meaning 4× more PSN tracking and 4× more ACKs. Always enable jumbo frames for RoCEv2.
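The arithmetic, as a runnable check (payload sizes per the power-of-two quantization just described):

```c
#include <stdio.h>

int main(void)
{
    const unsigned msg = 4u * 1024 * 1024;   /* 4 MB RDMA WRITE */
    const unsigned mtus[] = {1024, 4096};    /* payload per packet */

    for (int i = 0; i < 2; i++)
        printf("payload %u B -> %u packets\n",
               mtus[i], (msg + mtus[i] - 1) / mtus[i]);
    return 0;  /* 4096 packets at 1024 B vs 1024 packets at 4096 B */
}
```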
RoCEv2 cheatsheet
Key values to memorize
| UDP dst port | 4791 (0x12B7) |
| IP protocol | 17 (UDP) |
| ICRC size | 4 bytes (always last) |
| BTH size | 12 bytes (always present) |
| RETH size | 16 bytes (WRITE/READ only) |
| AETH size | 4 bytes (ACK/NAK only) |
| DETH size | 8 bytes (UD only) |
| CNP opcode | 0x81 |
| Recommended MTU | 9000 (jumbo frames) |
| ECN: ECT(0) | IP ECN bits = 10 |
| ECN: CE mark | IP ECN bits = 11 |
| CNP rate limit | 1 per 50µs per flow |
AETH syndrome codes
| 0x00–0x1F | ACK — successful delivery (low 5 bits carry the credit count) |
| 0x20–0x3F | RNR NAK — receiver not ready (no RECV posted; low 5 bits = retry timer) |
| 0x60 | NAK — PSN sequence error |
| 0x61 | NAK — invalid request |
| 0x62 | NAK — remote access error (bad R_Key) |
| 0x63 | NAK — remote operation error |
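A decode sketch matching the table above — bits [6:5] of the syndrome select the class, bits [4:0] carry the credits, timer, or NAK code; the helper name is mine:

```c
#include <stdint.h>

/* Classify an 8-bit AETH syndrome. The low 5 bits (syndrome & 0x1F)
   carry credits for ACK, a retry timer for RNR NAK, or the NAK code. */
const char *aeth_class(uint8_t syndrome)
{
    switch ((syndrome >> 5) & 0x3) {
    case 0x0: return "ACK";       /* 0x00-0x1F */
    case 0x1: return "RNR NAK";   /* 0x20-0x3F */
    case 0x3: return "NAK";       /* 0x60-0x63 */
    default:  return "reserved";
    }
}
```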
RoCEv1 vs RoCEv2
| Layer 3 | RoCEv1: GRH (IB header) / RoCEv2: UDP/IP |
| Routable | RoCEv1: L2 only / RoCEv2: L3 ✓ |
| ECMP | RoCEv1: no / RoCEv2: yes (UDP src port) |
| Wireshark filter | udp.port == 4791 |
| EtherType (v1) | 0x8915 (InfiniBand over Ethernet) |