
Understanding RoCEv2

Every header, every bit, every protocol decision — explained. RoCEv2 takes the InfiniBand transport layer and runs it on UDP/IP/Ethernet so you get RDMA speeds on your existing datacenter fabric.

UDP DST PORT: 4791 — IANA-assigned. Always.
WIRE OVERHEAD: 74 B — Eth + IP + UDP + BTH + RETH + ICRC
TRANSPORT: UDP — IP proto 17. No TCP.
CONGESTION: DCQCN — ECN + CNP packets

Why RoCEv2 exists

RDMA existed before RoCEv2, but only on InfiniBand — a proprietary, expensive, dedicated fabric. RoCEv2 brought RDMA to standard Ethernet switches by solving the routing problem.

● InfiniBand
Network: Proprietary IB fabric
Routing: LRH / GRH headers
Layer: L2 (subnet) + L3 (inter-subnet)
Cost: High — dedicated HCAs, switches
Latency: ~1 µs
Congestion: Credit-based flow control
Used by: HPC clusters, supercomputers

● RoCEv1
Network: Ethernet
Routing: Ethernet only (L2)
Layer: L2 only — not IP-routable
Cost: Low — uses existing switches
Latency: ~2–5 µs
Congestion: PFC (Priority Flow Control)
Used by: Storage, some cloud

● RoCEv2 ← today
Network: Ethernet
Routing: UDP/IP — L3 routable!
Layer: L3 — crosses subnets
Cost: Low — standard switches
Latency: ~2–5 µs
Congestion: PFC + DCQCN (ECN-based)
Used by: AI/ML, cloud, storage, HPC

The key RoCEv1 → RoCEv2 change: RoCEv1 used a GRH (Global Routing Header) from InfiniBand — not IP. So it couldn't cross IP router boundaries. RoCEv2 replaced GRH with a real UDP/IP header, making it fully routable across datacenter networks. Same InfiniBand transport layer on top, standard IP underneath.

Protocol stack — every layer

RoCEv2 is not a new protocol from scratch — it's the InfiniBand transport layer encapsulated inside UDP inside IP inside Ethernet. Standard chips at every layer.

RoCEv2 full stack — top to bottom
Application — any language, user space
ibverbs / rdma-core — user-space library, no kernel involvement
IB Transport Layer (BTH) — PSN, QPN, ACK, reliability, handled in the RNIC
UDP (dst port 4791) — enables routing; src port = ECMP hash
IPv4 / IPv6 (ECN capable) — DSCP for QoS, ECN bits for DCQCN
Ethernet II (with PFC) — 802.1Qbb PFC for lossless delivery
25G / 100G / 400G Ethernet

The RNIC handles in hardware: ibverbs verb processing, BTH generation + PSN tracking, UDP + IP header generation, DMA to/from app memory, ACK/NACK generation, ICRC computation, PFC PAUSE frame response, DCQCN rate control, and R_Key/L_Key validation. The CPU stays free during all of this. The OS/kernel only touches QP creation, memory registration setup, and CQ events.
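To make "no kernel involvement" concrete, here is a minimal sketch of posting a one-sided RDMA WRITE through ibverbs. The verbs calls (ibv_post_send, the ibv_sge and ibv_send_wr structures) are real rdma-core API; the helper function, its name, and the assumption of an already-connected QP and registered MR are ours for illustration.

    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <string.h>

    /* Sketch: post a one-sided RDMA WRITE on an already-connected RC QP.
     * qp and mr come from setup; remote_addr/rkey were exchanged out of band. */
    int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr,
                        void *local_buf, uint32_t len,
                        uint64_t remote_addr, uint32_t rkey)
    {
        struct ibv_sge sge = {
            .addr   = (uint64_t)(uintptr_t)local_buf,
            .length = len,
            .lkey   = mr->lkey,              /* local key from ibv_reg_mr() */
        };
        struct ibv_send_wr wr, *bad_wr = NULL;

        memset(&wr, 0, sizeof(wr));
        wr.opcode              = IBV_WR_RDMA_WRITE;  /* single packet: BTH opcode 0x0A */
        wr.sg_list             = &sge;
        wr.num_sge             = 1;
        wr.send_flags          = IBV_SEND_SIGNALED;  /* generate a CQ entry when done */
        wr.wr.rdma.remote_addr = remote_addr;        /* RNIC puts this in the RETH */
        wr.wr.rdma.rkey        = rkey;               /* R_Key, also in the RETH */

        /* User-space doorbell into the RNIC: no kernel call on the data path. */
        return ibv_post_send(qp, &wr, &bad_wr);
    }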

Packet anatomy — every byte

A single-packet RDMA WRITE on the wire: Ethernet II (14 B), IPv4 (20 B), UDP (8 B), BTH (12 B), RETH (16 B), payload, ICRC (4 B). That is the 74 bytes of header overhead from the summary above. An ACK swaps the RETH for a 4 B AETH, a UD SEND carries an 8 B DETH instead, and a CNP carries a 16-byte zeroed payload.
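For reference, the 12-byte BTH as a C struct. This is a sketch: the field widths follow the published InfiniBand layout, but real code serializes explicitly rather than trusting struct layout and host byte order.

    #include <stdint.h>

    /* Base Transport Header (BTH): 12 bytes, big-endian on the wire. */
    struct bth {
        uint8_t  opcode;       /* e.g. 0x0A = RC RDMA WRITE ONLY          */
        uint8_t  flags;        /* SE(1) | MigReq(1) | PadCnt(2) | TVer(4) */
        uint16_t pkey;         /* partition key                           */
        uint8_t  reserved;
        uint8_t  dest_qp[3];   /* destination QPN, 24 bits                */
        uint8_t  ack_psn[4];   /* AckReq(1) | reserved(7) | PSN(24)       */
    };  /* next: RETH / AETH / DETH / AtomicETH depending on the opcode */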

BTH opcodes — the full list

The BTH opcode byte encodes both the service type (top 3 bits: RC=000, UC=001, RD=010, UD=011) and the operation (bottom 5 bits). The combination tells the RNIC which extra headers follow the BTH and what to do with the packet.

Opcode | Service | Operation name | Extra headers after BTH | Notes
0x04 | RC | SEND ONLY | none (+ ImmDt if immediate) | Single-packet SEND
0x00 | RC | SEND FIRST | none | First packet of multi-packet SEND
0x01 | RC | SEND MIDDLE | none | Middle packets
0x02 | RC | SEND LAST | none | Last packet of SEND
0x0A | RC | RDMA WRITE ONLY | RETH (16B) | Single-packet WRITE — most common
0x06 | RC | RDMA WRITE FIRST | RETH (16B) | First packet of multi-packet WRITE; VA+R_Key+len here only
0x07 | RC | RDMA WRITE MIDDLE | none | Continuation; address auto-increments
0x08 | RC | RDMA WRITE LAST | none | No RETH; remote knows from context
0x0C | RC | RDMA READ REQUEST | RETH (16B) | Sent by initiator; remote RNIC sends data back
0x0D | RC | RDMA READ RESPONSE FIRST | AETH (4B) | First packet of read response
0x10 | RC | RDMA READ RESPONSE ONLY | AETH (4B) | Single-packet read response
0x11 | RC | ACKNOWLEDGE | AETH (4B) | ACK or NAK. Syndrome byte tells which.
0x12 | RC | ATOMIC ACK | AETH (4B) + AtomicAckETH (8B) | Response to CAS/FETCH_ADD
0x13 | RC | CMP AND SWAP | AtomicETH (28B) | Compare-and-swap — hardware atomic
0x14 | RC | FETCH AND ADD | AtomicETH (28B) | Atomic add — hardware atomic
0x64 | UD | SEND ONLY | DETH (8B) | UD only has SEND — no WRITE/READ
0x65 | UD | SEND ONLY WITH IMMEDIATE | DETH (8B) + ImmDt (4B) | Carries extra 4B immediate value
0x81 | CNP | CONGESTION NOTIFICATION | CNP payload (16B zeroed) | RoCEv2 specific. Receiver sends to sender on ECN mark.
0x2A | UC | RDMA WRITE ONLY | RETH (16B) | Like RC WRITE but no ACK

Opcode anatomy: Take 0x0A = 0b00001010. Top 3 bits = 000 = RC service. Bottom 5 bits = 01010 = RDMA WRITE ONLY operation. The RNIC decodes these to know: "RC connection, single-packet WRITE, expect RETH header next, then payload, then ICRC."
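A minimal decoder for that split (hypothetical helper, but the bit arithmetic is exactly the split described above):

    #include <stdint.h>
    #include <stdio.h>

    /* Top 3 bits of the BTH opcode select the transport service. */
    static const char *bth_service(uint8_t opcode)
    {
        switch (opcode >> 5) {
        case 0: return "RC";
        case 1: return "UC";
        case 2: return "RD";
        case 3: return "UD";
        default: return "other";   /* 0x81 = CNP lives up here */
        }
    }

    int main(void)
    {
        uint8_t op = 0x0A;  /* RC RDMA WRITE ONLY */
        /* prints: 0x0a -> service RC, operation 0x0a */
        printf("0x%02x -> service %s, operation 0x%02x\n",
               op, bth_service(op), op & 0x1F);
        return 0;
    }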

Lossless Ethernet and PFC

RDMA's RC mode handles packet loss with retransmit — but retransmit in RDMA is expensive. Each retransmit wastes bandwidth and stalls the pipeline. RoCEv2 is typically run over lossless Ethernet, achieved with PFC.

Why "lossless"? Standard Ethernet drops packets when a switch buffer overflows (tail-drop or RED). For bulk TCP, that's fine — TCP handles it. For RDMA RC, a single dropped packet triggers a retransmit of potentially the entire message window. At 100Gbps this costs milliseconds — killing the latency advantage.

PFC — Priority Flow Control (802.1Qbb)

PFC adds backpressure to Ethernet. When a switch's ingress buffer for a specific priority queue exceeds a threshold, it sends a PAUSE frame upstream on that link, telling the sender to stop. The upstream NIC pauses within ~one frame time.

PFC backpressure animation
[Animation: the sender RNIC posts WRs; the ToR switch's priority-3 ingress queue fills past its threshold and sends PAUSE (pri:3) upstream; the receiver keeps consuming data.]

PFC PAUSE frame format

Destination MAC: 01:80:C2:00:00:01 (multicast — link-local)
EtherType: 0x8808 (MAC Control)
Opcode: 0x0101 (PFC PAUSE)
Priority enable vector: 8 bits — which priorities to pause (e.g., bit 3 = priority 3)
Quanta per priority: 512-bit time units to pause. 0xFFFF = pause until RESUME (quanta=0)
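The same layout as a C struct, filled in with the example from the table (pause priority 3 at maximum quanta). The struct and helper names are ours for illustration:

    #include <stdint.h>
    #include <arpa/inet.h>   /* htons */

    /* 802.1Qbb PFC PAUSE frame, 34 bytes before padding/FCS (sketch). */
    struct pfc_pause {
        uint8_t  dst_mac[6];     /* always 01:80:C2:00:00:01           */
        uint8_t  src_mac[6];
        uint16_t ethertype;      /* 0x8808 = MAC Control               */
        uint16_t opcode;         /* 0x0101 = PFC PAUSE                 */
        uint16_t prio_enable;    /* low 8 bits: bitmap of paused prios */
        uint16_t quanta[8];      /* per-priority pause, 512-bit units  */
    };

    /* Pause priority 3 for the maximum time; quanta of 0 would resume. */
    void fill_pfc_pause_pri3(struct pfc_pause *f)
    {
        static const uint8_t mc[6] = {0x01, 0x80, 0xC2, 0x00, 0x00, 0x01};
        for (int i = 0; i < 6; i++)
            f->dst_mac[i] = mc[i];
        f->ethertype   = htons(0x8808);
        f->opcode      = htons(0x0101);
        f->prio_enable = htons(1u << 3);   /* bit 3 = priority 3 */
        f->quanta[3]   = htons(0xFFFF);    /* senders refresh before expiry */
    }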

PFC operates per-priority. RoCEv2 traffic is typically placed on one 802.1p priority (often 3 or 5). PFC only pauses that priority — other traffic (management, storage) keeps flowing. This avoids the head-of-line blocking of classic Ethernet PAUSE which stops all traffic.

DCQCN — congestion control

PFC prevents loss but creates "pause storms" if overused. DCQCN (Data Center Quantized Congestion Notification) is a rate-based algorithm that reacts to congestion before queues fill — avoiding PFC entirely in steady state.

● PFC alone

Queue fills → PAUSE sent → upstream pauses → congestion propagates backward through the fabric ("pause storm"). Simple but creates head-of-line blocking and kills throughput for unrelated flows.

● DCQCN (ECN + CNP)

Switch marks ECN bits before the queue fills → receiver generates CNP → sender reduces rate → congestion resolved before PAUSE needed. Flows self-regulate. Much lower latency impact.

DCQCN feedback loop
[Animation: a switch queue 25% full ECN-marks a passing packet; the receiver RNIC sees ECN=CE and sends a CNP back; the sender's rate drops, then recovers.]

The CNP packet — Congestion Notification Packet

BTH opcode: 0x81 — special CNP opcode. Uses the same QP as the data flow.
Direction: Receiver → Sender (reverse of the data flow)
Payload: 16 bytes, all zero (source QPN in BTH tells the sender which flow)
Rate reduction: sender reduces current rate by a multiplicative factor (typ. ×0.5 or configured)
Rate recovery: additive increase after a quiet period. Uses byte counter + timer.
CNP generation rate: at most once per 50 µs per flow — coalesced to prevent CNP storms
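The reduce/recover cycle from the table, as a simplified sketch. The parameter names (g, alpha) follow the published DCQCN algorithm; real RNICs expose these as vendor-specific tunables, and this omits the additive-increase step that raises the target rate after repeated quiet rounds.

    /* DCQCN sender-side reaction, simplified. */
    typedef struct {
        double current_rate;   /* as a fraction of line rate, 0.0 - 1.0 */
        double target_rate;    /* where recovery climbs back toward     */
        double alpha;          /* running congestion estimate, 0.0-1.0  */
    } dcqcn_flow;

    /* Called when a CNP arrives for this flow. */
    void dcqcn_on_cnp(dcqcn_flow *f, double g)
    {
        f->alpha        = (1.0 - g) * f->alpha + g;   /* congestion seen: raise alpha */
        f->target_rate  = f->current_rate;            /* remember the pre-cut rate    */
        f->current_rate = f->current_rate * (1.0 - f->alpha / 2.0); /* multiplicative cut */
    }

    /* Called when the quiet-period byte counter or timer fires (no recent CNP). */
    void dcqcn_on_recovery(dcqcn_flow *f, double g)
    {
        f->alpha        = (1.0 - g) * f->alpha;                     /* estimate decays */
        f->current_rate = (f->current_rate + f->target_rate) / 2.0; /* climb to target */
    }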

ECN bits in IP header: The 2 ECN bits (in the IP DSCP+ECN byte) have 4 states: 00=not ECN-capable, 01=ECT(1), 10=ECT(0), 11=CE (Congestion Experienced). RoCEv2 senders mark packets as ECT(0) or ECT(1). When a switch's queue hits the ECN threshold, it changes ECT→CE. The receiver sees CE and generates a CNP.
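The receiver-side check is just the bottom two bits of the TOS byte. A tiny sketch:

    #include <stdint.h>

    enum { ECN_NOT_ECT = 0x0, ECN_ECT1 = 0x1, ECN_ECT0 = 0x2, ECN_CE = 0x3 };

    /* tos = the IPv4 TOS byte: DSCP in the top 6 bits, ECN in the bottom 2. */
    int ecn_is_ce(uint8_t tos)
    {
        return (tos & 0x3) == ECN_CE;   /* 0b11: a switch marked us, emit a CNP */
    }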

Connection setup — CM handshake

Before any RDMA operation, the two sides must exchange Queue Pair Numbers (QPNs) and create a connected QP. This uses the RDMA CM (Connection Manager) — a user-space library that runs its signalling over UD QPs (or IP sockets for RoCEv2).


Step 1 — Server binds and listens

Server application calls rdma_bind_addr() and rdma_listen() on the RDMA CM. This creates a CM ID and registers the server's address. Under the hood, CM uses a UD QP (or IP socket for RoCEv2) to receive connection requests.

[Diagram: the server app calls rdma_listen() on port 18515; the CM waits for requests; the QP sits in the INIT state.]
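Step 1 as code, using the rdma_cm calls named above (error handling omitted; port 18515 as in the diagram):

    #include <rdma/rdma_cma.h>
    #include <netinet/in.h>
    #include <string.h>

    int server_listen(void)
    {
        struct rdma_event_channel *ec = rdma_create_event_channel();
        struct rdma_cm_id *listen_id;
        struct rdma_cm_event *event;
        struct sockaddr_in addr;

        rdma_create_id(ec, &listen_id, NULL, RDMA_PS_TCP);

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;          /* wildcard address */
        addr.sin_port   = htons(18515);
        rdma_bind_addr(listen_id, (struct sockaddr *)&addr);

        rdma_listen(listen_id, 8);          /* backlog of 8 pending requests */

        /* Blocks until a client's REQ arrives: RDMA_CM_EVENT_CONNECT_REQUEST. */
        rdma_get_cm_event(ec, &event);
        /* ... later steps: create the QP, rdma_accept(), RTR/RTS ... */
        rdma_ack_cm_event(event);
        return 0;
    }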

Unlike TCP, connection setup is rare. TCP connections are cheap and ephemeral. RDMA QPs are expensive to create (kernel call, hardware resources), so they're kept alive for the session. Most RDMA apps create QPs at startup and reuse them for the entire job lifetime — hours or days.

ECMP and multipath routing

RoCEv2's use of UDP enables ECMP (Equal-Cost Multi-Path) load balancing. Switches hash the 5-tuple (src IP, dst IP, src port, dst port, protocol) to pick a path. Since RoCEv2's dst port is always 4791, the UDP source port is the key differentiator.

ECMP flow distribution — two flows, different paths
[Diagram: Host A sends QP1 traffic with UDP src port 0xC001 and QP2 traffic with src port 0xC002, dst port always 4791; the ECMP hash steers QP1 via Spine 1 and QP2 via Spine 2 to Host B. Same RNIC, different paths per flow.]

How the RNIC sets the UDP source port

Goal: spread flows across all available ECMP paths
Input: hash of src IP, dst IP, src QPN, dst QPN (Mellanox ConnectX default)
Output range: 0xC000–0xFFFF (top 16K ephemeral range, per vendor)
Packet ordering: RoCEv2 spec: packets with same src+dst+UDP src port MUST NOT reorder
Multi-path caution: reordering across paths = NACK storm. ECMP relies on consistent hashing.
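A hypothetical sketch of that mapping: fold the four inputs into a hash, then pin the result into the 0xC000–0xFFFF range. The mixing function is a stand-in for the vendor's hash; the masking into a 14-bit offset is the part the table describes.

    #include <stdint.h>

    /* Stand-in mixing step; real RNICs use a vendor-specific hash. */
    static uint32_t mix(uint32_t h, uint32_t v)
    {
        h ^= v;
        return h * 2654435761u;   /* Knuth multiplicative constant */
    }

    /* Fold (src IP, dst IP, src QPN, dst QPN) into 0xC000-0xFFFF. */
    uint16_t roce_udp_src_port(uint32_t src_ip, uint32_t dst_ip,
                               uint32_t src_qpn, uint32_t dst_qpn)
    {
        uint32_t h = mix(mix(mix(mix(0x9E3779B9u, src_ip), dst_ip),
                             src_qpn), dst_qpn);
        return (uint16_t)(0xC000 | (h & 0x3FFF));  /* fixed top bits + 14 hash bits */
    }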

Jumbo frames: RoCEv2 works best with MTU 9000 (jumbo frames). Larger MTU = fewer packets per message = lower overhead. Standard 1500B MTU means a 4MB RDMA WRITE becomes ~2730 packets instead of ~450 — 6× more PSN tracking, 6× more ACKs. Always enable jumbo frames for RoCEv2.

RoCEv2 cheatsheet

Key values to memorize

UDP dst port: 4791 (0x12B7)
IP protocol: 17 (UDP)
ICRC size: 4 bytes (always last)
BTH size: 12 bytes (always present)
RETH size: 16 bytes (WRITE/READ only)
AETH size: 4 bytes (ACK/NAK only)
DETH size: 8 bytes (UD only)
CNP opcode: 0x81
Recommended MTU: 9000 (jumbo frames)
ECN ECT(0): IP ECN bits = 10
ECN CE mark: IP ECN bits = 11
CNP rate limit: 1 per 50 µs per flow

AETH syndrome codes

0x00–0x1F: ACK — successful delivery (low 5 bits encode flow-control credits)
0x20–0x3F: RNR NAK — receiver not ready (no RECV posted; low 5 bits encode the retry timer)
0x60: NAK — PSN sequence error
0x61: NAK — invalid request
0x62: NAK — remote access error (bad R_Key)
0x63: NAK — remote operational error
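A small classifier mirroring those ranges (illustrative helper):

    #include <stdint.h>

    /* Top 3 bits of the AETH syndrome byte select the response type. */
    const char *aeth_kind(uint8_t syndrome)
    {
        switch (syndrome >> 5) {
        case 0:  return "ACK";       /* 0x00-0x1F: low 5 bits = credit count */
        case 1:  return "RNR NAK";   /* 0x20-0x3F: low 5 bits = retry timer  */
        case 3:  return "NAK";       /* 0x60+: low 5 bits = error code       */
        default: return "reserved";
        }
    }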

RoCEv1 vs RoCEv2

Layer 3: RoCEv1: GRH (IB header) / RoCEv2: UDP/IP
Routable: RoCEv1: L2 only / RoCEv2: L3 ✓
ECMP: RoCEv1: no / RoCEv2: yes (UDP src port)
Wireshark filter: udp.port == 4791
EtherType (v1): 0x8915 (InfiniBand over Ethernet)