
Under the Hood

You know what InfiniBand is from the previous page. Now for how it actually works: the four mechanics a network engineer needs to understand to read a SuperPOD config or compare it sensibly with RoCE v2:

  1. Credit-based flow control — why IB can't drop
  2. The Subnet Manager — one brain runs the fabric
  3. Addressing — LID + GID instead of MAC + IP
  4. The wire format — what's on the link

Plus a quick reference on speed grades (SDR → XDR) so the alphabet soup in vendor data sheets makes sense.


1. Credit-based flow control

The defining feature. The thing RoCE v2 spends PFC + ECN trying to emulate. Worth understanding deeply.

The rule: the sender can only transmit if the receiver has explicitly granted credit for that packet. No credit, no packet. Period.

[Figure: sequence diagram of credit-based flow control. The receiver advertises 4 credits up front; the sender decrements its counter with each of PKT 1–4 and stops cold at 0 credits. When the receiver frees a buffer slot it returns a credit, the counter goes back to 1, and PKT 5 goes out. In steady state, every data packet eventually buys one returned credit.]
Sender pauses BEFORE sending. Drop is structurally impossible. Compare RoCE v2's PFC, which pauses *after* congestion has started — with milliseconds of buffer slosh and corner cases.

How the math works:

  • Every IB port has a receive buffer of fixed size (e.g. 64 KB per virtual lane).
  • The receiver advertises buffer credits to its peer — "I can absorb N more 64-byte chunks of your data."
  • The sender keeps a counter. Each packet sent decrements by the packet's size; each credit grant increments by the granted amount.
  • If the sender's counter would go negative, the sender does not send. Hardware enforced, no software involvement.
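
In code, the sender side is just a counter and a gate. A minimal sketch in Python, simplified to one credit per packet (real hardware counts 64-byte chunks per virtual lane, and the names here are invented for illustration):

```python
# Toy model of the sender-side gate in credit-based flow control.
# In real IB this is enforced per virtual lane in NIC/switch hardware;
# the names and the "1 packet = 1 credit" unit here are simplifications.
class CreditGatedSender:
    def __init__(self, advertised: int):
        self.credits = advertised       # granted by the receiver up front

    def try_send(self) -> bool:
        if self.credits == 0:
            return False                # hold the packet; a drop is impossible
        self.credits -= 1               # each packet consumes a credit
        return True

    def on_credit_return(self, granted: int = 1):
        self.credits += granted         # receiver freed buffer space

s = CreditGatedSender(advertised=4)
sent = [s.try_send() for _ in range(5)]
print(sent)              # [True, True, True, True, False]: PKT 5 waits
s.on_credit_return()
print(s.try_send())      # True: a credit came back, PKT 5 goes out
```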

Why this matters versus PFC (RoCE v2's substitute):

| Aspect | InfiniBand credits | RoCE v2 PFC |
| --- | --- | --- |
| When the pause happens | Before transmission (sender holds back) | After congestion (PFC frame sent upstream from the congested point) |
| Drop possible? | No — sender literally can't send | Yes, if buffers fill faster than PFC can react |
| Backpressure granularity | Per virtual lane (8 VLs) | Per traffic class (8 TCs) |
| Buffer required at receiver | Bounded by advertised credits | Sized for ~RTT × bandwidth (huge at 400G) |
| Tuning knobs | Few — fabric is self-tuning | Many — buffer thresholds, PFC headroom, ECN watermarks |

The IB approach is structurally simpler. The RoCE v2 approach is operationally cheaper (commodity Ethernet) but you spend the savings on careful tuning.
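
To put a number on that "huge at 400G" buffer row: while a PFC pause propagates and takes effect, line-rate traffic is still arriving and must be absorbed without drops. A back-of-the-envelope sketch, where the cable length and reaction time are illustrative assumptions rather than vendor figures:

```python
# Rough PFC headroom estimate for one 400G port. Illustrative numbers
# only; real sizing guides add margin for MTU rounding and pause-frame
# transmission delay.
link_gbps    = 400                  # port speed
cable_m      = 100                  # assumed cable length
prop_delay_s = cable_m / 2e8        # ~5 ns/m signal propagation
react_s      = 2e-6                 # assumed switch/NIC reaction time

# Bytes in flight during one propagation round trip plus reaction time:
inflight = (2 * prop_delay_s + react_s) * link_gbps * 1e9 / 8
print(f"~{inflight / 1024:.0f} KiB of headroom per port, per priority")
# ~146 KiB; multiply by ports x PFC priorities and buffers get huge.
```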


2. The Subnet Manager — one brain runs the fabric

InfiniBand has a centralized control plane. The Subnet Manager (SM) is a daemon that runs somewhere on the fabric (typically on a management server or a switch with embedded SM). It discovers every switch and port, assigns addresses, computes routing tables, and pushes them to every switch.

[Figure: control-plane comparison. Left, InfiniBand: four switches with a central Subnet Manager that assigns LIDs, computes routes, and pushes them to every switch. Right, Ethernet: four switches each running its own BGP daemon, peering over eBGP and converging on their own. IB has one brain; Ethernet has many.]
`opensm` is the reference implementation. NVIDIA UFM is the commercial fleet manager built on top. If the SM stops, the data plane keeps forwarding on the last pushed routes; restart it and the fabric gets re-swept and routes re-pushed.

What the SM actually does:

  • Sweeps the fabric every few seconds to discover topology changes
  • Assigns LIDs (Local Identifiers) to every port — basically picks a 16-bit address for everyone
  • Computes routing tables for every switch (typically Up*/Down* routing in HPC, fat-tree in AI)
  • Pushes routing tables via MAD (Management Datagram) packets on a special QP
  • Configures partitions (P_Keys — IB's equivalent of VLANs)
  • Configures QoS (service levels mapped to virtual lanes)
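
A toy sketch of that sweep-assign-route-push loop, to make the model concrete. Every name here is invented for illustration; the real opensm does this in optimized C, talking MADs on the management QP:

```python
# Toy model of a Subnet Manager sweep. All class/field names are
# invented for illustration; opensm walks the real topology via MADs.
from dataclasses import dataclass, field

@dataclass
class Port:
    guid: int                      # 64-bit port GUID, burned into the HW

@dataclass
class Fabric:                      # stand-in for the discovered topology
    ports: list = field(default_factory=list)

class ToySubnetManager:
    def __init__(self):
        self.next_lid = 1          # LID 0 is reserved
        self.lids = {}             # port GUID -> assigned LID

    def sweep(self, fabric: Fabric):
        # 1. Discover every port (a real SM walks switch by switch).
        for port in fabric.ports:
            # 2. Assign a LID to any port that doesn't have one yet.
            if port.guid not in self.lids:
                self.lids[port.guid] = self.next_lid
                self.next_lid += 1
        # 3+4. A real SM would now compute per-switch forwarding tables
        # (e.g. its updn or ftree engine) and push them fabric-wide.

sm = ToySubnetManager()
sm.sweep(Fabric(ports=[Port(0xAA), Port(0xBB)]))
print(sm.lids)   # {170: 1, 187: 2}; repeat sweeps keep LIDs stable
```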

The good: one source of truth. Want to know why a path is what it is? Ask the SM (smpquery and the other infiniband-diags tools read the fabric state directly). Want to reroute? Trigger a re-sweep or restart the SM. Configure once, push everywhere.

The bad: the SM is a single logical entity. Most deployments run an active SM + a standby SM on a second node for HA, but the model still requires a designated brain. Compared to Ethernet, where every switch is autonomous, the SM model concentrates control (and operational risk) in one place.

Hyperscalers picked Ethernet partly because of this — distributed BGP scales horizontally, while an SM-based fabric scales by making one SM faster. Add the ~48K-LID ceiling per subnet (see below), and at 100K+ ports the math eventually breaks for IB.


3. Addressing — LID + GID

InfiniBand has two address levels, like Ethernet has MAC + IP:

| IB | Ethernet analog | Size | Scope |
| --- | --- | --- | --- |
| LID (Local Identifier) | MAC address | 16 bits | Within one subnet (one SM domain) |
| GID (Global Identifier) | IPv6 address | 128 bits | Across subnets — used when routing through IB routers |

For most clusters, the entire fabric is one subnet — there's only one SM, one set of LIDs. GIDs exist for inter-subnet routing but rarely fire in a single cluster. NCCL on IB primarily uses LIDs.

The LID assignment process:

  1. Each NIC/switch port comes up with no LID
  2. The SM discovers it via the discovery sweep
  3. The SM assigns a LID from its pool
  4. The SM updates routing tables fabric-wide so this LID is reachable

A LID is only 16 bits → ~48K usable values per subnet. Big enough for the largest practical IB fabrics, small enough that the routing table fits in a switch's TCAM.

GIDs are 128 bits and look like IPv6 addresses. You saw the GID format in the RDMA section — the same (GID, QP num) pair identifies an endpoint on either fabric. On IB the default GID is deterministic: the 64-bit subnet prefix concatenated with the port's 64-bit GUID. On RoCE v2 it's derived from the IP.
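
A quick sketch of that derivation. The prefix is the well-known default subnet prefix; the GUID value is invented:

```python
# Sketch: the default IB GID is subnet prefix (64 bits) + port GUID
# (64 bits). The GUID below is an invented example.
import ipaddress

subnet_prefix = 0xFE80000000000000      # default subnet prefix
port_guid     = 0x0002C90300123456      # example 64-bit port GUID

gid = ipaddress.IPv6Address((subnet_prefix << 64) | port_guid)
print(gid)   # fe80::2:c903:12:3456, printed like an IPv6 address
```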


4. The wire format

What an InfiniBand packet looks like on the link — and side by side with RoCE v2 to show the lineage:

[Figure: wire formats side by side. TCP/IP: Ethernet 14 B + IP 20 B + TCP 20 B = 54 B of headers. RoCE v2: Ethernet 14 B + IP 20 B + UDP 8 B + BTH 12 B + RETH 16 B = 70 B of headers, plus a 4 B ICRC trailer. InfiniBand: LRH 8 B + BTH 12 B + RETH 16 B = 36 B of headers, plus 4 B ICRC and 2 B VCRC trailers (42 B total overhead).]
The BTH + RETH blocks are IB's transport — and they're literally the same bytes in RoCE v2. RoCE v2 swaps IB's LRH + physical layer for standard Ethernet + IP + UDP.

Key headers:

  • LRH (Local Routing Header) — 8 bytes. Contains the destination LID and service level. Switches forward on this. Replaced by Ethernet + IP in RoCE v2.
  • BTH (Base Transport Header) — 12 bytes. Carries the destination QP number, opcode (SEND / WRITE / READ / ATOMIC), PSN (packet sequence number). This is the IB transport. Identical bytes in IB and RoCE v2.
  • RETH (RDMA Extended Transport Header) — 16 bytes. Only for one-sided ops. Carries the remote address + rkey + length.
  • ICRC — invariant CRC over fields that don't change in transit.
  • VCRC — variant CRC over the whole packet, recomputed at each hop (IB only).

The whole point of this comparison: RoCE v2 didn't reinvent the transport. It took IB's BTH + RETH unchanged and bolted Ethernet + IP + UDP underneath. The semantics of one packet are identical.
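
To ground the "identical bytes" claim: here is a minimal sketch that packs a 12-byte BTH following the standard IBA field layout (the opcode is RC RDMA WRITE Only; the pkey, QP number, and PSN values are invented examples):

```python
# Sketch: packing the 12-byte IB Base Transport Header (BTH).
# Field layout per the IBA spec; all field values are invented examples.
import struct

RC_RDMA_WRITE_ONLY = 0x0A                      # RC "RDMA WRITE Only" opcode

def pack_bth(opcode: int, pkey: int, dest_qp: int, psn: int,
             ack_req: int = 1) -> bytes:
    flags = 0                                  # SE, M, PadCnt, TVer all zero here
    return struct.pack(
        ">BBHII",
        opcode,                                # byte 0: opcode (SEND/WRITE/READ/ATOMIC variant)
        flags,                                 # byte 1: SE | M | PadCnt | TVer bits
        pkey,                                  # bytes 2-3: partition key
        dest_qp & 0xFFFFFF,                    # bytes 4-7: reserved byte + 24-bit dest QP
        (ack_req << 31) | (psn & 0xFFFFFF),    # bytes 8-11: A bit + 24-bit PSN
    )

bth = pack_bth(RC_RDMA_WRITE_ONLY, pkey=0xFFFF, dest_qp=0x2A, psn=100)
assert len(bth) == 12   # these same 12 bytes sit inside RoCE v2's UDP payload
```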


5. Speed grades — the alphabet soup

InfiniBand speed names are letter abbreviations that change every few years. Here's the cheat sheet:

| Grade | Year | Per-lane rate | Typical 4× port |
| --- | --- | --- | --- |
| SDR (Single Data Rate) | 2003 | 2.5 Gbps | 10 Gbps |
| DDR (Double Data Rate) | 2005 | 5 Gbps | 20 Gbps |
| QDR (Quad Data Rate) | 2008 | 10 Gbps | 40 Gbps |
| FDR (Fourteen Data Rate) | 2011 | 14 Gbps | 56 Gbps |
| EDR (Enhanced Data Rate) | 2014 | 25 Gbps | 100 Gbps |
| HDR (High Data Rate) | 2018 | 50 Gbps | 200 Gbps |
| NDR (Next Data Rate) | 2022 | 100 Gbps | 400 Gbps |
| XDR (eXtreme Data Rate) | 2024–25 | 200 Gbps | 800 Gbps |

What you see in 2026:

  • New AI builds: NDR (400 Gbps) dominant, XDR (800 Gbps) for top-end clusters
  • Existing large clusters: HDR (200 Gbps) — Meta RSC, much of DGX A100 era
  • Legacy HPC: EDR (100 Gbps) still around

A "4×" port aggregates 4 lanes; an "8×" port (newer) aggregates 8. Vendor datasheets sometimes quote per-lane rate, sometimes per-port — read carefully.


What you should remember

  • Credit-based flow control = no drops, ever. Sender holds back before sending; PFC tries to do this after the fact and is messier.
  • Subnet Manager = one brain. Centralized, simple to reason about; doesn't scale to hyperscale port counts the way distributed BGP does.
  • LID is the MAC analog, GID is the IPv6 analog. Most single-subnet clusters use LIDs almost exclusively.
  • IB's BTH + RETH transport is unchanged in RoCE v2. RoCE v2 = IB transport on commodity Ethernet.
  • NDR (400 G) is the current AI default, XDR (800 G) is starting to ship.

Next section: RoCE v2 → — the same IB transport on commodity Ethernet, and the fabric this curriculum picks.