Under the Hood
You know what InfiniBand is from the previous page. Now for how it actually works — the four mechanics a network engineer needs to understand to read a SuperPOD config or compare it sensibly with RoCE v2:
- Credit-based flow control — why IB can't drop
- The Subnet Manager — one brain runs the fabric
- Addressing — LID + GID instead of MAC + IP
- The wire format — what's on the link
Plus a quick reference on speed grades (SDR → XDR) so the alphabet soup in vendor data sheets makes sense.
1. Credit-based flow control
The defining feature. The thing RoCE v2 spends PFC + ECN trying to emulate. Worth understanding deeply.
The rule: the sender can only transmit if the receiver has explicitly granted credit for that packet. No credit, no packet. Period.
How the math works (a minimal sketch in code follows this list):
- Every IB port has a receive buffer of fixed size (e.g. 64 KB per virtual lane).
- The receiver advertises buffer credits to its peer — "I can absorb N more 64-byte chunks of your data."
- The sender keeps a counter. Each packet sent decrements by the packet's size; each credit grant increments by the granted amount.
- If the sender's counter would go negative, the sender does not send. Hardware enforced, no software involvement.
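A minimal model of that sender-side bookkeeping, in Python rather than the hardware state machine that actually implements it (the class and the numbers are illustrative; credits are counted in the 64-byte units IB uses):

```python
CREDIT_UNIT = 64  # IB flow-control credits are counted in 64-byte blocks

class CreditedSender:
    """Toy model of per-virtual-lane credit accounting (illustration, not a driver)."""

    def __init__(self):
        self.credits = 0  # 64-byte blocks the receiver has granted and we haven't spent

    def on_credit_grant(self, blocks):
        # A flow-control packet from the receiver advertised more free buffer on this VL.
        self.credits += blocks

    def try_send(self, packet_bytes):
        needed = -(-packet_bytes // CREDIT_UNIT)  # ceil-divide the packet into blocks
        if needed > self.credits:
            return False          # no credit: the packet simply does not go on the wire
        self.credits -= needed    # counter decremented by the packet's size
        return True

vl0 = CreditedSender()
vl0.on_credit_grant(16)           # receiver says: 16 * 64 B = 1 KiB of buffer is free
print(vl0.try_send(1024))         # True  (exactly fits the granted credit)
print(vl0.try_send(64))           # False (counter would go negative, so nothing is sent)
```

The real mechanism exchanges absolute block counters (FCTBS/FCCL) per virtual lane rather than incremental grants, but the invariant is the same: the sender's view of free receiver buffer never goes negative.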
Why this matters versus PFC (RoCE v2's substitute):
| Aspect | InfiniBand credits | RoCE v2 PFC |
|---|---|---|
| When the pause happens | Before transmission (sender holds back) | After congestion (PFC frame sent upstream from the congested point) |
| Drop possible? | No — sender literally can't send | Yes, if buffers fill faster than PFC can react |
| Backpressure granularity | Per virtual lane (8 VLs) | Per traffic class (8 TCs) |
| Buffer required at receiver | Bounded by advertised credits | Sized for ~RTT × bandwidth (huge at 400G) |
| Tuning knobs | Few — fabric is self-tuning | Many — buffer thresholds, PFC headroom, ECN watermark, all of it |
The IB approach is structurally simpler. The RoCE v2 approach is operationally cheaper (commodity Ethernet) but you spend the savings on careful tuning.
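To put a rough number on the "sized for ~RTT × bandwidth" row, here is an illustrative back-of-the-envelope only; real PFC headroom formulas also add terms for MTU, cable length, and pause reaction latency:

```python
# Rough PFC headroom per (port, priority): after a PAUSE is sent, traffic already in
# flight keeps landing for roughly one reaction round trip.  Illustrative numbers only.
link_gbps = 400
rtt_us = 4                                   # assumed pause-reaction round trip

bytes_per_us = link_gbps * 1e9 / 8 / 1e6     # 50_000 bytes per microsecond at 400G
headroom = bytes_per_us * rtt_us             # ~200 KB per port per lossless priority
print(f"~{headroom / 1024:.0f} KiB of headroom per port per priority")
```

Multiply that by port count and by lossless priorities and you get the buffer bill RoCE v2 tuning guides spend so much time on; an IB receiver never needs more than the credits it chose to advertise.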
2. The Subnet Manager — one brain runs the fabric
InfiniBand has a centralized control plane. The Subnet Manager (SM) is a daemon that runs somewhere on the fabric (typically on a management server or a switch with embedded SM). It discovers every switch and port, assigns addresses, computes routing tables, and pushes them to every switch.
What the SM actually does:
- Sweeps the fabric every few seconds to discover topology changes
- Assigns LIDs (Local Identifiers) to every port — basically picks a 16-bit address for everyone
- Computes routing tables for every switch (typically Up*/Down* routing in HPC, fat-tree in AI)
- Pushes routing tables via MAD (Management Datagram) packets on QP0, the dedicated management QP
- Configures partitions (P_Keys — IB's equivalent of VLANs)
- Configures QoS (service levels mapped to virtual lanes)
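A toy version of the sweep-and-route cycle makes the "one brain" idea concrete. This is a sketch of the idea only, not OpenSM's actual code (OpenSM ships several routing engines, including min-hop, Up*/Down*, and fat-tree); the topology and port numbers below are made up:

```python
from collections import deque

# Toy fabric as the sweep would discover it: {node: {neighbor: local egress port}}.
topology = {
    "leaf1":  {"spine1": 1, "spine2": 2, "hcaA": 3},
    "leaf2":  {"spine1": 1, "spine2": 2, "hcaB": 3},
    "spine1": {"leaf1": 1, "leaf2": 2},
    "spine2": {"leaf1": 1, "leaf2": 2},
    "hcaA":   {"leaf1": 1},
    "hcaB":   {"leaf2": 1},
}

def min_hop_table(switch):
    """For one switch, map every destination to an egress port on a shortest path."""
    table, seen, queue = {}, {switch}, deque()
    for nbr, port in topology[switch].items():
        table[nbr] = port
        seen.add(nbr)
        queue.append((nbr, port))
    while queue:
        node, first_hop_port = queue.popleft()
        for nbr in topology[node]:
            if nbr not in seen:
                seen.add(nbr)
                table[nbr] = first_hop_port   # reached via the same first hop
                queue.append((nbr, first_hop_port))
    return table

# The SM computes one table per switch and pushes each one out in MAD packets.
for sw in ("leaf1", "spine1"):
    print(sw, min_hop_table(sw))
```

In a real fabric the table is keyed by destination LID (a linear forwarding table), and the routing engine breaks shortest-path ties to spread load across the spines.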
The good: one source of truth. Want to know why a path is what it is? Ask the SM; tools like smpquery read its view of the fabric directly. Want to reroute? Trigger an SM resweep or restart it. Configure once, push everywhere.
The bad: the SM is a single logical entity. Most deployments run an active SM + a standby SM on a second node for HA, but the model still requires a designated brain. Compared to Ethernet where every switch is autonomous, the SM-based model is operationally different.
Hyperscalers picked Ethernet partly because of this — distributed BGP scales horizontally, while an SM-based fabric scales by making the one SM (and its sweep) faster. At 100K+ ports that centralization becomes the limiting factor for IB.
3. Addressing — LID + GID
InfiniBand has two levels of addressing, analogous to Ethernet's MAC + IP split:
| IB | Ethernet analog | Size | Scope |
|---|---|---|---|
| LID (Local Identifier) | MAC address | 16 bits | Within one subnet (one SM domain) |
| GID (Global Identifier) | IPv6 address | 128 bits | Across subnets — used when routing through IB routers |
For most clusters, the entire fabric is one subnet — there's only one SM, one set of LIDs. GIDs exist for inter-subnet routing but rarely come into play inside a single cluster; NCCL on IB addresses peers primarily by LID.
The LID assignment process (a toy version in code follows the list):
- Each NIC/switch port comes up with no LID
- The SM discovers it via the discovery sweep
- The SM assigns a LID from its pool
- The SM updates routing tables fabric-wide so this LID is reachable
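A toy version of that loop, with the real unicast LID range baked in and everything else (the GUIDs especially) made up:

```python
# Unicast LIDs run from 0x0001 through 0xBFFF; 0xC000-0xFFFE is reserved for multicast.
UNICAST_LIDS = range(0x0001, 0xC000)
print(f"{len(UNICAST_LIDS)} usable unicast LIDs per subnet")   # 49151, i.e. ~48K

assigned = {}                 # port GUID -> LID, filled in as the sweep finds ports
pool = iter(UNICAST_LIDS)

def assign_lid(port_guid):
    if port_guid not in assigned:           # a re-discovered port keeps its LID
        assigned[port_guid] = next(pool)    # a new port gets the next free one
    return assigned[port_guid]

print(hex(assign_lid(0x0002C9030012D4B1)))  # 0x1
print(hex(assign_lid(0x0002C9030012D4B2)))  # 0x2
```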
A LID is only 16 bits → ~48K usable unicast values per subnet. Big enough for the largest practical IB fabrics, small enough that each switch can hold a simple linear forwarding table indexed by LID.
GIDs are 128 bits and look like IPv6 addresses. You saw the GID format in the RDMA section — the same (GID, QP num) pair identifies an endpoint on either fabric. On IB the default GID is the 64-bit subnet prefix (set by the SM) followed by the port's burned-in 64-bit GUID. On RoCE v2 it's derived from the IP.
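A sketch of how the two GID flavours are built. The prefix, GUID, and IP below are example values; the construction rules are the standard ones (default IB GID = 64-bit subnet prefix followed by the 64-bit port GUID, RoCE v2 GID for an IPv4 interface = the IPv4-mapped IPv6 address):

```python
import ipaddress

def ib_default_gid(subnet_prefix: int, port_guid: int) -> ipaddress.IPv6Address:
    """Default IB GID: 64-bit subnet prefix in the high half, port GUID in the low half."""
    return ipaddress.IPv6Address((subnet_prefix << 64) | port_guid)

def rocev2_gid_from_ipv4(ip: str) -> ipaddress.IPv6Address:
    """RoCE v2 GID for an IPv4 interface: the IPv4-mapped address ::ffff:a.b.c.d."""
    return ipaddress.IPv6Address(f"::ffff:{ip}")

# Example values: the default link-local subnet prefix fe80::/64 and a made-up GUID / IP.
print(ib_default_gid(0xFE80_0000_0000_0000, 0x0002_C903_0012_D4B1))  # fe80::2:c903:12:d4b1
print(rocev2_gid_from_ipv4("10.0.0.7"))    # an IPv4-mapped GID carrying 10.0.0.7
```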
4. The wire format
What an InfiniBand packet looks like on the link, and how it maps onto RoCE v2 to show the lineage:
Key headers:
- LRH (Local Routing Header) — 8 bytes. Contains the destination LID and service level. Switches forward on this. Replaced by Ethernet + IP in RoCE v2.
- BTH (Base Transport Header) — 12 bytes. Carries the destination QP number, opcode (SEND / WRITE / READ / ATOMIC), PSN (packet sequence number). This is the IB transport. Identical bytes in IB and RoCE v2.
- RETH (RDMA Extended Transport Header) — 16 bytes. Only for one-sided ops. Carries the remote address + rkey + length.
- ICRC — 4 bytes. Invariant CRC over the fields that don't change in transit.
- VCRC — 2 bytes. Variant CRC over the whole packet, recomputed at each hop (IB only).
The whole point of this comparison: RoCE v2 didn't reinvent the transport. It took IB's BTH + RETH unchanged and bolted Ethernet + IP + UDP underneath. Packet for packet, the transport semantics are identical.
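One way to see the "same transport, different envelope" point is to tot up per-packet header bytes for an RDMA WRITE on each fabric. The sizes below are the standard header lengths; the split between what each fabric keeps and what it replaces is the point of the exercise:

```python
# Per-packet overhead (bytes) for an RDMA WRITE carrying payload on each fabric.
IB_HEADERS = {
    "LRH": 8,        # local routing header: dest LID + SL, what IB switches forward on
    "BTH": 12,       # base transport header: dest QP, opcode, PSN
    "RETH": 16,      # RDMA extended header: remote VA + rkey + length
    "ICRC": 4,       # invariant CRC
    "VCRC": 2,       # variant CRC, recomputed per hop (IB only)
}
ROCEV2_HEADERS = {
    "Ethernet": 14,  # together with IP/UDP, replaces the LRH
    "IPv4": 20,
    "UDP": 8,        # destination port 4791 marks the packet as RoCE v2
    "BTH": 12,       # identical bytes to IB
    "RETH": 16,      # identical bytes to IB
    "ICRC": 4,       # kept from IB
    "FCS": 4,        # Ethernet frame check takes the place of the VCRC
}
print("shared transport headers:", sorted(set(IB_HEADERS) & set(ROCEV2_HEADERS)))
print("IB overhead:     ", sum(IB_HEADERS.values()), "bytes")       # 42
print("RoCE v2 overhead:", sum(ROCEV2_HEADERS.values()), "bytes")   # 78
```

(VLAN tags, the GRH used for multi-subnet IB, and Ethernet preamble are left out to keep the comparison simple.)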
5. Speed grades — the alphabet soup
InfiniBand speed names are letter abbreviations that change every few years. Here's the cheat sheet:
| Grade | Year | Per-lane rate | Typical 4× port |
|---|---|---|---|
| SDR (Single Data Rate) | 2003 | 2.5 Gbps | 10 Gbps |
| DDR (Double Data Rate) | 2005 | 5 Gbps | 20 Gbps |
| QDR (Quad Data Rate) | 2008 | 10 Gbps | 40 Gbps |
| FDR (Fourteen Data Rate) | 2011 | 14 Gbps | 56 Gbps |
| EDR (Enhanced Data Rate) | 2014 | 25 Gbps | 100 Gbps |
| HDR (High Data Rate) | 2018 | 50 Gbps | 200 Gbps |
| NDR (Next Data Rate) | 2022 | 100 Gbps | 400 Gbps |
| XDR (eXtreme Data Rate) | 2024–25 | 200 Gbps | 800 Gbps |
What you see in 2026:
- New AI builds: NDR (400 Gbps) dominant, XDR (800 Gbps) for top-end clusters
- Existing large clusters: HDR (200 Gbps) — Meta RSC, much of DGX A100 era
- Legacy HPC: EDR (100 Gbps) still around
A "4×" port aggregates 4 lanes; an "8×" port (newer) aggregates 8. Vendor datasheets sometimes quote per-lane rate, sometimes per-port — read carefully.
What you should remember
- Credit-based flow control = no drops, ever. Sender holds back before sending; PFC tries to do this after the fact and is messier.
- Subnet Manager = one brain. Centralized, simple to reason about; doesn't scale to hyperscale port counts the way distributed BGP does.
- LID is the MAC analog, GID is the IPv6 analog. Most single-subnet clusters use LIDs almost exclusively.
- IB's BTH + RETH transport is unchanged in RoCE v2. RoCE v2 = IB transport on commodity Ethernet.
- NDR (400 G) is the current AI default, XDR (800 G) is starting to ship.
Next section: RoCE v2 → — the same IB transport on commodity Ethernet, and the fabric this curriculum picks.