Under the Hood
You know what InfiniBand is from the previous page. Now how it actually works — the four mechanics a network engineer needs to understand to read a SuperPOD config or compare it sensibly with RoCE v2:
- Credit-based flow control — why IB can't drop
- The Subnet Manager — one brain runs the fabric
- Addressing — LID + GID instead of MAC + IP
- The wire format — what's on the link
Plus a quick reference on speed grades (SDR → XDR) so the alphabet soup in vendor data sheets makes sense.
- Walk through credit-based flow control: the receiver advertises buffer credits (e.g. 64 KB per virtual lane), the sender's counter is never allowed to go negative, and drop is structurally impossible before transmission, versus PFC pausing after congestion.
- Explain the Subnet Manager: opensm/UFM sweeps the fabric, assigns LIDs, computes routing, and pushes tables via MAD packets, plus why one-brain control doesn't scale to 100K+ ports like distributed BGP.
- Distinguish LID from GID: the 16-bit (~48K) MAC analog scoped to one subnet versus the 128-bit IPv6-like GID for inter-subnet routing, and read ibstat/sminfo/ibroute.
- Parse the IB wire format and speed grades: LRH / BTH / RETH / ICRC / VCRC, why RoCE v2 reuses BTH + RETH byte-for-byte, and that NDR (400 G) is the 2026 AI default with XDR (800 G) shipping.
1. Credit-based flow control
The defining feature. The thing RoCE v2 spends PFC + ECN trying to emulate. Worth understanding deeply.
The rule: the sender can only transmit if the receiver has explicitly granted credit for that packet. No credit, no packet. Period.
How the math works:
- Every IB port has a receive buffer of fixed size (e.g. 64 KB per virtual lane).
- The receiver advertises buffer credits to its peer — "I can absorb N more 64-byte chunks of your data."
- The sender keeps a counter of available credit. As a mental model, each packet sent decrements it by the packet's size, and each credit update from the receiver tops it back up.
- If the sender's counter would go negative, the sender does not send. Hardware enforced, no software involvement.
How credits actually move — flow control packets:
The credits aren't magic. They ride dedicated link-layer flow control packets (FCPs) that each port emits periodically, separate from data traffic. Two counters do all the work, both measured in credits = 64-byte blocks:
- FCTBS (Flow Control Total Blocks Sent) — a running total of blocks the sender has transmitted on this VL, carried in the data direction so the receiver can reconcile its accounting even if a packet was corrupted.
- FCCL (Flow Control Credit Limit) — the receiver's "you may send up to here" ceiling, derived from its free buffer and carried back in the FCP.
The sender may clock out the next packet only while it stays under the FCCL ceiling. Three properties make this robust where PFC is fragile:
- It's an absolute ceiling, not a delta. FCCL is "you may send up to block N," not "here are 4 more credits." So a lost FCP self-corrects on the very next one — there's no credit drift, no stuck-PAUSE equivalent, no watchdog needed.
- Periodic refresh. Each port sends an FCP at least every 65,536 symbol times even on an idle link, so the credit state is continuously re-asserted.
- Per virtual lane. IB multiplexes up to 15 data virtual lanes (VLs) on one physical link (8 is typical in shipping silicon), each with its own buffer and its own independent credit loop. Service Levels map to VLs, so a stalled VL can't head-of-line-block a healthy one — the same job PFC does across 8 priority classes, but lossless by construction instead of by threshold-and-headroom.
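To see why an absolute ceiling tolerates a lost flow control packet while a delta scheme would not, here is a minimal sketch. The function names and the grant sequence are hypothetical, and the counters are unbounded Python integers; the real FCTBS and FCCL fields are fixed-width and wrap, which this ignores.

```python
def credit_seen_with_deltas(grants, lost):
    """Each update says 'here are g more blocks'. A lost update never arrives."""
    credit = 0
    for i, g in enumerate(grants):
        if i not in lost:
            credit += g
    return credit


def credit_seen_with_ceiling(grants, lost):
    """Each update carries the receiver's running ceiling (FCCL-style)."""
    ceiling_at_receiver = 0
    ceiling_at_sender = 0
    for i, g in enumerate(grants):
        ceiling_at_receiver += g                     # receiver frees buffer, raises ceiling
        if i not in lost:
            ceiling_at_sender = ceiling_at_receiver  # sender overwrites its copy
    return ceiling_at_sender


def may_send(fctbs, fccl, packet_blocks):
    """Sender-side gate: total blocks sent must not pass the ceiling."""
    return fctbs + packet_blocks <= fccl


grants = [4, 4, 4, 4]  # receiver frees 4 blocks, four times
lost = {1}             # the second update is corrupted on the wire

print(credit_seen_with_deltas(grants, lost))         # 12: four blocks gone for good
print(credit_seen_with_ceiling(grants, lost))        # 16: the next update repaired it
print(may_send(fctbs=12, fccl=16, packet_blocks=4))  # True
print(may_send(fctbs=12, fccl=12, packet_blocks=4))  # False
```

With deltas, every lost update permanently shrinks the sender's view of the buffer. With a ceiling, the sender is only stale until the next update lands.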
Why this matters versus PFC (RoCE v2's substitute):
| Aspect | InfiniBand credits | RoCE v2 PFC |
|---|---|---|
| When the pause happens | Before transmission (sender holds back) | After congestion (PFC frame sent upstream from the congested point) |
| Drop possible? | No — sender literally can't send | Yes, if buffers fill faster than PFC can react |
| Backpressure granularity | Per virtual lane (8 VLs) | Per traffic class (8 TCs) |
| Buffer required at receiver | Bounded by advertised credits | Sized for ~RTT × bandwidth (huge at 400G) |
| Tuning knobs | Few — fabric is self-tuning | Many — buffer thresholds, PFC headroom, ECN watermark, all of it |
Two ways to be lossless — this is the whole fork in the road. InfiniBand is lossless by construction: credits mean the receiver's buffer can never be oversubscribed, so there is nothing to tune and no headroom to size. RoCE v2 is lossless by reaction: Ethernet has no credit mechanism, so PFC approximates one — let the buffer fill toward a threshold, fire a PAUSE upstream, and reserve headroom (2 × prop × bandwidth, ≈ 50 KB per 100 m at 400G) to catch the bytes already in flight when the PAUSE was sent. Credits prevent the overflow; PFC races to stop it. That's why IB needs near-zero congestion tuning while RoCE needs PFC thresholds and ECN watermarks and DCQCN — three subsystems doing what one credit counter does natively. You trade IB's structural elegance for Ethernet's economics and L3 routability, and you pay the difference in tuning.
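The headroom figure is easy to check. The sketch below assumes a propagation delay of about 5 ns per metre of cable, which is my assumption rather than a number from this page, and it counts only bytes in flight on the wire; real PFC headroom budgets add terms for response time and maximum-size frames. The function name is hypothetical.

```python
PROPAGATION_NS_PER_M = 5.0  # assumption: roughly 5 ns per metre of cable


def pfc_headroom_bytes(cable_m: float, link_gbps: float) -> float:
    """Bytes in flight during one cable round trip: 2 x prop x bandwidth."""
    round_trip_ns = 2 * cable_m * PROPAGATION_NS_PER_M
    bits_in_flight = round_trip_ns * link_gbps  # ns x Gbit/s = bits
    return bits_in_flight / 8


for gbps in (100, 200, 400):
    print(gbps, pfc_headroom_bytes(cable_m=100, link_gbps=gbps))
# 100 12500.0
# 200 25000.0
# 400 50000.0
```

The 400G row reproduces the ≈ 50 KB per 100 m quoted above, and the result scales linearly with both cable length and link speed. That headroom is needed per lossless priority, per port.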
2. The Subnet Manager — one brain runs the fabric
InfiniBand has a centralized control plane. The Subnet Manager (SM) is a daemon that runs somewhere on the fabric (typically on a management server or a switch with embedded SM). It discovers every switch and port, assigns addresses, computes routing tables, and pushes them to every switch.
What the SM actually does:
- Sweeps the fabric every few seconds to discover topology changes
- Assigns LIDs (Local Identifiers) to every port — basically picks a 16-bit address for everyone
- Computes routing tables for every switch (typically Up*/Down* routing in HPC, fat-tree in AI)
- Pushes routing tables via MAD (Management Datagram) packets on QP0, the queue pair reserved for subnet management
- Configures partitions (P_Keys — IB's equivalent of VLANs)
- Configures QoS (service levels mapped to virtual lanes)
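The first three steps (sweep, assign LIDs, compute tables) can be sketched on a toy fabric. Everything here is hypothetical: the topology, the node names, and the plain shortest-path routing, which stands in for whatever routing engine a real SM is configured with. A real forwarding table maps a LID to an output port number; the neighbor name plays that role below.

```python
from collections import deque

# Hypothetical fabric: one spine, two leaves, three hosts.
TOPOLOGY = {
    "spine1": ["leaf1", "leaf2"],
    "leaf1": ["spine1", "hostA", "hostB"],
    "leaf2": ["spine1", "hostC"],
    "hostA": ["leaf1"],
    "hostB": ["leaf1"],
    "hostC": ["leaf2"],
}


def sweep(topology, start):
    """Discover every reachable node breadth-first from the SM's own node."""
    seen, queue = [start], deque([start])
    while queue:
        node = queue.popleft()
        for peer in topology[node]:
            if peer not in seen:
                seen.append(peer)
                queue.append(peer)
    return seen


def assign_lids(nodes, first_lid=1):
    """Hand out LIDs in discovery order."""
    return {node: first_lid + i for i, node in enumerate(nodes)}


def forwarding_table(topology, lids, switch):
    """For each destination LID, the neighbor of `switch` on a shortest path."""
    first_hop = {peer: peer for peer in topology[switch]}
    queue = deque(topology[switch])
    while queue:
        node = queue.popleft()
        for peer in topology[node]:
            if peer != switch and peer not in first_hop:
                first_hop[peer] = first_hop[node]
                queue.append(peer)
    return {lids[dest]: hop for dest, hop in first_hop.items()}


nodes = sweep(TOPOLOGY, start="spine1")
lids = assign_lids(nodes)
table = forwarding_table(TOPOLOGY, lids, switch="leaf1")
print(lids["hostC"])           # 6
print(table[lids["hostA"]])    # hostA: directly attached
print(table[lids["hostC"]])    # spine1: go up to reach the other leaf
```

Note that every table is computed in one place from one global view. That is what makes the fabric easy to reason about, and also what ties reconvergence time to a single process.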
The good: one source of truth. Want to know why a path is what it is? Ask the SM. Want to reroute? Trigger a fresh sweep or restart the SM, then use smpquery to check what the switches now hold. Configure once, push everywhere.
The bad: the SM is a single logical entity. Most deployments run an active SM plus a standby SM on a second node for HA, but the model still requires a designated brain. On Ethernet every switch runs its own control plane; on IB every topology change has to be noticed, recomputed, and pushed by that one process.
Hyperscalers picked Ethernet partly because of this — distributed BGP scales horizontally, an SM-based fabric scales by making the SM faster. At 100K+ ports the math eventually breaks for IB.
3. Addressing — LID + GID
InfiniBand has two address levels, like Ethernet has MAC + IP:
| IB | Ethernet analog | Size | Scope |
|---|---|---|---|
| LID (Local Identifier) | MAC address | 16 bits | Within one subnet (one SM domain) |
| GID (Global Identifier) | IPv6 address | 128 bits | Across subnets — used when routing through IB routers |
For most clusters, the entire fabric is one subnet: there's only one SM and one set of LIDs. GIDs exist for inter-subnet routing but rarely come into play in a single cluster. NCCL on IB primarily uses LIDs.
The LID assignment process:
- Each NIC/switch port comes up with no LID
- The SM discovers it via the discovery sweep
- The SM assigns a LID from its pool
- The SM updates routing tables fabric-wide so this LID is reachable
A LID is only 16 bits → ~48K usable values per subnet. Big enough for the largest practical IB fabrics, small enough that each switch can hold a plain forwarding table indexed by destination LID.
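The ~48K figure comes from how the 16-bit space is carved up. The range boundaries below are my reading of the InfiniBand LID layout (unicast, multicast, one reserved value, one permissive value) and are not stated on this page, so treat them as an assumption; `classify_lid` is a hypothetical helper.

```python
UNICAST = range(0x0001, 0xBFFF + 1)    # assignable to ports by the SM
MULTICAST = range(0xC000, 0xFFFE + 1)  # multicast groups
RESERVED = 0x0000
PERMISSIVE = 0xFFFF


def classify_lid(lid: int) -> str:
    """Say which part of the 16-bit LID space a value falls in."""
    if not 0 <= lid <= 0xFFFF:
        raise ValueError("a LID is 16 bits")
    if lid in UNICAST:
        return "unicast"
    if lid in MULTICAST:
        return "multicast"
    return "permissive" if lid == PERMISSIVE else "reserved"


print(len(UNICAST))                                # 49151
print(len(MULTICAST))                              # 16383
print(len(UNICAST) + len(MULTICAST) + 2 == 2**16)  # True
print(classify_lid(0x0004), classify_lid(0xC001))  # unicast multicast
```

49,151 unicast values is 48K less one, which is where the ~48K ceiling on ports per subnet comes from.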
GIDs are 128 bits and look like IPv6 addresses. You saw the GID format in the RDMA section: the same (GID, QP num) pair identifies an endpoint on either fabric. On IB the default GID is the 64-bit subnet prefix, which the SM configures, followed by the port's 64-bit GUID. On RoCE v2 it's derived from the IP.
The IB-only commands
ibstat, ibhosts, sminfo, ibroute: the IB-specific tools that don't exist for Ethernet.
Highlights: ibstat showing Link layer: InfiniBand (not Ethernet) and a LID assigned by the SM, sminfo showing the SM in MASTER state, and ibroute showing the SM-pushed forwarding table of a switch.
4. The wire format
What an InfiniBand packet looks like on the link, top to bottom:
- LRH (Local Routing Header) — 8 bytes. Contains the destination LID and service level. Switches forward on this. (Replaced by Ethernet + IP in RoCE v2.)
- BTH (Base Transport Header) — 12 bytes. Carries the destination QP number, opcode (SEND / WRITE / READ / ATOMIC), PSN (packet sequence number). This is the IB transport.
- RETH (RDMA Extended Transport Header) — 16 bytes. Only for one-sided ops. Carries the remote address + rkey + length.
- ICRC — invariant CRC over fields that don't change in transit.
- VCRC — variant CRC over the whole packet, recomputed at each hop (IB only).
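As a rough illustration of the header sizes above, the sketch below packs a BTH and a RETH with Python's `struct` module. Several things are assumptions not taken from this page: the field order inside each header, the opcode value 0x0A for an RC RDMA WRITE Only packet, the default P_Key 0xFFFF, and the ICRC and VCRC sizes of 4 and 2 bytes. The function names are hypothetical, and the BTH flag bits are simply left at zero.

```python
import struct

LRH_BYTES, ICRC_BYTES, VCRC_BYTES = 8, 4, 2  # ICRC/VCRC sizes are assumptions


def pack_bth(opcode, dest_qp, psn, pkey=0xFFFF):
    """12-byte Base Transport Header; flag bits are left at zero in this sketch."""
    flags = 0  # SE, M, PadCnt, TVer
    return struct.pack(
        ">BBHII",
        opcode,
        flags,
        pkey,
        dest_qp & 0xFFFFFF,  # reserved byte + 24-bit destination QP
        psn & 0xFFFFFF,      # AckReq/reserved byte + 24-bit PSN
    )


def pack_reth(remote_addr, rkey, length):
    """16-byte RETH: where to write, with which key, and how many bytes."""
    return struct.pack(">QII", remote_addr, rkey, length)


RC_RDMA_WRITE_ONLY = 0x0A  # assumed opcode value

bth = pack_bth(RC_RDMA_WRITE_ONLY, dest_qp=0x000123, psn=7)
reth = pack_reth(remote_addr=0x7F00_0000_1000, rkey=0xDEADBEEF, length=4096)

print(len(bth), len(reth))  # 12 16
overhead = LRH_BYTES + len(bth) + len(reth) + ICRC_BYTES + VCRC_BYTES
print(overhead)             # 42

opcode, _flags, pkey, qp_word, psn_word = struct.unpack(">BBHII", bth)
print(hex(opcode), hex(qp_word & 0xFFFFFF), psn_word & 0xFFFFFF)  # 0xa 0x123 7
```

Under these assumptions a one-sided write carries 42 bytes of framing on IB. The `bth` and `reth` byte strings are the part that would travel unchanged on RoCE v2, with Ethernet + IP taking the place of the LRH.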
The key thing about BTH + RETH: RoCE v2 reuses them byte-for-byte. That's why the verbs API is identical across both fabrics — the transport is the same, only the layers underneath change. The full side-by-side byte layout (InfiniBand vs RoCE v2 vs TCP/IP) lives in What RoCE v2 Is, where that reuse is the headline.
5. Speed grades — the alphabet soup
InfiniBand speed names are letter abbreviations that change every few years. Here's the cheat sheet:
| Grade | Year | Per-lane rate | Typical 4× port |
|---|---|---|---|
| SDR (Single Data Rate) | 2003 | 2.5 Gbps | 10 Gbps |
| DDR (Double Data Rate) | 2005 | 5 Gbps | 20 Gbps |
| QDR (Quad Data Rate) | 2008 | 10 Gbps | 40 Gbps |
| FDR (Fourteen Data Rate) | 2011 | 14 Gbps | 56 Gbps |
| EDR (Enhanced Data Rate) | 2014 | 25 Gbps | 100 Gbps |
| HDR (High Data Rate) | 2018 | 50 Gbps | 200 Gbps |
| NDR (Next Data Rate) | 2022 | 100 Gbps | 400 Gbps |
| XDR (eXtreme Data Rate) | 2024–25 | 200 Gbps | 800 Gbps |
What you see in 2026:
- New AI builds: NDR (400 Gbps) dominant, XDR (800 Gbps) for top-end clusters
- Existing large clusters: HDR (200 Gbps) — Meta RSC, much of DGX A100 era
- Legacy HPC: EDR (100 Gbps) still around
A "4×" port aggregates 4 lanes; an "8×" port (newer) aggregates 8. Vendor datasheets sometimes quote per-lane rate, sometimes per-port — read carefully.
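A small helper makes the per-lane versus per-port distinction concrete. The per-lane figures are the rounded values from the table above; the function name is hypothetical.

```python
PER_LANE_GBPS = {
    "SDR": 2.5, "DDR": 5, "QDR": 10, "FDR": 14,
    "EDR": 25, "HDR": 50, "NDR": 100, "XDR": 200,
}


def port_rate_gbps(grade: str, lanes: int = 4) -> float:
    """Port rate = per-lane rate x number of lanes aggregated in the port."""
    return PER_LANE_GBPS[grade] * lanes


print(port_rate_gbps("NDR"))           # 400
print(port_rate_gbps("HDR"))           # 200
print(port_rate_gbps("SDR"))           # 10.0
print(port_rate_gbps("NDR", lanes=8))  # 800
```

When a datasheet quotes a single number, dividing it by the lane count and looking the result up in the table shows which grade it actually is.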
💡 What you should remember
| # | | Concept | Why it matters |
|---|---|---|---|
| 1 | 🚫 | Credit-based flow control = no buffer-overflow drops | Sender holds back before sending; PFC tries to do this after the fact and is messier. |
| 2 | 🧠 | Subnet Manager = one brain | Centralized, simple to reason about; doesn't scale to hyperscale port counts the way distributed BGP does. |
| 3 | 🌐 | LID is the MAC analog, GID is the IPv6 analog | Most single-subnet clusters use LIDs almost exclusively. |
| 4 | 🔁 | IB's BTH + RETH transport is unchanged in RoCE v2 | RoCE v2 = IB transport on commodity Ethernet. |
| 5 | ⚡ | NDR (400 G) is the current AI default, XDR (800 G) is starting to ship | The speed grade you'll quote on new builds. |
Next section: RoCE v2 → the same IB transport on commodity Ethernet, and the fabric this curriculum picks.