
ibv_devinfo Decoded

ibv_devinfo is the canonical "what's this RDMA NIC?" tool. It's also the most overloaded command output you'll meet on an RDMA host — fifteen lines of jargon, most of it inherited from the InfiniBand spec, half of it meaningless on a RoCE box, and the half that is meaningful encoded in a way that requires a decoder ring.

Your job is to know what ibv_devinfo tells you and — just as importantly — what it doesn't. When someone Slacks you a screenshot at 2am asking "is this NIC OK?", you should be able to read every line and answer in under a minute.

This page is that decoder ring.

Watch ibv_devinfo -v run on the rockynet lab simulator, with each field annotated as it scrolls — and then the show_gids cross-reference + ibstatus summary, so you've seen the three commands you'd actually type on a real host:


The full output, line by line

Here's a representative dump from one backend NIC on a GPU host running RoCE v2 over Ethernet:

hca_id: ib0
    transport: InfiniBand (0)
    fw_ver: 28.39.1002
    node_guid: a088:c203:005f:c296
    sys_image_guid: a088:c203:005f:c296
    vendor_id: 0x02c9
    vendor_part_id: 4129
    hw_ver: 0x0
    board_id: MT_0000000838
    phys_port_cnt: 1
        port: 1
            state: PORT_ACTIVE (4)
            max_mtu: 4096 (5)
            active_mtu: 4096 (5)
            sm_lid: 0
            port_lid: 0
            port_lmc: 0x00
            link_layer: Ethernet

Every field has a story. Let's walk them.


hca_id — what the OS calls this NIC

HCA = Host Channel Adapter. This is the InfiniBand-world term for a NIC, inherited from the IB spec, and it stuck. Every RDMA tool refers to NICs as HCAs even when the wire is plain Ethernet.

ib0 is the device name as known to the RDMA subsystem. On many hosts, the RDMA name (mlx5_0) and the netdev name (enp1s0f0) are different — udev rules can be configured to force them to match, which simplifies tooling and human debugging. On the host above, they match: ib0 is both the RDMA HCA and the Linux interface.

Why you'll care: when you tell NCCL or ib_write_bw which NIC to use, you give it the HCA ID:

NCCL_IB_HCA=ib0
ib_write_bw -d ib0 ...

If you give the netdev name where the HCA name is expected, the tool quietly picks the wrong device — or no device at all.
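To see which netdev sits behind each HCA name, you can walk sysfs. The sketch below assumes the usual Linux layout, where each RDMA device under /sys/class/infiniband has a device/net/ directory listing its netdevs. The function name and the overridable root path are illustrative, not part of any tool.

```python
from pathlib import Path


def rdma_to_netdevs(sysfs_root: str = "/sys/class/infiniband") -> dict[str, list[str]]:
    """Map each RDMA device name to the netdev names on the same PCI function."""
    mapping: dict[str, list[str]] = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        # No RDMA devices present, or the driver is not loaded.
        return mapping
    for dev in sorted(root.iterdir()):
        net_dir = dev / "device" / "net"
        if net_dir.is_dir():
            mapping[dev.name] = sorted(p.name for p in net_dir.iterdir())
        else:
            mapping[dev.name] = []
    return mapping
```

Calling rdma_to_netdevs() on a host returns something like a mapping from mlx5_0 to its netdev name, which tells you what to pass to NCCL_IB_HCA and what to pass to ip link.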


transport — the protocol semantics

transport: InfiniBand (0)

This means the verbs API and protocol semantics follow the InfiniBand specification. On any Mellanox/NVIDIA NIC, this is always "InfiniBand" — regardless of whether the wire is actual IB cabling or RoCE over commodity Ethernet. The link-layer choice lives one field down (link_layer); transport is purely about what the upper-layer protocol thinks it's speaking.

Possible values you might see:

Value | Meaning
InfiniBand (0) | IB-spec semantics. The default on every modern Mellanox NIC.
iWARP (1) | Alternative RDMA protocol layered over standard TCP. Effectively extinct in 2026.

If you ever see iWARP on a Mellanox NIC, something is very wrong — those NICs don't support iWARP. If you see it on a Chelsio T-series NIC, that's expected.


fw_ver — firmware version

Every modern NIC is essentially a small computer. It has its own embedded CPU and firmware. Firmware controls link negotiation, RDMA processing, congestion control logic, telemetry, ASIC offloads — basically everything that doesn't go through the host CPU.

The version format is <major>.<minor>.<build>. The major number tracks the NIC generation:

Firmware major | NIC family
12.x | ConnectX-4
14.x | ConnectX-4 Lx
16.x | ConnectX-5
20.x | ConnectX-6
22.x | ConnectX-6 Dx
26.x | ConnectX-6 Lx
28.x | ConnectX-7
40.x | ConnectX-8

Why you'll care: many tunables — DCQCN parameters, PFC behavior, advanced congestion-control modes, certain debug counters — require minimum firmware versions. When a feature mysteriously doesn't work, fw_ver is the first thing to check against the vendor's release notes.
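Checking fw_ver against a release-note minimum is a numeric comparison, not a string comparison ("28.9" sorts after "28.39" as text). A minimal sketch, assuming the three-part format described above; the minimum versions in the usage note are made up for illustration, so take real ones from the vendor's release notes.

```python
def parse_fw_ver(fw_ver: str) -> tuple[int, int, int]:
    """Split '<major>.<minor>.<build>' into integers."""
    major, minor, build = (int(part) for part in fw_ver.strip().split("."))
    return major, minor, build


def fw_at_least(installed: str, required: str) -> bool:
    """True if the installed firmware meets the required minimum.

    Refuses to compare across major numbers, because the major number
    tracks the NIC family rather than the age of the release.
    """
    have, need = parse_fw_ver(installed), parse_fw_ver(required)
    if have[0] != need[0]:
        raise ValueError(f"different NIC families: {installed} vs {required}")
    return have >= need
```

For example, fw_at_least("28.39.1002", "28.36.1010") is true, while fw_at_least("28.9.1002", "28.10.0") is false.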

To upgrade: use mlxfwmanager from the NVIDIA MFT (Mellanox Firmware Tools) package. It's not always installed by default on a hardened host image. Because it has to reach the NIC over the local PCIe bus, run it from a privileged debug container on the node rather than baking it into the GPU node's base image.


node_guid and sys_image_guid — the NIC's globally unique ID

node_guid: a088:c203:005f:c296
sys_image_guid: a088:c203:005f:c296

GUID = Globally Unique Identifier. A 64-bit number that uniquely identifies this NIC anywhere in the world. Think of it as the MAC address but for RDMA.

The shape: vendor OUI in the first 24 bits, per-NIC value in the rest. a088:c2 is one of the Mellanox/NVIDIA OUIs assigned by the IEEE — if you ever see a GUID that doesn't start with a known Mellanox OUI on what's supposed to be a Mellanox card, something's wrong.

How GUIDs get used depends on the link layer:

Link layer | Role of the GUID
InfiniBand | Used directly for L3 addressing. No MAC, no IP: packets are addressed by a GUID-derived GID.
RoCE v1 | Encoded into a MAC-derived GID. Largely deprecated.
RoCE v2 | The interface's IPv4 or IPv6 address gets mapped into the GID field; the underlying GUID still exists as the NIC's identity, but addressing on the wire is IP-based.

node_guid is the NIC's identity. sys_image_guid is the identity of the entire HCA module (i.e., the physical card). On a single-port card, they're equal. On a dual-port card, both ports share a sys_image_guid but have different node_guids — useful for telling "is this the same physical NIC?" apart from "is this the same port?"

GID, GUID, EUI-64 — the derivation chain

The 64-bit GUID is also called an EUI-64. You can derive a 128-bit IPv6 link-local GID from a MAC address by:

  1. Splitting the 48-bit MAC in half
  2. Inserting ff:fe between the two halves
  3. Flipping the universal/local bit (0x02 in the first octet)
  4. Prepending fe80::

That gives you the GID you'll see in show_gids for GID index 0 on a RoCE NIC. The higher GID indices on a RoCE NIC are derived from the assigned IPv4 (mapped into an IPv4-mapped IPv6 form) and IPv6 addresses. Each combination of (address, RoCE version) gets its own GID index. That's why a single RoCE NIC can show four or more GIDs.

When NCCL or ib_write_bw says "GID index 3" — it's picking one of these. Pick the wrong one (e.g. RoCE v1 when the fabric expects RoCE v2) and the connection silently fails.
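Both derivations are easy to reproduce by hand. Here is a minimal sketch using Python's standard ipaddress module; the function names, the MAC, and the IP address are illustrative. The output is written as eight fully expanded groups so it can be compared by eye with a GID listing.

```python
import ipaddress


def mac_to_link_local_gid(mac: str) -> str:
    """Derive the fe80:: link-local GID from a 48-bit MAC via EUI-64."""
    octets = bytes(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError(f"expected a 6-octet MAC, got {mac!r}")
    # Flip the universal/local bit, then insert ff:fe between the halves.
    eui64 = bytes([octets[0] ^ 0x02]) + octets[1:3] + b"\xff\xfe" + octets[3:]
    # fe80::/64 prefix (2 bytes + 6 zero bytes) followed by the 8-byte EUI-64.
    return ipaddress.IPv6Address(b"\xfe\x80" + bytes(6) + eui64).exploded


def ipv4_to_gid(ipv4: str) -> str:
    """Map an IPv4 address into its IPv4-mapped IPv6 form (::ffff:a.b.c.d)."""
    packed = ipaddress.IPv4Address(ipv4).packed
    return ipaddress.IPv6Address(bytes(10) + b"\xff\xff" + packed).exploded


print(mac_to_link_local_gid("a0:88:c2:5f:c2:96"))
print(ipv4_to_gid("10.0.0.1"))
```

The first call prints fe80:0000:0000:0000:a288:c2ff:fe5f:c296 and the second prints 0000:0000:0000:0000:0000:ffff:0a00:0001.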


vendor_id and vendor_part_id — what model NIC

vendor_id: 0x02c9 ← Mellanox/NVIDIA
vendor_part_id: 4129 ← ConnectX-7

The vendor_id here is not the PCI vendor ID. 0x02c9 is Mellanox's IEEE-assigned OUI (00:02:C9), and it is the same on every Mellanox / NVIDIA Networking card. lspci -nn shows a different number for the same vendor: the PCI-SIG vendor ID 15b3.

The vendor_part_id is the part-number decoder. Memorize the modern values or keep this table handy:

ID | Model | Generation | Typical speed
4099 | ConnectX-3 | Gen-3 | 40 / 56 Gbps
4115 | ConnectX-4 | Gen-4 | 100 Gbps
4117 | ConnectX-4 Lx | Gen-4 (low-cost) | 25 / 50 Gbps
4119 | ConnectX-5 | Gen-5 | 100 Gbps
4121 | ConnectX-5 Ex | Gen-5 | 100 Gbps (PCIe Gen4)
4123 | ConnectX-6 | Gen-6 | 200 Gbps
4125 | ConnectX-6 Dx | Gen-6 | 100 / 200 Gbps
4127 | ConnectX-6 Lx | Gen-6 (low-cost) | 25 / 50 Gbps
4129 | ConnectX-7 | Gen-7 | 200 / 400 Gbps
4131 | ConnectX-8 | Gen-8 | 400 / 800 Gbps

The same part IDs show up in lspci -n -d 15b3: as the PCI device ID, printed in hex rather than decimal: 4129 is 0x1021, so a ConnectX-7 appears as 15b3:1021. lspci and ibv_devinfo will therefore agree, and you can identify NICs without RDMA tools installed if you have to.
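The only trap when cross-referencing the two tools is the number base: ibv_devinfo prints vendor_part_id in decimal, lspci prints the device ID in hex. A small sketch of the conversion, with a lookup that mirrors part of the table above; the function names are illustrative.

```python
MELLANOX_PCI_VENDOR = 0x15B3

# Subset of the vendor_part_id table above.
PART_ID_TO_MODEL = {
    4115: "ConnectX-4",
    4117: "ConnectX-4 Lx",
    4119: "ConnectX-5",
    4121: "ConnectX-5 Ex",
    4123: "ConnectX-6",
    4125: "ConnectX-6 Dx",
    4127: "ConnectX-6 Lx",
    4129: "ConnectX-7",
    4131: "ConnectX-8",
}


def pci_id(vendor_part_id: int) -> str:
    """Render a decimal vendor_part_id as the vendor:device pair lspci prints."""
    return f"{MELLANOX_PCI_VENDOR:04x}:{vendor_part_id:04x}"


def model_from_pci_id(pci: str) -> str:
    """Look up the NIC model from an lspci-style 'vvvv:dddd' string."""
    vendor, device = (int(part, 16) for part in pci.split(":"))
    if vendor != MELLANOX_PCI_VENDOR:
        return "not a Mellanox/NVIDIA device"
    return PART_ID_TO_MODEL.get(device, "unknown part ID")


print(pci_id(4129))                    # 15b3:1021
print(model_from_pci_id("15b3:101d"))  # ConnectX-6 Dx
```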

Why you'll care: firmware, driver versions, supported features, PCIe generation, max speed — all are gated by NIC model. "Why is this 400G link only running at 200G?" is sometimes a CX-6 in a slot expected to hold a CX-7.


hw_ver and board_id — silicon revision and OEM SKU

hw_ver: 0x0
board_id: MT_0000000838

hw_ver is the silicon stepping. Usually 0x0 — non-zero values are revisions that occasionally matter for firmware compatibility, almost never for day-to-day operations.

board_id is the Mellanox manufacturing board ID — the exact OEM SKU. Different board_ids mean different physical layouts: single vs. dual port, different cooling, different OEM customizations (Dell SKU vs. HPE SKU vs. NVIDIA reference). Firmware is matched to board_id — you can't flash a single-port-card firmware onto a dual-port board, even if they're both CX-7.

MT_0000000838 is one common CX-7 single-port reference design. A dual-port CX-6 Dx board, by contrast, might show MT_0000000327. The numbers are opaque without NVIDIA's reference list, but they're stable identifiers — you can grep your fleet inventory by board_id to find every host with the same physical card.


phys_port_cnt — physical ports on this NIC

phys_port_cnt: 1

How many physical ports this device exposes. The host above has single-port 400G CX-7 cards — one QSFP-DD cage on the bracket, one port reported.

A dual-port card behaves differently than you might expect: each port appears as a separate RDMA device. A dual-port CX-6 Dx will show up as mlx5_1 and mlx5_2 in ibv_devices, each with phys_port_cnt: 1, sharing one sys_image_guid but with different node_guids. The lspci view will agree — two PCI functions (e.g. 26:00.0 and 26:00.1), one per port.

Heads-up for CX-8 / B300 hosts: dual-port 800G NICs with breakout-to-2×400G modes mean you'll see phys_port_cnt: 2 on some configurations and 1 on others, depending on the breakout config. Don't infer card count from device count alone.


state: PORT_ACTIVE (4)

The numeric IB states:

Value | Name | Meaning
1 | PORT_DOWN | Physical link not up
2 | PORT_INIT (Initialize) | IB-only: Subnet Manager bringing port up
3 | PORT_ARMED | IB-only: partial state during SM handshake
4 | PORT_ACTIVE | Link up, ready to carry traffic

On RoCE, you only ever see PORT_DOWN or PORT_ACTIVE. The intermediate states are artifacts of InfiniBand's SM-driven bring-up. So if you're staring at a RoCE NIC stuck in PORT_INIT, something is very wrong: the NIC thinks it's in IB mode, the link layer is misconfigured, or the driver is confused. Check link_layer immediately.
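That rule is simple enough to encode. The sketch below keys off the numeric code in parentheses rather than the state name, and takes the link_layer value as a second argument; the function name and the returned strings are illustrative.

```python
import re


def triage_port_state(state_field: str, link_layer: str) -> str:
    """Classify a 'state' value such as 'PORT_ACTIVE (4)' for a given link layer."""
    match = re.search(r"\((\d+)\)\s*$", state_field)
    if match is None:
        raise ValueError(f"no numeric state code in {state_field!r}")
    code = int(match.group(1))
    if code == 4:
        return "ok"
    if code == 1:
        return "link down"
    if code in (2, 3):
        # States 2 and 3 only exist during Subnet Manager bring-up.
        if link_layer == "Ethernet":
            return "suspicious: IB-only state on a RoCE port"
        return "waiting on Subnet Manager"
    return "unknown state"
```

triage_port_state("PORT_ACTIVE (4)", "Ethernet") returns "ok"; any state with code 2 or 3 on an Ethernet link layer comes back flagged as suspicious.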


max_mtu and active_mtu — RDMA MTU, not Ethernet MTU

This is the most confusing thing about RDMA, and the source of an evergreen operator gotcha. There are two MTUs at play on every RoCE NIC, they mean different things, and ibv_devinfo only shows one of them.

max_mtu: 4096 (5)
active_mtu: 4096 (5)

Two MTUs, two layers

+------------------------------------+
|          Application data          |
|     (any size: 1 byte → 2 GB)      |
+------------------------------------+
                  |
                  v
+------------------------------------+
|           RDMA messages            |
|     (split into RDMA packets       |
|      of size = active_mtu)         |  ← RDMA MTU layer
+------------------------------------+
                  |
                  v
+------------------------------------+
|          UDP / IP packet           |
|           Ethernet frame           |
|     (size ≤ ip link mtu,           |  ← Ethernet MTU layer
|      e.g. 9000 for jumbo)          |
+------------------------------------+

  • RDMA MTU (active_mtu in ibv_devinfo) — how big each RDMA-level packet is. The NIC silicon segments your large messages into chunks of this size for transport. Choices are fixed by the IB spec: 256, 512, 1024, 2048, 4096 bytes. You cannot go larger than 4096.
  • Ethernet MTU (set with ip link set dev ib0 mtu 9000) — how big each L2 frame can be on the wire. Standard Ethernet is 1500; "jumbo" deployments use 9000.

The (5) next to 4096 is just the enum index — the IB spec encodes the MTU as a small integer:

Enum value | RDMA MTU bytes
1 | 256
2 | 512
3 | 1024
4 | 2048
5 | 4096

Why 4096 on a jumbo-frame RoCE fabric

A 4096-byte RDMA packet plus UDP/IP headers plus the RoCE v2 BTH (Base Transport Header) lands at roughly 4150 bytes on the wire. That fits comfortably inside a 9000-byte Ethernet frame with headroom for VLAN tags and other encapsulation.

You can't go bigger than 4096 for the RDMA packet; that's the IB spec maximum. With an Ethernet MTU of 9000, about 4850 bytes of each frame's allowance go unused. That headroom is never transmitted, so it costs no bandwidth; it just means RoCE cannot fill a full jumbo frame. Jumbo frames are still worth configuring, because the alternative (a 1500-byte Ethernet MTU and a forced 1024-byte RDMA MTU) is much worse for throughput.
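To put a rough number on "much worse", compare the fraction of each frame that is payload at the two RDMA MTUs. This is a back-of-the-envelope sketch that assumes an untagged Ethernet frame carrying IPv4, and counts only the fixed per-packet headers; it ignores preamble, inter-frame gap, and any extended transport headers.

```python
# Fixed per-packet bytes for RoCE v2 over untagged IPv4 Ethernet (assumed layout):
# Ethernet header 14, IPv4 20, UDP 8, BTH 12, ICRC 4, Ethernet FCS 4.
HEADER_BYTES = 14 + 20 + 8 + 12 + 4 + 4


def wire_efficiency(rdma_mtu: int) -> float:
    """Fraction of each frame that is RDMA payload."""
    return rdma_mtu / (rdma_mtu + HEADER_BYTES)


for mtu in (1024, 4096):
    print(f"RDMA MTU {mtu}: {wire_efficiency(mtu):.1%} payload, "
          f"{4096 // mtu} packet(s) per 4 KiB of data")
```

Under these assumptions the payload share is about 94.3% at 1024 and about 98.5% at 4096. The byte overhead is only part of the story: at 1024 the NIC and switches also handle four times as many packets for the same data.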

The canonical mismatch: active_mtu: 1024 on a NIC you thought was running jumbo frames

Here's the gotcha. If you provision the NIC with Ethernet MTU 1500 (the default), the driver negotiates the RDMA MTU down to 1024 — because a 2048-byte RDMA packet won't fit in a 1500-byte Ethernet frame. You'll see:

max_mtu: 4096 (5) ← what the NIC could do
active_mtu: 1024 (3) ← what it's actually doing

When this appears on a "backend" NIC you expected to be jumbo, you have a config drift. Check ip link show ib0 — odds are the netdev MTU is 1500 instead of 9000. Fix the netdev MTU, bounce the link, the RDMA MTU re-negotiates up.
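You can predict the active_mtu you should see from the netdev MTU. The sketch below picks the largest IB MTU that fits after reserving room for the RoCE v2 headers. The reserve used here (IPv4 + UDP + BTH + ICRC) is an assumption for illustration; the driver may reserve somewhat more, which does not change the answer for the common 1500 and 9000 cases.

```python
# IB MTU enum values and their sizes in bytes (from the table above).
IB_MTU_ENUM = {1: 256, 2: 512, 3: 1024, 4: 2048, 5: 4096}

# Assumed header reserve inside the Ethernet MTU: IPv4 20 + UDP 8 + BTH 12 + ICRC 4.
ROCE_V2_HEADER_RESERVE = 20 + 8 + 12 + 4


def expected_active_mtu(eth_mtu: int, reserve: int = ROCE_V2_HEADER_RESERVE) -> tuple[int, int]:
    """Return (enum, bytes) for the largest RDMA MTU that fits in eth_mtu."""
    budget = eth_mtu - reserve
    fitting = [(enum, size) for enum, size in IB_MTU_ENUM.items() if size <= budget]
    if not fitting:
        raise ValueError(f"Ethernet MTU {eth_mtu} is too small for any RDMA MTU")
    return max(fitting)


print(expected_active_mtu(1500))  # (3, 1024)
print(expected_active_mtu(9000))  # (5, 4096)
```

If ibv_devinfo reports a smaller active_mtu than this predicts for the netdev MTU you intended, the netdev MTU on the host is the first thing to compare.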

How to inspect the Ethernet MTU

ip link show dev ib0 | grep mtu
cat /sys/class/net/ib0/mtu # same value, read straight from sysfs

You should see mtu 9000 for backend RoCE NICs that carry training traffic, mtu 1500 for any frontend / management NICs that only carry control plane.

Operator reflex: if active_mtu is anything other than 4096 (5) on a backend NIC, your training fabric is silently leaving throughput on the table. It won't show up as an error — it'll show up as worse-than-expected step times.


sm_lid, port_lid, port_lmc — InfiniBand-only, zero on RoCE

sm_lid: 0
port_lid: 0
port_lmc: 0x00

These three fields are pure InfiniBand. They mean nothing on a RoCE box. But they're always printed, and that confuses people, so:

  • LID = Local Identifier. A 16-bit number assigned by the InfiniBand Subnet Manager (SM) to every port in the IB fabric. Used as the L2 address inside an IB subnet.
  • SM = Subnet Manager. A daemon (OpenSM or NVIDIA UFM) that runs somewhere in the IB fabric and is responsible for discovery, address assignment, route programming, and fault recovery.
  • port_lmc = LID Mask Control. Controls a range of LIDs assigned to one port for multipath. Esoteric even in pure-IB shops.

Why they're all zero on RoCE: Ethernet has no Subnet Manager. Addressing is MAC + IP, assigned by DHCP or static config — the same way every other Ethernet device on the planet works. There's no centralized fabric controller assigning LIDs, so the LID fields stay at their default zero value.

This is actually one of the big operational advantages of RoCE over InfiniBand: no SM to run, no SM to debug, no SM to fail over. IB requires you to operate an SM daemon as part of the fabric; RoCE just uses standard IP networking. Less control-plane software, more familiar tooling.

If you ever see a non-zero sm_lid on a RoCE NIC, something is genuinely broken — the NIC thinks it's in an IB subnet and is going to behave weirdly.


link_layer: Ethernet

This is the line. Two possible values:

Value | Meaning
Ethernet | RoCE: the NIC is talking standard Ethernet on the wire, with IB transport encapsulated in UDP.
InfiniBand | True IB: the NIC is talking native InfiniBand.

transport: InfiniBand and link_layer: Ethernet together is the canonical RoCE signature. It looks contradictory if you don't know what each field means — transport is the upper-layer semantics, link_layer is what's on the wire. RoCE wraps IB transport in UDP/IP/Ethernet.

If you're trying to confirm "is this NIC running RoCE or IB?", link_layer is the single field to grep.

ibv_devinfo -d ib0 | grep link_layer

If it says Ethernet, you're on RoCE. If it says InfiniBand, you're on IB. There is no in-between.
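The RoCE signature and the "LIDs must be zero" rule from the previous section combine into one sanity check. A minimal sketch, assuming the ibv_devinfo fields have already been collected into a dict of strings keyed by field name; the function names and warning texts are illustrative.

```python
def fabric_type(fields: dict[str, str]) -> str:
    """Return 'RoCE' or 'InfiniBand' based on link_layer alone."""
    link_layer = fields.get("link_layer", "")
    if link_layer == "Ethernet":
        return "RoCE"
    if link_layer == "InfiniBand":
        return "InfiniBand"
    raise ValueError(f"unexpected link_layer: {link_layer!r}")


def roce_warnings(fields: dict[str, str]) -> list[str]:
    """List anything that contradicts the expected RoCE signature."""
    warnings: list[str] = []
    if fabric_type(fields) != "RoCE":
        return warnings
    if not fields.get("transport", "").startswith("InfiniBand"):
        warnings.append("transport is not InfiniBand")
    for key in ("sm_lid", "port_lid", "port_lmc"):
        # base 0 lets int() accept both '0' and '0x00'
        if int(fields.get(key, "0"), 0) != 0:
            warnings.append(f"{key} is non-zero on an Ethernet link layer")
    return warnings
```

For the sample host at the top of this page, fabric_type returns "RoCE" and roce_warnings returns an empty list.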


Putting it together — a fleet inventory at a glance

After you've decoded one host's ibv_devinfo, the structure for the whole fleet falls out naturally. A typical 8-GPU RoCE server reports something like:

HCA | Model | Mode | State | RDMA MTU | Firmware | Role
ib0 | CX-7 | Ethernet | Active | 4096 | 28.39.1002 | Backend rail 0
ib1 | CX-7 | Ethernet | Active | 4096 | 28.39.1002 | Backend rail 1
ib2 | CX-7 | Ethernet | Active | 4096 | 28.39.1002 | Backend rail 2
ib3 | CX-7 | Ethernet | Active | 4096 | 28.39.1002 | Backend rail 3
mlx5_1 | CX-6 Dx | Ethernet | Active | 1024 | 22.38.1002 | Frontend / management

Four backend rails on CX-7 at 400G with jumbo frames (RDMA MTU 4096), and one frontend NIC on CX-6 Dx at 100G with standard frames (RDMA MTU 1024). That active_mtu: 1024 on the frontend isn't a bug — it's correct, because the frontend NIC sits on a standard-MTU management network. The backends are where line-rate matters and where every byte of MTU buys you throughput.

This is the kind of summary you should be able to write from a fresh host you've never logged into, in under five minutes.
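Most of that table can be produced mechanically from ibv_devinfo text. The sketch below is a deliberately simple parser: it splits each line on the first colon, starts a new record at every hca_id, and keeps the first value seen for each key, so on a multi-port device only the first port's fields survive. The Model and Role columns are left out because they need a lookup table and site knowledge. All names are illustrative.

```python
import re

SAMPLE = """\
hca_id: ib0
    transport: InfiniBand (0)
    fw_ver: 28.39.1002
    node_guid: a088:c203:005f:c296
    vendor_part_id: 4129
    phys_port_cnt: 1
        port: 1
            state: PORT_ACTIVE (4)
            active_mtu: 4096 (5)
            link_layer: Ethernet
"""


def parse_devinfo(text: str) -> list[dict[str, str]]:
    """Turn ibv_devinfo output into one dict of fields per HCA."""
    devices: list[dict[str, str]] = []
    current = None
    for raw in text.splitlines():
        key, sep, value = raw.strip().partition(":")
        if not sep:
            continue  # not a 'key: value' line
        key, value = key.strip(), value.strip()
        if key == "hca_id":
            current = {"hca_id": value}
            devices.append(current)
        elif current is not None:
            current.setdefault(key, value)  # first port wins
    return devices


def summary_row(dev: dict[str, str]) -> dict[str, str]:
    """Reduce one parsed device to the inventory columns."""
    def strip_enum(value: str) -> str:
        # 'PORT_ACTIVE (4)' -> 'PORT_ACTIVE', '4096 (5)' -> '4096'
        return re.sub(r"\s*\(\d+\)$", "", value)

    return {
        "hca": dev["hca_id"],
        "mode": dev.get("link_layer", "?"),
        "state": strip_enum(dev.get("state", "?")),
        "rdma_mtu": strip_enum(dev.get("active_mtu", "?")),
        "firmware": dev.get("fw_ver", "?"),
    }


for device in parse_devinfo(SAMPLE):
    print(summary_row(device))
```

On a real host you would feed it the captured output of ibv_devinfo instead of SAMPLE, then add the model and role columns from your own inventory data.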


What you should remember

  • hca_id is the RDMA name (ib0, mlx5_0), distinct from the netdev name. Tools like NCCL want the HCA name.
  • transport: InfiniBand is always set on Mellanox NICs — it describes the upper-layer semantics, not the wire.
  • link_layer is the field that tells you RoCE vs. InfiniBand. Ethernet = RoCE. InfiniBand = native IB.
  • fw_ver major numbers map to NIC generation: 22.x = CX-6 Dx, 28.x = CX-7, 40.x = CX-8. Always check fw_ver first when a feature isn't working.
  • vendor_part_id decodes to a NIC model. 4119 = CX-5, 4123 = CX-6, 4129 = CX-7, 4131 = CX-8. Memorize the modern values.
  • node_guid is the NIC identity; sys_image_guid is the physical card identity. They match on single-port cards; on a dual-port card both ports share one sys_image_guid but report different node_guids.
  • GIDs are derived from the MAC via EUI-64 (index 0) or from assigned IPs (higher indices). Picking the wrong GID index silently breaks RoCE v2 connections.
  • max_mtu / active_mtu are RDMA MTU, NOT Ethernet MTU. RDMA MTU caps at 4096 by IB spec. Ethernet MTU is set with ip link. active_mtu: 1024 on a jumbo-frame fabric is a config drift, not a feature.
  • sm_lid, port_lid, port_lmc are zero on RoCE because Ethernet has no Subnet Manager. Non-zero on RoCE = something is wrong.
  • RoCE's operational win over IB is exactly this: no SM to run, no LIDs to manage, just standard IP networking.

Next: InfiniBand → — the native RDMA fabric, where all the IB-only fields above (sm_lid, port_lid, PORT_INITIALIZE) actually come alive. Then RoCE v2 for how the IB transport rides UDP/IP/Ethernet on commodity gear.