Multi-Rail Source Routing on the Host

You have 4 RoCE NICs on this host — ib0, ib1, ib2, ib3. Every rail goes to a different leaf switch. Every NIC has its own IP. The host must send traffic out the correct NIC — the one whose rail is its IP's home — or PFC, DCQCN, and source-NUMA all break.

By default, Linux can't do this. Out of the box, Linux will:

Send all outgoing traffic via whichever NIC has the lowest-metric default route (usually ib0).
Answer ARP requests for any IP via any NIC ("ARP flux").
Drop ingress packets that arrive on the "wrong" NIC when strict reverse-path filtering is enabled (whether it is depends on your distribution's defaults).

For multi-rail RoCE to work, you have to tell Linux explicitly that each NIC owns specific traffic. That's source-based routing. It's a wall of ip rule, ip route, and /proc/sys settings. It looks scary. It's actually a tidy little system once you see the model.

If you can read ip rule show and ip route show table 102 and immediately tell whether multi-rail routing is set up correctly, you can debug 70% of multi-rail RoCE bugs.

After this page, you'll be able to:

Explain why default routing breaks multi-rail: Linux asks "destination only," so traffic from 4 NICs in overlapping subnets all egresses one NIC, three rails sit idle, and the host uses ~25% of its fabric bandwidth.
Build source-based routing — per-NIC tables (101–104), ip rule add from <nic-ip> lookup <table>, and prove it with ip route get <dst> from <src>.
Kill the silent ingress bugs — arp_ignore=1/arp_announce=2 to stop ARP flux, rp_filter=2 (loose) so legitimate packets aren't dropped, and accept_local=1 for intra-host loopback.
Validate the full host — run the audit script and per-NIC ib_send_bw, expecting ~1.5+ Tbps aggregate across 4×400G rails, with each NIC's port_xmit_data rising only during its own test.

How Linux routes packets by default

When your process sends a packet, the kernel asks one question:

"What's the destination IP? Which route in the main table has the best (longest-prefix) match?"

It does NOT ask:

Which NIC's QP created this packet?
Which source IP is on this packet?
What's the workload trying to do?

If all 4 of your RoCE rails are in the same /16 pod subnet, the kernel sees all 4 routes as candidates and picks ONE — whichever has the lowest metric. All your traffic goes out that one NIC. The other three rails sit idle.

   App opens QP on ib1
        |
        | (creates packet with src=10.0.2.14, dst=peer)
        v
   Linux kernel: "where does peer live?"
        |
        | Looks up dst in 'main' table
        | Finds: 10.0.0.0/16 via ib0 metric 100  (or whatever)
        v
   Sends packet out ib0  ← WRONG NIC!
        |
        v
   ib0's rail gets all 4 NICs' worth of traffic; ib1/ib2/ib3 quiet

The destination-only question fails when:

All 4 NICs are in the same overarching subnet (the pod's address space)
You want traffic from a specific source IP to go out a specific NIC, not "whichever has the cheapest route"
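
To see the destination-only choice in miniature, the sketch below picks the lowest-metric candidate from a list of route lines, which is roughly the tie-break described above. The route lines are made-up sample data, and `pick_lowest_metric` and `sample_routes` are hypothetical helper names, not part of iproute2.

```shell
# Hypothetical helper: read route lines on stdin and print the device of the
# lowest-metric route. It never looks at the packet's source IP, which is
# exactly the blind spot of destination-only routing.
pick_lowest_metric() {
  awk '{
    dev = ""; metric = 0
    for (i = 1; i <= NF; i++) {
      if ($i == "dev")    dev = $(i + 1)
      if ($i == "metric") metric = $(i + 1) + 0
    }
    if (best == "" || metric < best_metric) { best = dev; best_metric = metric }
  } END { print best }'
}

# Illustrative sample routes (assumed values, not captured from a real host).
sample_routes() {
  cat <<'EOF'
10.0.0.0/16 dev ib0 metric 100
10.0.0.0/16 dev ib1 metric 101
10.0.0.0/16 dev ib2 metric 102
10.0.0.0/16 dev ib3 metric 103
EOF
}

sample_routes | pick_lowest_metric
```

Whatever source IP the packet carries, the answer is the same device every time. That is the behavior the rest of this page replaces.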

Why this breaks RoCE specifically

In a rail-optimized topology, every NIC connects to its own leaf:

                  Pod
   +-----------------------------------+
   |  ASW (rail 1)   ASW (rail 2)   ASW (rail 3)   ASW (rail 4)
   |     |              |              |              |
   +-----|--------------|--------------|--------------|---
         |              |              |              |
   Host: ib0            ib1            ib2            ib3
         10.0.1.14      10.0.2.14      10.0.3.14      10.0.4.14
         /24            /24            /24            /24

Each NIC sits in a different /24 rail subnet. If you leave Linux's default routing in place:

It picks "best" route based on metric.
Everything exits ib0 if ib0 has the lowest metric.
The rail-2 switch never sees ib1's traffic — because the kernel didn't route via ib1.

Result:

3 of your 4 rails sit idle.
Your "lossless DCQCN tuned per rail" config does nothing — all traffic on rail 1.
PFC fires on ib0's switch because it's overloaded; ib1/ib2/ib3 stay quiet.
NCCL sees uneven throughput; jobs run at ~1/4 expected speed.

The fix in one sentence

Use source-based routing: tell Linux that traffic from each NIC's IP must egress through THAT specific NIC, via a per-NIC routing table.

The Linux routing system you didn't know existed

Most people think Linux has one routing table. It has many: table IDs were historically limited to 256 (0 through 255), and modern kernels accept 32-bit IDs. You've only ever seen one by default: main. The rest are sitting there waiting for you to populate them. Think of it like BGP RIBs: you can have many, each consulted under different conditions, and rules decide which one wins.

The three default tables

ip rule show

On a vanilla Linux box:

0:      from all lookup local
32766:  from all lookup main
32767:  from all lookup default

Table	ID	Purpose
`local`	255	Loopback and local IPs — anything destined for this host.
`main`	254	The default user-visible routing table.
`default`	253	Fallback (almost always empty).

Each rule says: "for traffic matching this rule, look up the named table." Rules are evaluated in order of priority (low number = high priority). It's exactly like a policy route-map matched in order.

Custom tables

You can define your own tables. By convention they use IDs 1 through 252, because 253 to 255 are taken by default, main, and local. Each gets its own set of routes. The art is in the rules that decide which table to consult.

The source-routing pattern

For each NIC:
   1. Create a dedicated routing table (101, 102, 103, 104)
   2. Add routes to that table:
        - Direct route for the NIC's subnet
        - Default route via the NIC's gateway
   3. Add an ip rule: "if source IP == NIC's IP, look up its table"
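
The three-step pattern above can be sketched as a dry-run generator that prints the commands for one rail instead of executing them. This is a minimal sketch under assumptions: `emit_rail_commands` is a hypothetical helper, the rail subnet is assumed to be a /24, and the priority numbering (table 101 gets priority 1001) is just one convenient convention.

```shell
# Hypothetical dry-run helper: print (do not execute) the commands for one rail.
# Arguments: NIC name, NIC IP, gateway, table ID. Assumes a /24 rail subnet.
emit_rail_commands() {
  nic=$1; ip_addr=$2; gw=$3; table=$4
  subnet="${ip_addr%.*}.0/24"      # 10.0.2.14 -> 10.0.2.0/24
  prio=$((1000 + table - 100))     # table 102 -> priority 1002 (assumed scheme)
  echo "ip route add $subnet dev $nic src $ip_addr table $table"
  echo "ip route add default via $gw dev $nic table $table"
  echo "ip rule add from $ip_addr lookup $table priority $prio"
}

emit_rail_commands ib1 10.0.2.14 10.0.2.1 102
```

Printing first and piping to a shell only after review keeps a typo from rewriting routes on a live host.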

Flip the question

Source-based routing tells the kernel to ask a different question first:

"What's the SOURCE IP of this packet? Is there a rule that says 'for this source, use this table'?"

If yes, jump to that custom table. THAT table has only one default route — out the correct NIC. Decision made before the destination lookup even matters.

   App opens QP on ib1
        |
        | (creates packet with src=10.0.2.14, dst=peer)
        v
   Linux kernel: "is there a rule matching this source?"
        |
        | Looks up rule: from 10.0.2.14 lookup 102 ✓
        | Goes to table 102 (instead of main)
        v
   Table 102: "default route via ib1's gateway, dev ib1"
        |
        v
   Sends packet out ib1  ← CORRECT NIC ✓

Packets from each NIC's source IP now egress via that NIC's route — because the rule picks the correct table, which has the correct route.
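
If it helps to see the rule walk as code, here is a small model of it. It lists the tables the kernel would consult, in priority order, for a given source IP. It is a simplification: `tables_consulted` and `sample_rules` are hypothetical names, and only exact-IP and `all` selectors are modeled (no prefixes, fwmark, or iif matching).

```shell
# Hypothetical model: list the tables consulted, in order, for one source IP.
# Reads "PRIO: from SELECTOR lookup TABLE" lines on stdin.
tables_consulted() {
  src=$1
  sort -n | awk -v src="$src" '
    $2 == "from" && ($3 == "all" || $3 == src) { out = out (out ? " " : "") $5 }
    END { print out }'
}

# Sample rule list in the same shape as `ip rule show` output.
sample_rules() {
  cat <<'EOF'
0: from all lookup local
1001: from 10.0.1.14 lookup 101
1002: from 10.0.2.14 lookup 102
32766: from all lookup main
32767: from all lookup default
EOF
}

sample_rules | tables_consulted 10.0.2.14   # a source with its own rule
sample_rules | tables_consulted 192.0.2.7   # a source with no rule
```

The first table in the list that returns a route wins. `local` only answers for addresses on the host itself, so for a remote peer the per-NIC table decides before `main` is ever reached.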

The four pieces of the fix — at a glance

┌────────────────────────────────────────────────────────────────┐
│ 1. PER-NIC ROUTING TABLES (101, 102, 103, 104)                 │
│    Each table has just ONE default route via the right NIC     │
│                                                                │
│    table 101:  default via 10.0.1.1  dev ib0                   │
│    table 102:  default via 10.0.2.1  dev ib1                   │
│    table 103:  default via 10.0.3.1  dev ib2                   │
│    table 104:  default via 10.0.4.1  dev ib3                   │
└────────────────────────────────────────────────────────────────┘
                              ↑
                              │ consulted when matching rule fires
                              │
┌────────────────────────────────────────────────────────────────┐
│ 2. SOURCE-BASED RULES                                          │
│    "If source IP is X, look up table Y"                        │
│                                                                │
│    from 10.0.1.14  lookup 101                                  │
│    from 10.0.2.14  lookup 102                                  │
│    from 10.0.3.14  lookup 103                                  │
│    from 10.0.4.14  lookup 104                                  │
└────────────────────────────────────────────────────────────────┘

┌────────────────────────────────────────────────────────────────┐
│ 3. PER-NIC ARP TUNING                                          │
│    Each NIC only ARP-replies for its own IP                    │
│    Each NIC announces with its own IP                          │
│                                                                │
│    arp_ignore=1, arp_announce=2  (per NIC)                     │
└────────────────────────────────────────────────────────────────┘

┌────────────────────────────────────────────────────────────────┐
│ 4. LOOSE REVERSE-PATH FILTER                                   │
│    Don't drop ingress packets just because the reply path      │
│    would use a different NIC                                   │
│                                                                │
│    rp_filter=2  (everywhere)                                   │
└────────────────────────────────────────────────────────────────┘

Each piece solves one specific bug:

Pieces 1 + 2 → egress goes out the right NIC.
Piece 3 → switches learn the right MAC↔IP mapping per rail.
Piece 4 → ingress packets don't get silently dropped.

Source-based routing — the full picture

Let's build the picture step by step.

Step 1 — Each NIC has its own IP and gateway

ib0:  IP 10.0.1.14/24    gateway 10.0.1.1    (rail 1 leaf)
ib1:  IP 10.0.2.14/24    gateway 10.0.2.1    (rail 2 leaf)
ib2:  IP 10.0.3.14/24    gateway 10.0.3.1    (rail 3 leaf)
ib3:  IP 10.0.4.14/24    gateway 10.0.4.1    (rail 4 leaf)

Step 2 — Create one routing table per NIC

# Each table has TWO routes: subnet and default
ip route add 10.0.1.0/24 dev ib0  src 10.0.1.14  table 101
ip route add default     via 10.0.1.1  dev ib0   table 101

ip route add 10.0.2.0/24 dev ib1  src 10.0.2.14  table 102
ip route add default     via 10.0.2.1  dev ib1   table 102

# ...same for ib2 (table 103) and ib3 (table 104)

Step 3 — Add source-based rules

# For each NIC, "if source IP matches this NIC's IP, use that NIC's table"
ip rule add from 10.0.1.14 lookup 101 priority 1001
ip rule add from 10.0.2.14 lookup 102 priority 1002
ip rule add from 10.0.3.14 lookup 103 priority 1003
ip rule add from 10.0.4.14 lookup 104 priority 1004

The rule table now looks like:

0:      from all lookup local
1001:   from 10.0.1.14 lookup 101
1002:   from 10.0.2.14 lookup 102
1003:   from 10.0.3.14 lookup 103
1004:   from 10.0.4.14 lookup 104
32766:  from all lookup main
32767:  from all lookup default

Step 4 — Test the routing decision

# Verify routing path for traffic from each source IP
ip route get 10.0.2.99 from 10.0.1.14
#  → 10.0.2.99 from 10.0.1.14 via 10.0.1.1 dev ib0 table 101
#  → Uses ib0's table, exits via ib0 ✓

ip route get 10.0.2.99 from 10.0.2.14
#  → 10.0.2.99 from 10.0.2.14 dev ib1 table 102
#  → Uses ib1's table, exits via ib1 ✓

What's happening:

When the QP for ib1 wants to send a packet, the NIC stamps the packet's source IP as 10.0.2.14 (ib1's IP).
Linux sees source = 10.0.2.14, matches rule 1002, looks up table 102.
Table 102 has a default route via ib1 → packet egresses via ib1 ✓.

Each NIC's traffic stays on its rail. Multi-rail routing solved.
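
The Step 4 check is easy to script. The sketch below tests whether one `ip route get` result line names the expected device and table. `route_uses` is a hypothetical helper, and the sample line is the expected output from Step 4, not a capture from a live host.

```shell
# Hypothetical checker: succeed only if a route-get result line names the
# expected egress device and routing table.
route_uses() {
  line=$1; want_dev=$2; want_table=$3
  case " $line " in
    *" dev $want_dev "*) ;;
    *) return 1 ;;
  esac
  case " $line " in
    *" table $want_table "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Canned line from Step 4. On a live host you would pass
# "$(ip route get 10.0.2.99 from 10.0.1.14 | head -n 1)" instead.
if route_uses "10.0.2.99 from 10.0.1.14 via 10.0.1.1 dev ib0 table 101" ib0 101; then
  echo "ib0 OK"
fi
```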

The ARP flux problem (and the fix)

Source routing handles egress. Ingress has its own problem: ARP flux.

What is ARP flux?

ARP = Address Resolution Protocol. When a switch needs to send a packet to 10.0.1.14, it broadcasts "who has 10.0.1.14?" The host's NIC with that IP should respond with its MAC. The switch then forwards the packet.

Default Linux behavior: ANY NIC will respond to ARP requests for ANY of the host's IPs.

Switch on rail 2 asks: "who has 10.0.1.14?"
   (That IP belongs to ib0 on rail 1)
Linux default:
   ib1 sees the ARP request, knows 10.0.1.14 is on this host
   ib1 RESPONDS with ib1's MAC
Now switch on rail 2 thinks 10.0.1.14 lives behind ib1
   → routes 10.0.1.14 traffic to rail 2's leaf
   → packets show up on ib1
   → strict rp_filter may DROP them (traffic for dst=10.0.1.14 arrived on the wrong NIC, and the route back to the sender can point out a different one)

That's ARP flux. The host advertises every IP via every NIC. Switches get confused. Lossless RoCE traffic ends up on the wrong rails, dropped by reverse-path filter, mis-classified for PFC/ECN. Everything breaks silently.

The fix — two sysctls per NIC

# arp_ignore=1: Reply ONLY if the requested IP is configured on THIS NIC
echo 1 > /proc/sys/net/ipv4/conf/ib0/arp_ignore
echo 1 > /proc/sys/net/ipv4/conf/ib1/arp_ignore
echo 1 > /proc/sys/net/ipv4/conf/ib2/arp_ignore
echo 1 > /proc/sys/net/ipv4/conf/ib3/arp_ignore

# arp_announce=2: Always announce the IP that belongs to the egress NIC
echo 2 > /proc/sys/net/ipv4/conf/ib0/arp_announce
echo 2 > /proc/sys/net/ipv4/conf/ib1/arp_announce
echo 2 > /proc/sys/net/ipv4/conf/ib2/arp_announce
echo 2 > /proc/sys/net/ipv4/conf/ib3/arp_announce

After these settings:

ib0 only responds to ARP for 10.0.1.14.
ib1 only responds to ARP for 10.0.2.14.
Switches learn the correct MAC↔IP mapping per rail.
Each rail's traffic stays on its rail.
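
The difference between the two arp_ignore modes can be modeled in a few lines. This is a simplified sketch, not the kernel's code path: `arp_would_reply` is a hypothetical function, and it only covers modes 0 and 1.

```shell
# Hypothetical model of the reply decision for arp_ignore 0 and 1.
# Arguments: mode, requested IP, IP of the NIC that received the request,
# then every IP configured on the host.
arp_would_reply() {
  mode=$1; target=$2; ingress_ip=$3; shift 3
  if [ "$mode" -eq 1 ]; then
    [ "$target" = "$ingress_ip" ]   # only the owning NIC answers
    return
  fi
  for local_ip in "$@"; do          # mode 0: any local IP gets an answer
    [ "$target" = "$local_ip" ] && return 0
  done
  return 1
}

HOST_IPS="10.0.1.14 10.0.2.14 10.0.3.14 10.0.4.14"
# A request for ib0's IP arrives on ib1 (10.0.2.14):
arp_would_reply 0 10.0.1.14 10.0.2.14 $HOST_IPS && echo "mode 0: ib1 replies (flux)"
arp_would_reply 1 10.0.1.14 10.0.2.14 $HOST_IPS || echo "mode 1: ib1 stays silent"
```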

Possible values cheat sheet

Setting	Value	Meaning
`arp_ignore`	0	Reply for any local IP (default — UNSAFE for multi-rail).
`arp_ignore`	1	Reply only if requested IP is on this interface.
`arp_ignore`	2	Reply only if requested IP is on this interface AND same subnet.
`arp_announce`	0	Use any local IP (default — UNSAFE).
`arp_announce`	1	Try to use IP from same subnet as target.
`arp_announce`	2	Always use the best local IP for the egress interface.

For RoCE: arp_ignore=1, arp_announce=2 is the standard combination. Some operators run arp_ignore=2 instead — stricter, requires the requester to be in the same subnet, which is fine for rail-optimized topology because the rail leaf always is.

The reverse-path filter (rp_filter)

Linux has another safety mechanism that bites multi-rail setups: the reverse-path filter (rp_filter).

What it does

When a packet arrives on NIC X, rp_filter asks: "if I had to reply to this source IP, would my routing tell me to use NIC X?" If not, the packet is treated as spoofed and dropped silently. It's RFC 3704 anti-spoofing — useful on edge boxes, hostile on multi-NIC HPC hosts.

Why this is a problem for multi-rail

A packet arrives on ib1 with source = 192.168.30.40 (some remote IP)
rp_filter looks up "what's the route to 192.168.30.40?"
With default kernel routing, it might pick ib0 (the lowest-metric NIC)
"My route says I'd reach this peer via ib0, but the packet came in ib1"
→ DROPS the packet

You see this as packets arriving at the NIC (counters increment) but never reaching the application. Silent. Maddening.

The fix

Use loose mode (rp_filter=2) — the kernel checks that the source IP is routable but doesn't insist the incoming NIC matches.

# Loose RP filter — packet must be routable, but any interface is OK
echo 2 > /proc/sys/net/ipv4/conf/all/rp_filter
echo 2 > /proc/sys/net/ipv4/conf/ib0/rp_filter
echo 2 > /proc/sys/net/ipv4/conf/ib1/rp_filter
echo 2 > /proc/sys/net/ipv4/conf/ib2/rp_filter
echo 2 > /proc/sys/net/ipv4/conf/ib3/rp_filter

Values cheat sheet

Value	Meaning	Use when
0	Disabled	Almost never (loses spoof protection).
1	Strict (RFC 3704)	Single-NIC simple host.
2	Loose	Multi-rail HPC ← you.
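
The three modes reduce to a small decision function. The sketch below is a model for intuition only: `rp_filter_verdict` is a hypothetical name, and the "reverse NIC" argument stands in for the result of the kernel's route lookup back to the packet's source.

```shell
# Hypothetical model of the rp_filter decision.
# Arguments: mode (0/1/2), NIC the packet arrived on, NIC the route back to
# the source would use (empty string if the source is not routable at all).
rp_filter_verdict() {
  mode=$1; in_nic=$2; reverse_nic=$3
  case "$mode" in
    0) echo accept ;;
    1) if [ "$in_nic" = "$reverse_nic" ]; then echo accept; else echo drop; fi ;;
    2) if [ -n "$reverse_nic" ]; then echo accept; else echo drop; fi ;;
    *) echo "unknown mode" >&2; return 2 ;;
  esac
}

rp_filter_verdict 1 ib1 ib0   # strict: arrived on ib1, reply would leave via ib0
rp_filter_verdict 2 ib1 ib0   # loose: the source is routable via some NIC
```

The first call prints `drop` and the second prints `accept`, which is the whole argument for loose mode on a multi-rail host.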

Linux uses max(all, per-interface) for these sysctls. Setting all=2 is sufficient on its own — per-NIC zeros are just inheritance defaults, not active misconfig. Set both to be defensive.

`accept_local` — when both endpoints are on this host

One more knob. Inside a single host (loopback, intra-host RDMA tests), packets can arrive at an interface from a source IP that's also on this host. By default, Linux treats this as "Martian" traffic and drops it.

echo 1 > /proc/sys/net/ipv4/conf/all/accept_local

This lets ib1 receive a packet whose source is 10.0.1.14 (ib0's IP) without dropping it as a spoof. Important for loopback RDMA tests and certain GPU↔NIC topologies where the kernel sees traffic looping locally.

The complete host-side config — in one block

The standard multi-rail RoCE host config, assuming the rail layout above:

# 1. Source-routing tables (one per NIC)
ip route add 10.0.1.0/24    dev ib0  src 10.0.1.14   table 101
ip route add default        via 10.0.1.1   dev ib0   table 101

ip route add 10.0.2.0/24    dev ib1  src 10.0.2.14   table 102
ip route add default        via 10.0.2.1   dev ib1   table 102

ip route add 10.0.3.0/24    dev ib2  src 10.0.3.14   table 103
ip route add default        via 10.0.3.1   dev ib2   table 103

ip route add 10.0.4.0/24    dev ib3  src 10.0.4.14   table 104
ip route add default        via 10.0.4.1   dev ib3   table 104

# 2. Source-based rules
ip rule add from 10.0.1.14 lookup 101 priority 1001
ip rule add from 10.0.2.14 lookup 102 priority 1002
ip rule add from 10.0.3.14 lookup 103 priority 1003
ip rule add from 10.0.4.14 lookup 104 priority 1004

# 3. ARP tuning and loose rp_filter (per NIC)
for nic in ib0 ib1 ib2 ib3; do
  echo 1 > "/proc/sys/net/ipv4/conf/$nic/arp_ignore"
  echo 2 > "/proc/sys/net/ipv4/conf/$nic/arp_announce"
  echo 2 > "/proc/sys/net/ipv4/conf/$nic/rp_filter"
done

# 4. Host-wide: loose rp_filter and accept_local
echo 2 > /proc/sys/net/ipv4/conf/all/rp_filter
echo 1 > /proc/sys/net/ipv4/conf/all/accept_local

In production, this is rendered by a config tool (Salt, Ansible, or your equivalent) into per-host NetworkManager keyfiles or systemd-networkd .network files. Don't run these by hand on production boxes — they need to survive reboots.

Sysctl inheritance: the `all` override trick

Many operators don't set arp_ignore/arp_announce per NIC. They set them only on all. The effective value for any interface ends up correct due to Linux's inheritance rule:

Effective value = max(all_value, per_interface_value)

For these specific sysctls, "max" wins. So if you set:

/proc/sys/net/ipv4/conf/all/arp_ignore   = 2
/proc/sys/net/ipv4/conf/ib0/arp_ignore   = 0   (unset, default)

The effective value for ib0 is max(2, 0) = 2. Setting on all is enough.

When you read sysctls and see zeros on per-NIC interfaces, don't panic — check the all value first. If all has the right value, the per-NIC zeros are harmless inheritance defaults.
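
The inheritance rule is small enough to write down. A minimal sketch, with `effective_value` as a hypothetical helper name:

```shell
# Hypothetical helper: effective value = max(all, per-interface).
effective_value() {
  all_value=$1; nic_value=$2
  if [ "$all_value" -ge "$nic_value" ]; then
    echo "$all_value"
  else
    echo "$nic_value"
  fi
}

effective_value 2 0   # all=2, ib0 left at its default of 0
```

On a live host you could feed it real values, for example `effective_value "$(sysctl -n net.ipv4.conf.all.arp_ignore)" "$(sysctl -n net.ipv4.conf.ib0.arp_ignore)"`. Reading the per-NIC sysctl alone shows only the stored value, never the effective one.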

A cleaner pattern than per-NIC explicit settings:

sudo sysctl -w net.ipv4.conf.all.arp_ignore=2
sudo sysctl -w net.ipv4.conf.all.arp_announce=2
sudo sysctl -w net.ipv4.conf.all.rp_filter=2
sudo sysctl -w net.ipv4.conf.all.accept_local=1

arp_ignore=2 is stricter than =1:

=1: respond to ARP if the requested IP is on this NIC.
=2: respond ONLY if the requested IP is on this NIC AND the requester is in the same subnet.

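For rail-optimized multi-rail RoCE, =2 is a safe, stricter choice, because the rail leaf that sends the ARP request is always in the NIC's own subnet.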

Discovering current routing state on a host

When you walk up to an unfamiliar box, this is the audit script:

cat > /tmp/multirail_routing.sh << 'SCRIPT_EOF'
echo "=== ip rule show ==="
ip rule show

echo ""
echo "=== Routing tables ==="
echo "-- table main --"
ip route show table main | head -30

for t in 101 102 103 104 105 106 107 108; do
  echo "-- table $t --"
  ip route show table $t 2>/dev/null
done

echo ""
echo "=== IP addresses per NIC ==="
for nic in ib0 ib1 ib2 ib3; do
  echo "-- $nic --"
  ip addr show dev $nic | grep -E "inet |state"
done

echo ""
echo "=== ARP / rp_filter / accept_local sysctls ==="
for nic in ib0 ib1 ib2 ib3 all default; do
  echo "-- $nic --"
  for k in arp_ignore arp_announce rp_filter accept_local; do
    v=$(cat /proc/sys/net/ipv4/conf/$nic/$k 2>/dev/null)
    echo "  $k = $v"
  done
done

echo ""
echo "=== ip route get tests (where does traffic actually go?) ==="
for src in 10.0.1.14 10.0.2.14 10.0.3.14 10.0.4.14; do
  echo "-- from $src to a peer (10.0.99.99) --"
  ip route get 10.0.99.99 from $src 2>/dev/null
done
SCRIPT_EOF
bash /tmp/multirail_routing.sh

What you're checking:

ip rule show — are there source-based rules already?
Custom routing tables (101–108) — do they exist?
Per-NIC IPs — what's actually assigned?
arp_ignore=1, arp_announce=2, rp_filter=2, accept_local=1 — are sysctls set?
ip route get tests — would traffic from each NIC's IP actually exit through that NIC?

What a live, well-configured host looks like

Here's a real audit from a 4-NIC training host (host-01) running source routing in production:

Attribute	Live value	Notes
Source-based rules	Configured in a high-priority block of rule numbers	Rules slot in below `local`, above `main`.
Per-NIC routing tables	101–104 populated	Each routes the RoCE address space via its NIC.
Effective `arp_ignore`	2 (via `all` inheritance)	Stricter than the typical `=1` recommendation.
Effective `arp_announce`	1 (via `all` inheritance)	OK.
Effective `rp_filter`	2 (loose mode)	Correct for multi-rail.
Effective `accept_local`	1 (via `all` inheritance)	OK.
Control plane	Separate bonded interface on a standard DC subnet	RoCE NICs do not carry frontend traffic.
Outbound TCP CC	DCTCP for internal subnets, CUBIC for default	ECN-aware DC TCP variant on the control plane.
K8s overlay	VXLAN-based pod CIDR (e.g., Flannel)	Pod CIDRs reachable via the overlay interface.

A real-world IP plan twist — fewer subnets, more NICs

The 4-distinct-rail-subnets pattern is the textbook layout. In the field, you'll often see a simpler variation:

RAIL SUBNET 10.0.1.0/24                RAIL SUBNET 10.0.2.0/24
  ib0  10.0.1.14   gw 10.0.1.1          ib2  10.0.2.14   gw 10.0.2.1
  ib1  10.0.1.16   gw 10.0.1.1          ib3  10.0.2.16   gw 10.0.2.1
            (shared gateway)                        (shared gateway)

Each /24 has two NICs sharing the gateway. This shows up in proof-of-concept clusters and in some smaller deployments where four distinct rail subnets aren't strictly needed.

Source-based rules with this layout look like:

 from 10.0.1.14  lookup 101    ← ib0
 from 10.0.1.16  lookup 102    ← ib1
 from 10.0.2.14  lookup 103    ← ib2
 from 10.0.2.16  lookup 104    ← ib3

Each table has just ONE route — the entire RoCE address space via that NIC:

table 101: 10.0.0.0/16 via 10.0.1.1   dev ib0  metric 10
table 102: 10.0.0.0/16 via 10.0.1.1   dev ib1  metric 10
table 103: 10.0.0.0/16 via 10.0.2.1   dev ib2  metric 10
table 104: 10.0.0.0/16 via 10.0.2.1   dev ib3  metric 10

Tables 101 and 102 share gateway 10.0.1.1 because ib0 and ib1 are on the same /24. Tables 103 and 104 share 10.0.2.1 for the same reason. The dev field forces the egress NIC, even though the gateway is shared.

Two ip route get tests prove source routing works:

from 10.0.1.14  → via 10.0.1.1  dev ib0  table 101 ✓
from 10.0.2.14  → via 10.0.2.1  dev ib2  table 103 ✓

Each source IP correctly routes via its dedicated table and exits through its dedicated NIC. Multi-rail source routing is functional on this host.
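
For the shared-gateway layout, the per-table route and rule can be generated the same way as before. This is a dry-run sketch under assumptions: `emit_shared_gw_rail` is a hypothetical helper, and the rule is printed without an explicit priority because the listing above does not show one.

```shell
# Hypothetical dry-run helper for the shared-gateway layout: one /16 route per
# table, with `dev` pinning the egress NIC. Prints commands, runs nothing.
emit_shared_gw_rail() {
  nic=$1; ip_addr=$2; gw=$3; table=$4
  echo "ip route add 10.0.0.0/16 via $gw dev $nic metric 10 table $table"
  echo "ip rule add from $ip_addr lookup $table"
}

emit_shared_gw_rail ib0 10.0.1.14 10.0.1.1 101
emit_shared_gw_rail ib1 10.0.1.16 10.0.1.1 102
```

The two calls differ only in NIC, source IP, and table. The gateway is identical, and `dev` is what keeps ib1's traffic off ib0.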

Three networking layers, one host

Production AI training hosts juggle three networking layers simultaneously:

Backend RoCE: ib0–ib3 with source-routed traffic to the rail leaves.
Frontend / control plane: a bonded interface with standard DC routing for SSH, monitoring, image pulls.
K8s pod overlay: a VXLAN tunnel for pod-to-pod East-West traffic that doesn't touch the RoCE rails.

When you see "DCTCP enabled for internal traffic" on these hosts, that's the control plane: pods talking to other DC subnets over the bonded NIC use Data Center TCP (ECN-aware) instead of standard CUBIC. The RoCE rails don't use TCP at all — they're RDMA — so DCTCP doesn't apply there.

What's missing when SR-IOV joins the party

Once you start handing out VFs from these NICs to pods, you'll need to add:

Additional source rules for each VF's IP (VF0 → table 105, VF1 → 106, etc.), or extend the existing tables to cover VF address ranges.
Per-VF routing tables, OR a single table per rail that handles all VFs in that rail's subnet.
ARP / sysctl already covered by current all settings — no change needed.
Bond / frontend interface unchanged — VFs use the ib* rail paths, not the frontend.
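
One way to keep the per-VF numbering predictable is to derive the table ID from the VF index. The sketch below assumes the scheme hinted at above (VF0 uses table 105, VF1 uses 106, and so on). The helper names and the VF address are made up for illustration.

```shell
# Hypothetical numbering scheme: tables 101-104 stay with the physical NICs,
# VF n gets table 105 + n.
vf_table_id() {
  echo $((105 + $1))
}

# Print (do not execute) the source rule for one VF.
emit_vf_rule() {
  vf_index=$1; vf_ip=$2
  echo "ip rule add from $vf_ip lookup $(vf_table_id "$vf_index")"
}

emit_vf_rule 0 10.0.1.114   # the VF address is a made-up example
```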

Operations playbook — create, validate, verify, use

Watch the four-rail build run end-to-end on the rockynet lab simulator — routing tables created, source rules wired, ARP-flux + rp_filter sysctls applied, then ping verifying each rail reaches the right peer in isolation:


A. Create — applying the config

A.1 Assign IPs to each NIC (typically pre-done by the image)

# Static (nmcli — preferred for persistence)
sudo nmcli con mod ib0 ipv4.method manual ipv4.addresses 10.0.1.14/24
sudo nmcli con up ib0

# Or transient (lasts only until reboot)
sudo ip addr add 10.0.1.14/24 dev ib0
sudo ip link set ib0 up

A.2 Create per-NIC routing tables

# Per ib0 — replace IPs with your real config
sudo ip route add 10.0.1.0/24 dev ib0 src 10.0.1.14 table 101
sudo ip route add default via 10.0.1.1 dev ib0 table 101

# Repeat for ib1 (table 102), ib2 (table 103), ib3 (table 104)

Persistence: write these into /etc/NetworkManager/dispatcher.d/ or systemd-networkd .network files. In production, your host config tool emits these on each boot.

A.3 Add source-based rules

sudo ip rule add from 10.0.1.14 lookup 101 priority 1001
sudo ip rule add from 10.0.2.14 lookup 102 priority 1002
sudo ip rule add from 10.0.3.14 lookup 103 priority 1003
sudo ip rule add from 10.0.4.14 lookup 104 priority 1004

A.4 Apply per-NIC sysctls

for nic in ib0 ib1 ib2 ib3; do
  sudo sysctl -w net.ipv4.conf.$nic.arp_ignore=1
  sudo sysctl -w net.ipv4.conf.$nic.arp_announce=2
  sudo sysctl -w net.ipv4.conf.$nic.rp_filter=2
done
sudo sysctl -w net.ipv4.conf.all.rp_filter=2
sudo sysctl -w net.ipv4.conf.all.accept_local=1

For persistence: drop a file in /etc/sysctl.d/99-roce-multirail.conf with the same settings.
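
A sketch of what that file could contain, generated rather than hand-typed. `render_multirail_sysctl` is a hypothetical helper, and the values mirror step A.4.

```shell
# Hypothetical renderer: print sysctl.d-style lines for the given NICs.
render_multirail_sysctl() {
  for nic in "$@"; do
    echo "net.ipv4.conf.$nic.arp_ignore = 1"
    echo "net.ipv4.conf.$nic.arp_announce = 2"
    echo "net.ipv4.conf.$nic.rp_filter = 2"
  done
  echo "net.ipv4.conf.all.rp_filter = 2"
  echo "net.ipv4.conf.all.accept_local = 1"
}

render_multirail_sysctl ib0 ib1 ib2 ib3
```

To install it, you could pipe the output through `sudo tee /etc/sysctl.d/99-roce-multirail.conf` and reload with `sudo sysctl --system`. Per-interface keys can only take effect once the interface exists, so verify them after boot.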

B. Validate — confirming the config is right

B.1 Verify the rule list

ip rule show
# Expected:
# 0:     from all lookup local
# 1001:  from 10.0.1.14 lookup 101
# 1002:  from 10.0.2.14 lookup 102
# 1003:  from 10.0.3.14 lookup 103
# 1004:  from 10.0.4.14 lookup 104
# 32766: from all lookup main
# 32767: from all lookup default
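
Eyeballing the list works for one host. For a fleet, a checker that names the missing rules is handier. A minimal sketch, assuming `missing_rules` as a hypothetical helper fed with `ip rule show` text on stdin:

```shell
# Hypothetical checker: read rule text on stdin and print every expected
# "IP:TABLE" pair that has no matching "from IP lookup TABLE" rule.
missing_rules() {
  rules=$(cat)
  for pair in "$@"; do
    src=${pair%%:*}; table=${pair##*:}
    printf '%s\n' "$rules" |
      awk -v s="$src" -v t="$table" '
        $2 == "from" && $3 == s && $4 == "lookup" && $5 == t { found = 1 }
        END { exit !found }' || echo "$pair"
  done
}

# Canned input with the ib1 rule deliberately left out. On a live host:
#   ip rule show | missing_rules 10.0.1.14:101 10.0.2.14:102
printf '%s\n' \
  "0: from all lookup local" \
  "1001: from 10.0.1.14 lookup 101" \
  "32766: from all lookup main" |
  missing_rules 10.0.1.14:101 10.0.2.14:102
```

Empty output means every expected rule is present.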

B.2 Verify each custom routing table is populated

for t in 101 102 103 104; do
  echo "table $t:"
  ip route show table $t
done
# Each table should have:
#   <subnet>/24 dev ibN src <ip>
#   default via <gateway> dev ibN

B.3 Verify per-NIC sysctls

for nic in ib0 ib1 ib2 ib3; do
  echo "$nic:"
  sysctl net.ipv4.conf.$nic.arp_ignore
  sysctl net.ipv4.conf.$nic.arp_announce
  sysctl net.ipv4.conf.$nic.rp_filter
done
# Expected:
#   arp_ignore = 1   (or 0 here if you set only 'all' to 2; the effective value is the max of the two)
#   arp_announce = 2
#   rp_filter = 2

C. Verify — proving traffic actually uses the right rail

C.1 `ip route get` — the routing decision oracle

# Test 1: traffic from ib0's IP to a peer
ip route get 10.0.99.99 from 10.0.1.14
# Should output: ... dev ib0 ... table 101

# Test 2: traffic from ib1's IP to same peer
ip route get 10.0.99.99 from 10.0.2.14
# Should output: ... dev ib1 ... table 102

# All four should pick the correct dev. If any pick the wrong NIC,
# your source-rule mapping is broken.

C.2 Capture ARP responses

# In one terminal, capture ARP traffic on each NIC
for nic in ib0 ib1 ib2 ib3; do
  sudo tcpdump -nn -e -i $nic arp -c 10 &
done

# In another terminal, simulate ARP probes from a peer
# (or use arping from another host targeting each of your IPs)

# Expected: ib0 ONLY answers ARP for 10.0.1.14, etc.
# If multiple NICs answer for one IP, arp_ignore is wrong.

C.3 Run a multi-rail RDMA test

# On the peer host, start a server on the matching device before each client run:
# On peer: ib_send_bw -d ibN -x 3      # N = 0..3, same rail as the client

# From this host:
ib_send_bw -d ib0 -x 3 <peer_ib0_ip>      # ib0 → peer's ib0
ib_send_bw -d ib1 -x 3 <peer_ib1_ip>      # ib1 → peer's ib1
ib_send_bw -d ib2 -x 3 <peer_ib2_ip>      # ib2 → peer's ib2
ib_send_bw -d ib3 -x 3 <peer_ib3_ip>      # ib3 → peer's ib3

# All four should hit ~380-395 Gbps each on 400G NICs.
# Aggregate: 1.5+ Tbps off the host.
# If one is dramatically slower, routing is putting traffic on the wrong rail.

C.4 Counter verification per NIC

# Before tests, snapshot port_xmit_data per NIC
for nic in ib0 ib1 ib2 ib3; do
  cat /sys/class/infiniband/$nic/ports/1/counters/port_xmit_data
done

# Run benchmarks against each NIC

# After tests, read the counters again and subtract the earlier snapshot
for nic in ib0 ib1 ib2 ib3; do
  cat /sys/class/infiniband/$nic/ports/1/counters/port_xmit_data
done

Each NIC's port_xmit_data should increment ONLY during its own test. If ib1's counter rises during the ib0 test, ib1 is mistakenly carrying ib0's traffic — that's a smoking gun for a broken source rule.
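
The subtraction can be scripted so nobody diffs 64-bit counters by eye. A minimal sketch: `counter_deltas` is a hypothetical helper that takes two snapshots, each a list of "nic value" lines, and the numbers below are invented.

```shell
# Hypothetical helper: print "nic delta" for each NIC from two snapshots.
counter_deltas() {
  before=$1; after=$2
  printf '%s\n%s\n' "$before" "$after" | awk '
    !($1 in first) { first[$1] = $2; order[++n] = $1; next }
    { delta[$1] = $2 - first[$1] }
    END { for (i = 1; i <= n; i++) print order[i], delta[order[i]] + 0 }'
}

# Snapshot format on a live host (device names as used in this section):
#   for nic in ib0 ib1 ib2 ib3; do
#     echo "$nic $(cat /sys/class/infiniband/$nic/ports/1/counters/port_xmit_data)"
#   done
before="ib0 1000
ib1 2000"
after="ib0 1000
ib1 9000"
counter_deltas "$before" "$after"
```

With these invented numbers only ib1 moved, which is what a clean ib1-only test should look like.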

D. Use — how applications consume multi-rail

D.1 Pin a process to a specific NIC

Two complementary ways:

# Option 1: Bind the source IP — kernel picks NIC via source rules
SRC_IP=10.0.2.14   # ib1's IP
my_app --bind-ip $SRC_IP <args>

# Option 2: Set NIC explicitly via RDMA app env vars
export NCCL_IB_HCA=mlx5_2   # or ib1, depending on naming
my_app <args>

NCCL uses both: NCCL_IB_HCA picks the RDMA device; the RDMA QP marks its source IP correctly; kernel routes via the source rule. Three layers, all aligned.

D.2 NCCL multi-rail configuration

For 4 ranks per host, each using a different rail:

# rank 0 → ib0
export NCCL_IB_HCA=ib0
export NCCL_IB_GID_INDEX=3
export NCCL_SOCKET_IFNAME=ib0
mpirun -np 1 ... my_training_rank_0

# rank 1 → ib1
NCCL_IB_HCA=ib1 NCCL_IB_GID_INDEX=3 ... my_training_rank_1

# etc.

In practice, NCCL launchers do this automatically via NUMA-aware rank-to-NIC assignment based on the nvidia-smi topo -m matrix. The launcher reads the PCIe topology, figures out which NIC is closest to which GPU, and sets NCCL_IB_HCA accordingly.

D.3 Test with `arping` (cheap connectivity check)

# Force arping to use a specific source IP
arping -I ib0 -s 10.0.1.14 -c 3 10.0.1.1
arping -I ib1 -s 10.0.2.14 -c 3 10.0.2.1
arping -I ib2 -s 10.0.3.14 -c 3 10.0.3.1
arping -I ib3 -s 10.0.4.14 -c 3 10.0.4.1

Each should reach its rail's leaf gateway successfully. If any fails or replies come back via a different NIC, your ARP/routing config is broken.

D.4 Troubleshoot when a NIC isn't carrying its traffic

# 1. Did the kernel even consider routing through this NIC?
ip route get <dst> from <nic_src_ip>

# 2. Is the rule in place?
ip rule show | grep <nic_src_ip>

# 3. Is the table populated?
ip route show table <table_id>

# 4. Are ARP settings correct?
sysctl net.ipv4.conf.<nic>.arp_ignore
sysctl net.ipv4.conf.<nic>.arp_announce

# 5. Is rp_filter dropping it?
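nstat -az | grep -i ReversePathFilter
# A rising TcpExtIPReversePathFilter count = rp_filter dropped packets.
# For martian-source drops, set net.ipv4.conf.all.log_martians=1 and check dmesg.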

That five-step sequence resolves most "a NIC went quiet" issues. Walk it in order.

How the hyperscalers do it

Operator	Multi-rail routing approach
Most production AI shops	Standard Linux source-based routing + per-NIC sysctls, rendered by config tools (Salt, Ansible) into NM keyfiles.
Meta	Same pattern, internal tooling. They contributed many of the upstream kernel improvements for multi-NIC routing.
Google	Multi-NIC GCP VMs use route-based isolation; bare-metal GPU pods use source routing similar to the pattern here.
NVIDIA DGX SuperPOD ref	Documented in the DGX networking guide — same source-based routing pattern, with `mlnx_qos --trust dscp` integration.
AWS EFA	Each EC2 instance gets ONE EFA per NIC; no in-host multi-rail routing needed because rails are isolated at the VM boundary.

Common theme: Everyone does roughly the same thing. The Linux source-routing pattern is the de facto standard for multi-rail HPC hosts. It's not exotic — it's just unfortunately not the default. The kernel could ship with multi-NIC-aware policy routing out of the box, but it doesn't.

Self-check

Why does default Linux destination-based routing fail for multi-rail RoCE?
How many routing tables can Linux have, and which are the three default ones?
What's the difference between ip rule and ip route?
What's the ARP flux problem in one sentence? What two sysctls fix it?
rp_filter=2 vs rp_filter=1 — which mode for multi-rail, and why?
ip route get 10.0.99.99 from 10.0.2.14 returns "dev ib0". What's wrong?
NCCL chooses NIC via NCCL_IB_HCA=ib1. Why does the kernel ALSO need to be configured to route via ib1?
What's accept_local=1 and when do you need it?

💡 What you should remember

#		Concept	Why it matters
1	🌐	Default Linux routing is destination-only.	Give it 4 NICs in overlapping subnets and it picks one, sends everything that way, and ignores the rest. Your host drops to ~25% of its fabric bandwidth, silently.
2	🧮	Linux has many routing tables (256 historically, far more on modern kernels).	You'll use 4 of them (one per rail), each with a default route via its NIC.
3	🔁	`ip rule` decides which table to consult; `ip route` populates the table.	The rule matches on source IP. Source IP comes from the NIC's QP. The match picks the right table. The table sends packets out the right NIC.
4	⚠️	ARP flux is the silent killer of multi-rail.	Set `arp_ignore=1` (or `=2`) and `arp_announce=2` so each NIC only speaks for its own IP. Otherwise switches learn the wrong MAC↔IP mapping and your rails cross-pollinate.
5	🛡️	`rp_filter=2` (loose) on every NIC.	Strict mode drops legitimate ingress packets because the reply route would use a different NIC.
6	🔌	`accept_local=1`.	Keeps intra-host RDMA loopback tests from being classified as Martian.
7	🛠️	Your debugging cheat code is `ip route get <dst> from <src>`.	It tells you exactly which NIC and table the kernel will pick. If the answer names the wrong NIC or table, source routing is broken.
8	⚡	At 4 NICs × 400 Gbps, you should see ~1.5 Tbps off the host.	Anything dramatically less than that on aggregate `ib_send_bw` is almost always a routing problem before it's a hardware problem.

Next: Provisioning the GPU Host → — putting it all together: the host-side automation that turns a bare GPU server into a fabric-ready RoCE endpoint (driver stack, OFED, GPUDirect, PCIe ACS, fabric manager, DCGM).
