Multi-Rail Source Routing on the Host
You have 4 RoCE NICs on this host — ib0, ib1, ib2, ib3. Every rail goes to a different leaf switch. Every NIC has its own IP. The host must send traffic out the correct NIC — the one whose rail is its IP's home — or PFC, DCQCN, and source-NUMA all break.
By default, Linux can't do this. Out of the box, Linux will:
- Send all outgoing traffic via whichever NIC has the lowest-metric default route (usually ib0).
- Answer ARP requests for any IP via any NIC ("ARP flux").
- Drop ingress packets that arrive on the "wrong" NIC (reverse-path filter).
For multi-rail RoCE to work, you have to tell Linux explicitly that each NIC owns specific traffic. That's source-based routing. It's a wall of ip rule, ip route, and /proc/sys settings. It looks scary. It's actually a tidy little system once you see the model.
If you can read ip rule show and ip route show table 102 and immediately tell whether multi-rail routing is set up correctly, you can debug 70% of multi-rail RoCE bugs.
How Linux routes packets by default
When your process sends a packet, the kernel asks one question:
"What's the destination IP? Which route in the main table has the best (longest-prefix) match?"
It does NOT ask:
- Which NIC's QP created this packet?
- Which source IP is on this packet?
- What's the workload trying to do?
If all 4 of your RoCE rails are in the same /16 pod subnet, the kernel sees all 4 routes as candidates and picks ONE — whichever has the lowest metric. All your traffic goes out that one NIC. The other three rails sit idle.
App opens QP on ib1
|
| (creates packet with src=10.0.2.14, dst=peer)
v
Linux kernel: "where does peer live?"
|
| Looks up dst in 'main' table
| Finds: 10.0.0.0/16 via ib0 metric 100 (or whatever)
v
Sends packet out ib0 ← WRONG NIC!
|
v
ib0's rail gets all 4 NICs worth of traffic; ib1/ib2/ib3 quiet
The destination-only question fails when:
- All 4 NICs are in the same overarching subnet (the pod's address space)
- You want traffic from a specific source IP to go out a specific NIC, not "whichever has the cheapest route"
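You can reproduce the metric tie-break offline. What follows is a hedged sketch, not kernel code: `pick_lowest_metric` is a hypothetical helper name, it reads route lines in `ip route show` format from stdin, and it assumes every candidate line covers the same prefix, so only the metric decides.

```shell
# pick_lowest_metric: print the device of the lowest-metric route on stdin.
# Assumption: all input lines are candidates for the same prefix, in
# `ip route show` format, e.g. "10.0.0.0/16 dev ib1 metric 101".
# A line with no "metric" keyword is treated as metric 0.
pick_lowest_metric() {
  awk '
    {
      dev = ""; metric = 0
      for (i = 1; i < NF; i++) {
        if ($i == "dev")    dev = $(i + 1)
        if ($i == "metric") metric = $(i + 1) + 0
      }
      if (dev != "" && (best == "" || metric < best_metric)) {
        best = dev; best_metric = metric
      }
    }
    END { if (best != "") print best }
  '
}
```

Feed it `ip route show 10.0.0.0/16` (or whatever your pod prefix is) and it names the single NIC that would carry everything.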
Why this breaks RoCE specifically
In a rail-optimized topology, every NIC connects to its own leaf:
Pod
+-----------------------------------+
| ASW (rail 1) ASW (rail 2) ASW (rail 3) ASW (rail 4)
| | | | |
+-----|--------------|--------------|--------------|---
| | | |
Host: ib0 ib1 ib2 ib3
10.0.1.14 10.0.2.14 10.0.3.14 10.0.4.14
/24 /24 /24 /24
Each NIC sits in a different /24 rail subnet. If you leave Linux's default routing in place:
- It picks "best" route based on metric.
- Everything exits ib0 if ib0 has the lowest metric.
- The rail-2 switch never sees ib1's traffic, because the kernel didn't route via ib1.
Result:
- 3 of your 4 rails sit idle.
- Your "lossless DCQCN tuned per rail" config does nothing — all traffic on rail 1.
- PFC fires on ib0's switch because it's overloaded; ib1/ib2/ib3 stay quiet.
- NCCL sees uneven throughput; jobs run at ~1/4 expected speed.
The fix in one sentence
Use source-based routing: tell Linux that traffic from each NIC's IP must egress through THAT specific NIC, via a per-NIC routing table.
The Linux routing system you didn't know existed
Most people think Linux has one routing table. It has many: the classic table ID space runs from 0 through 255 (256 tables), and current kernels accept 32-bit table IDs. You've only ever seen one by default: main. The rest are sitting there waiting for you to populate them. Think of it like BGP RIBs: you can have many, each consulted under different conditions, and rules decide which one wins.
The three default tables
ip rule show
On a vanilla Linux box:
0: from all lookup local
32766: from all lookup main
32767: from all lookup default
| Table | ID | Purpose |
|---|---|---|
| local | 255 | Loopback and local IPs: anything destined for this host. |
| main | 254 | The default user-visible routing table. |
| default | 253 | Fallback (almost always empty). |
Each rule says: "for traffic matching this rule, look up the named table." Rules are evaluated in order of priority (low number = high priority). It's exactly like a policy route-map matched in order.
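To see the priority walk for one source address, you can filter `ip rule show` output. This is a hedged sketch: `tables_for_src` is a hypothetical name, and it only understands `from all` and exact-host `from <ip>` selectors. Prefix selectors, fwmark, iif and `to` selectors are ignored.

```shell
# tables_for_src SRC: read `ip rule show` output on stdin and print, in
# priority order, the tables whose rules match source address SRC.
# Assumption: only "from all" and exact-host "from <ip>" rules are handled.
tables_for_src() {
  sort -n | awk -v src="$1" '
    $2 == "from" && ($3 == "all" || $3 == src) && $4 == "lookup" { print $5 }
  '
}
```

Usage: `ip rule show | tables_for_src 10.0.2.14`. The kernel consults the printed tables top to bottom and stops at the first one that yields a route.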
Custom tables
You can define your own tables (1 through 252). Each gets its own set of routes. The art is in the rules that decide which table to consult.
The source-routing pattern
For each NIC:
1. Create a dedicated routing table (101, 102, 103, 104)
2. Add routes to that table:
- Direct route for the NIC's subnet
- Default route via the NIC's gateway
3. Add an ip rule: "if source IP == NIC's IP, look up its table"
Flip the question
Source-based routing tells the kernel to ask a different question first:
"What's the SOURCE IP of this packet? Is there a rule that says 'for this source, use this table'?"
If yes, jump to that custom table. THAT table has only one default route — out the correct NIC. Decision made before the destination lookup even matters.
App opens QP on ib1
|
| (creates packet with src=10.0.2.14, dst=peer)
v
Linux kernel: "is there a rule matching this source?"
|
| Looks up rule: from 10.0.2.14 lookup 102 ✓
| Goes to table 102 (instead of main)
v
Table 102: "default route via ib1's gateway, dev ib1"
|
v
Sends packet out ib1 ← CORRECT NIC ✓
Packets from each NIC's source IP now egress via that NIC's route — because the rule picks the correct table, which has the correct route.
The four pieces of the fix — at a glance
┌────────────────────────────────────────────────────────────────┐
│ 1. PER-NIC ROUTING TABLES (101, 102, 103, 104) │
│ Each table has just ONE default route via the right NIC │
│ │
│ table 101: default via 10.0.1.1 dev ib0 │
│ table 102: default via 10.0.2.1 dev ib1 │
│ table 103: default via 10.0.3.1 dev ib2 │
│ table 104: default via 10.0.4.1 dev ib3 │
└────────────────────────────────────────────────────────────────┘
↑
│ consulted when matching rule fires
│
┌────────────────────────────────────────────────────────────────┐
│ 2. SOURCE-BASED RULES │
│ "If source IP is X, look up table Y" │
│ │
│ from 10.0.1.14 lookup 101 │
│ from 10.0.2.14 lookup 102 │
│ from 10.0.3.14 lookup 103 │
│ from 10.0.4.14 lookup 104 │
└────────────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────────┐
│ 3. PER-NIC ARP TUNING │
│ Each NIC only ARP-replies for its own IP │
│ Each NIC announces with its own IP │
│ │
│ arp_ignore=1, arp_announce=2 (per NIC) │
└────────────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────────┐
│ 4. LOOSE REVERSE-PATH FILTER │
│ Don't drop ingress packets just because the reply path │
│ would use a different NIC │
│ │
│ rp_filter=2 (everywhere) │
└────────────────────────────────────────────────────────────────┘
Each piece solves one specific bug:
- Pieces 1 + 2 → egress goes out the right NIC.
- Piece 3 → switches learn the right MAC↔IP mapping per rail.
- Piece 4 → ingress packets don't get silently dropped.
Source-based routing — the full picture
Let's build the picture step by step.
Step 1 — Each NIC has its own IP and gateway
ib0: IP 10.0.1.14/24 gateway 10.0.1.1 (rail 1 leaf)
ib1: IP 10.0.2.14/24 gateway 10.0.2.1 (rail 2 leaf)
ib2: IP 10.0.3.14/24 gateway 10.0.3.1 (rail 3 leaf)
ib3: IP 10.0.4.14/24 gateway 10.0.4.1 (rail 4 leaf)
Step 2 — Create one routing table per NIC
# Each table has TWO routes: subnet and default
ip route add 10.0.1.0/24 dev ib0 src 10.0.1.14 table 101
ip route add default via 10.0.1.1 dev ib0 table 101
ip route add 10.0.2.0/24 dev ib1 src 10.0.2.14 table 102
ip route add default via 10.0.2.1 dev ib1 table 102
# ...same for ib2 (table 103) and ib3 (table 104)
Step 3 — Add source-based rules
# For each NIC, "if source IP matches this NIC's IP, use that NIC's table"
ip rule add from 10.0.1.14 lookup 101 priority 1001
ip rule add from 10.0.2.14 lookup 102 priority 1002
ip rule add from 10.0.3.14 lookup 103 priority 1003
ip rule add from 10.0.4.14 lookup 104 priority 1004
The rule table now looks like:
0: from all lookup local
1001: from 10.0.1.14 lookup 101
1002: from 10.0.2.14 lookup 102
1003: from 10.0.3.14 lookup 103
1004: from 10.0.4.14 lookup 104
32766: from all lookup main
32767: from all lookup default
Step 4 — Test the routing decision
# Verify routing path for traffic from each source IP
ip route get 10.0.2.99 from 10.0.1.14
# → 10.0.2.99 from 10.0.1.14 via 10.0.1.1 dev ib0 table 101
# → Uses ib0's table, exits via ib0 ✓
ip route get 10.0.2.99 from 10.0.2.14
# → 10.0.2.99 from 10.0.2.14 dev ib1 table 102
# → Uses ib1's table, exits via ib1 ✓
What's happening:
- When the QP for ib1 wants to send a packet, the NIC stamps the packet's source IP as 10.0.2.14 (ib1's IP).
- Linux sees source = 10.0.2.14, matches rule 1002, looks up table 102.
- Table 102 has a default route via ib1 → packet egresses via ib1 ✓.
Each NIC's traffic stays on its rail. Multi-rail routing solved.
The ARP flux problem (and the fix)
Source routing handles egress. Ingress has its own problem: ARP flux.
What is ARP flux?
ARP = Address Resolution Protocol. When a switch needs to send a packet to 10.0.1.14, it broadcasts "who has 10.0.1.14?" The host's NIC with that IP should respond with its MAC. The switch then forwards the packet.
Default Linux behavior: ANY NIC will respond to ARP requests for ANY of the host's IPs.
Switch on rail 2 asks: "who has 10.0.1.14?"
(That IP belongs to ib0 on rail 1)
Linux default:
ib1 sees the ARP request, knows 10.0.1.14 is on this host
ib1 RESPONDS with ib1's MAC
Now switch on rail 2 thinks 10.0.1.14 lives behind ib1
→ routes 10.0.1.14 traffic to rail 2's leaf
→ packets show up on ib1
→ strict rp_filter checks the sender's address, finds its reverse route points at a different NIC → DROPS them
That's ARP flux. The host advertises every IP via every NIC. Switches get confused. Lossless RoCE traffic ends up on the wrong rails, dropped by reverse-path filter, mis-classified for PFC/ECN. Everything breaks silently.
The fix — two sysctls per NIC
# arp_ignore=1: Reply ONLY if the requested IP is configured on THIS NIC
echo 1 > /proc/sys/net/ipv4/conf/ib0/arp_ignore
echo 1 > /proc/sys/net/ipv4/conf/ib1/arp_ignore
echo 1 > /proc/sys/net/ipv4/conf/ib2/arp_ignore
echo 1 > /proc/sys/net/ipv4/conf/ib3/arp_ignore
# arp_announce=2: Always announce the IP that belongs to the egress NIC
echo 2 > /proc/sys/net/ipv4/conf/ib0/arp_announce
echo 2 > /proc/sys/net/ipv4/conf/ib1/arp_announce
echo 2 > /proc/sys/net/ipv4/conf/ib2/arp_announce
echo 2 > /proc/sys/net/ipv4/conf/ib3/arp_announce
After these settings:
- ib0 only responds to ARP for 10.0.1.14.
- ib1 only responds to ARP for 10.0.2.14.
- Switches learn the correct MAC↔IP mapping per rail.
- Each rail's traffic stays on its rail.
Possible values cheat sheet
| Setting | Value | Meaning |
|---|---|---|
| arp_ignore | 0 | Reply for any local IP (default; UNSAFE for multi-rail). |
| arp_ignore | 1 | Reply only if requested IP is on this interface. |
| arp_ignore | 2 | Reply only if requested IP is on this interface AND same subnet. |
| arp_announce | 0 | Use any local IP (default; UNSAFE). |
| arp_announce | 1 | Try to use IP from same subnet as target. |
| arp_announce | 2 | Always use the best local IP for the egress interface. |
For RoCE: arp_ignore=1, arp_announce=2 is the standard combination. Some operators run arp_ignore=2 instead — stricter, requires the requester to be in the same subnet, which is fine for rail-optimized topology because the rail leaf always is.
The reverse-path filter (rp_filter)
Linux has another safety mechanism that bites multi-rail setups: the reverse-path filter (rp_filter).
What it does
When a packet arrives on NIC X, rp_filter asks: "if I had to reply to this source IP, would my routing tell me to use NIC X?" If not, the packet is treated as spoofed and dropped silently. It's RFC 3704 anti-spoofing — useful on edge boxes, hostile on multi-NIC HPC hosts.
Why this is a problem for multi-rail
A packet arrives on ib1 with source = 192.168.30.40 (some remote IP)
rp_filter looks up "what's the route to 192.168.30.40?"
With default kernel routing, it might pick ib0 (the lowest-metric NIC)
"My route says I'd reach this peer via ib0, but the packet came in ib1"
→ DROPS the packet
You see this as packets arriving at the NIC (counters increment) but never reaching the application. Silent. Maddening.
The fix
Use loose mode (rp_filter=2) — the kernel checks that the source IP is routable but doesn't insist the incoming NIC matches.
# Loose RP filter — packet must be routable, but any interface is OK
echo 2 > /proc/sys/net/ipv4/conf/all/rp_filter
echo 2 > /proc/sys/net/ipv4/conf/ib0/rp_filter
echo 2 > /proc/sys/net/ipv4/conf/ib1/rp_filter
echo 2 > /proc/sys/net/ipv4/conf/ib2/rp_filter
echo 2 > /proc/sys/net/ipv4/conf/ib3/rp_filter
Values cheat sheet
| Value | Meaning | Use when |
|---|---|---|
| 0 | Disabled | Almost never (loses spoof protection). |
| 1 | Strict (RFC 3704) | Single-NIC simple host. |
| 2 | Loose | Multi-rail HPC ← you. |
Linux uses max(all, per-interface) for these sysctls. Setting all=2 is sufficient on its own; per-NIC zeros are just inheritance defaults, not active misconfig. Set both to be defensive.
accept_local — when both endpoints are on this host
One more knob. Inside a single host (loopback, intra-host RDMA tests), packets can arrive at an interface from a source IP that's also on this host. By default, Linux treats this as "Martian" traffic and drops it.
echo 1 > /proc/sys/net/ipv4/conf/all/accept_local
This lets ib1 receive a packet whose source is 10.0.1.14 (ib0's IP) without dropping it as a spoof. Important for loopback RDMA tests and certain GPU↔NIC topologies where the kernel sees traffic looping locally.
The complete host-side config — in one block
The standard multi-rail RoCE host config, assuming the rail layout above:
# 1. Source-routing tables (one per NIC)
ip route add 10.0.1.0/24 dev ib0 src 10.0.1.14 table 101
ip route add default via 10.0.1.1 dev ib0 table 101
ip route add 10.0.2.0/24 dev ib1 src 10.0.2.14 table 102
ip route add default via 10.0.2.1 dev ib1 table 102
ip route add 10.0.3.0/24 dev ib2 src 10.0.3.14 table 103
ip route add default via 10.0.3.1 dev ib2 table 103
ip route add 10.0.4.0/24 dev ib3 src 10.0.4.14 table 104
ip route add default via 10.0.4.1 dev ib3 table 104
# 2. Source-based rules
ip rule add from 10.0.1.14 lookup 101 priority 1001
ip rule add from 10.0.2.14 lookup 102 priority 1002
ip rule add from 10.0.3.14 lookup 103 priority 1003
ip rule add from 10.0.4.14 lookup 104 priority 1004
# 3. ARP tuning and loose rp_filter (per NIC)
for nic in ib0 ib1 ib2 ib3; do
  echo 1 > /proc/sys/net/ipv4/conf/$nic/arp_ignore
  echo 2 > /proc/sys/net/ipv4/conf/$nic/arp_announce
  echo 2 > /proc/sys/net/ipv4/conf/$nic/rp_filter
done
# 4. Loose rp_filter and accept_local (all interfaces)
echo 2 > /proc/sys/net/ipv4/conf/all/rp_filter
echo 1 > /proc/sys/net/ipv4/conf/all/accept_local
In production, this is rendered by a config tool (Salt, Ansible, or your equivalent) into per-host NetworkManager keyfiles or systemd-networkd .network files. Don't run these by hand on production boxes — they need to survive reboots.
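As a stand-in for that config tool, here is a minimal generator sketch. It is not any particular tool's template: `emit_rail_config` is a hypothetical name, the input format (one `nic ip gateway table` line per rail) is an assumption, every rail is assumed to be a /24, and the rule priority is assumed to be 900 + table number, which reproduces 1001 through 1004. It prints commands and executes nothing.

```shell
# emit_rail_config: read "nic ip gateway table" lines on stdin and print the
# ip route / ip rule commands for each rail. Nothing is executed.
# Assumptions: every rail is a /24 and priority = 900 + table number.
emit_rail_config() {
  while read -r nic ip gw table; do
    [ -z "$nic" ] && continue
    subnet="${ip%.*}.0/24"   # 10.0.2.14 -> 10.0.2.0/24
    echo "ip route add $subnet dev $nic src $ip table $table"
    echo "ip route add default via $gw dev $nic table $table"
    echo "ip rule add from $ip lookup $table priority $((900 + table))"
  done
}
```

Usage: `printf 'ib0 10.0.1.14 10.0.1.1 101\nib1 10.0.2.14 10.0.2.1 102\n' | emit_rail_config`. Review the output before piping it anywhere that runs it.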
Sysctl inheritance: the all override trick
Many operators don't set arp_ignore/arp_announce per NIC. They set them only on all. The effective value for any interface ends up correct due to Linux's inheritance rule:
Effective value = max(all_value, per_interface_value)
For these specific sysctls, "max" wins. So if you set:
/proc/sys/net/ipv4/conf/all/arp_ignore = 2
/proc/sys/net/ipv4/conf/ib0/arp_ignore = 0 (unset, default)
The effective value for ib0 is max(2, 0) = 2. Setting on all is enough.
When you read sysctls and see zeros on per-NIC interfaces, don't panic: check the all value first. If all has the right value, the per-NIC zeros are harmless inheritance defaults.
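The max rule is easy to check mechanically. A hedged sketch: `effective_conf` is a hypothetical helper, and `CONF_ROOT` is an assumed override variable (not a kernel or sysctl feature) that lets you point it at a copied directory tree instead of the live /proc. Use it for arp_ignore, arp_announce and rp_filter, the sysctls that follow the max rule.

```shell
# effective_conf KEY NIC: print max(all, NIC) for a net.ipv4.conf sysctl.
# CONF_ROOT is a hypothetical override; it defaults to the real /proc path.
# A missing file is treated as 0.
effective_conf() {
  root="${CONF_ROOT:-/proc/sys/net/ipv4/conf}"
  all=$(cat "$root/all/$1" 2>/dev/null || echo 0)
  own=$(cat "$root/$2/$1" 2>/dev/null || echo 0)
  if [ "$all" -gt "$own" ]; then echo "$all"; else echo "$own"; fi
}
```

Usage: `effective_conf arp_ignore ib0` prints the value the kernel actually applies on ib0.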
A cleaner pattern than per-NIC explicit settings:
sudo sysctl -w net.ipv4.conf.all.arp_ignore=2
sudo sysctl -w net.ipv4.conf.all.arp_announce=1
sudo sysctl -w net.ipv4.conf.all.rp_filter=2
sudo sysctl -w net.ipv4.conf.all.accept_local=1
arp_ignore=2 is stricter than =1:
- =1: respond to ARP if the requested IP is on this NIC.
- =2: respond ONLY if the requested IP is on this NIC AND the requester is in the same subnet.
For multi-rail RoCE, =2 is the right call.
Discovering current routing state on a host
When you walk up to an unfamiliar box, this is the audit script:
cat > /tmp/multirail_routing.sh << 'SCRIPT_EOF'
echo "=== ip rule show ==="
ip rule show
echo ""
echo "=== Routing tables ==="
echo "-- table main --"
ip route show table main | head -30
for t in 101 102 103 104 105 106 107 108; do
echo "-- table $t --"
ip route show table $t 2>/dev/null
done
echo ""
echo "=== IP addresses per NIC ==="
for nic in ib0 ib1 ib2 ib3; do
echo "-- $nic --"
ip addr show dev $nic | grep -E "inet |state"
done
echo ""
echo "=== ARP / rp_filter / accept_local sysctls ==="
for nic in ib0 ib1 ib2 ib3 all default; do
echo "-- $nic --"
for k in arp_ignore arp_announce rp_filter accept_local; do
v=$(cat /proc/sys/net/ipv4/conf/$nic/$k 2>/dev/null)
echo " $k = $v"
done
done
echo ""
echo "=== ip route get tests (where does traffic actually go?) ==="
for src in 10.0.1.14 10.0.2.14 10.0.3.14 10.0.4.14; do
echo "-- from $src to a peer (10.0.99.99) --"
ip route get 10.0.99.99 from $src 2>/dev/null
done
SCRIPT_EOF
bash /tmp/multirail_routing.sh
What you're checking:
- ip rule show: are there source-based rules already?
- Custom routing tables (101–108): do they exist?
- Per-NIC IPs: what's actually assigned?
- arp_ignore=1, arp_announce=2, rp_filter=2, accept_local=1: are sysctls set?
- ip route get tests: would traffic from each NIC's IP actually exit through that NIC?
What a live, well-configured host looks like
Here's a real audit from a 4-NIC training host (host-01) running source routing in production:
| Attribute | Live value | Notes |
|---|---|---|
| Source-based rules | Configured at a high-priority IP allocation block | Rules slot in below local, above main. |
| Per-NIC routing tables | 101–104 populated | Each routes the RoCE address space via its NIC. |
| Effective arp_ignore | 2 (via all inheritance) | Stricter than the typical =1 recommendation. |
| Effective arp_announce | 1 (via all inheritance) | OK. |
| Effective rp_filter | 2 (loose mode) | Correct for multi-rail. |
| Effective accept_local | 1 (via all inheritance) | OK. |
| Control plane | Separate bonded interface on a standard DC subnet | RoCE NICs do not carry frontend traffic. |
| Outbound TCP CC | DCTCP for internal subnets, CUBIC for default | ECN-aware DC TCP variant on the control plane. |
| K8s overlay | VXLAN-based pod CIDR (e.g., Flannel) | Pod CIDRs reachable via the overlay interface. |
A real-world IP plan twist — fewer subnets, more NICs
The 4-distinct-rail-subnets pattern is the textbook layout. In the field, you'll often see a simpler variation:
RAIL SUBNET 10.0.1.0/24 RAIL SUBNET 10.0.2.0/24
ib0 10.0.1.14 gw 10.0.1.1 ib2 10.0.2.14 gw 10.0.2.1
ib1 10.0.1.16 gw 10.0.1.1 ib3 10.0.2.16 gw 10.0.2.1
(shared gateway) (shared gateway)
Each /24 has two NICs sharing the gateway. This shows up in proof-of-concept clusters and in some smaller deployments where four distinct rail subnets aren't strictly needed.
Source-based rules with this layout look like:
1001: from 10.0.1.14 lookup 101 ← ib0
1002: from 10.0.1.16 lookup 102 ← ib1
1003: from 10.0.2.14 lookup 103 ← ib2
1004: from 10.0.2.16 lookup 104 ← ib3
Each table has just ONE route — the entire RoCE address space via that NIC:
table 101: 10.0.0.0/16 via 10.0.1.1 dev ib0 metric 10
table 102: 10.0.0.0/16 via 10.0.1.1 dev ib1 metric 10
table 103: 10.0.0.0/16 via 10.0.2.1 dev ib2 metric 10
table 104: 10.0.0.0/16 via 10.0.2.1 dev ib3 metric 10
Tables 101 and 102 share gateway 10.0.1.1 because ib0 and ib1 are on the same /24. Tables 103 and 104 share 10.0.2.1 for the same reason. The dev field forces the egress NIC, even though the gateway is shared.
Two ip route get tests prove source routing works:
from 10.0.1.14 → via 10.0.1.1 dev ib0 table 101 ✓
from 10.0.2.14 → via 10.0.2.1 dev ib2 table 103 ✓
Each source IP correctly routes via its dedicated table and exits through its dedicated NIC. Multi-rail source routing is functional on this host.
Three networking layers, one host
Production AI training hosts juggle three networking layers simultaneously:
- Backend RoCE: ib0–ib3 with source-routed traffic to the rail leaves.
- Frontend / control plane: a bonded interface with standard DC routing for SSH, monitoring, image pulls.
- K8s pod overlay: a VXLAN tunnel for pod-to-pod East-West traffic that doesn't touch the RoCE rails.
When you see "DCTCP enabled for internal traffic" on these hosts, that's the control plane: pods talking to other DC subnets over the bonded NIC use Data Center TCP (ECN-aware) instead of standard CUBIC. The RoCE rails don't use TCP at all — they're RDMA — so DCTCP doesn't apply there.
What's missing when SR-IOV joins the party
Once you start handing out VFs from these NICs to pods, you'll need to add:
- Additional source rules for each VF's IP (VF0 → table 105, VF1 → 106, etc.), or extend the existing tables to cover VF address ranges.
- Per-VF routing tables, OR a single table per rail that handles all VFs in that rail's subnet.
- ARP / sysctl already covered by current all settings; no change needed.
- Bond / frontend interface unchanged: VFs use the ib* rail paths, not the frontend.
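For the "one rule per VF" option, the extra rules can be generated the same way as the per-NIC ones. A hedged sketch: `emit_vf_rules` is a hypothetical name, the VF addresses in the usage line are made up for illustration, and the priority scheme (900 + table) is an assumption that continues 1001 through 1004. It only prints commands.

```shell
# emit_vf_rules FIRST_TABLE VF_IP...: print one source rule per VF address,
# numbering tables upward from FIRST_TABLE. Nothing is executed.
# Assumption: priority = 900 + table, continuing the 1001..1004 scheme.
emit_vf_rules() {
  table=$1; shift
  for ip in "$@"; do
    echo "ip rule add from $ip lookup $table priority $((900 + table))"
    table=$((table + 1))
  done
}
```

Usage: `emit_vf_rules 105 10.0.1.114 10.0.1.115` prints rules for tables 105 and 106. Each of those tables still needs its own routes.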
Operations playbook — create, validate, verify, use
A. Create — applying the config
A.1 Assign IPs to each NIC (typically pre-done by the image)
# Static (nmcli — preferred for persistence)
sudo nmcli con mod ib0 ipv4.method manual ipv4.addresses 10.0.1.14/24
sudo nmcli con up ib0
# Or transient (lasts only until reboot)
sudo ip addr add 10.0.1.14/24 dev ib0
sudo ip link set ib0 up
A.2 Create per-NIC routing tables
# Per ib0 — replace IPs with your real config
sudo ip route add 10.0.1.0/24 dev ib0 src 10.0.1.14 table 101
sudo ip route add default via 10.0.1.1 dev ib0 table 101
# Repeat for ib1 (table 102), ib2 (table 103), ib3 (table 104)
Persistence: write these into /etc/NetworkManager/dispatcher.d/ or systemd-networkd .network files. In production, your host config tool emits these on each boot.
A.3 Add source-based rules
sudo ip rule add from 10.0.1.14 lookup 101 priority 1001
sudo ip rule add from 10.0.2.14 lookup 102 priority 1002
sudo ip rule add from 10.0.3.14 lookup 103 priority 1003
sudo ip rule add from 10.0.4.14 lookup 104 priority 1004
A.4 Apply per-NIC sysctls
for nic in ib0 ib1 ib2 ib3; do
sudo sysctl -w net.ipv4.conf.$nic.arp_ignore=1
sudo sysctl -w net.ipv4.conf.$nic.arp_announce=2
sudo sysctl -w net.ipv4.conf.$nic.rp_filter=2
done
sudo sysctl -w net.ipv4.conf.all.rp_filter=2
sudo sysctl -w net.ipv4.conf.all.accept_local=1
For persistence: drop a file in /etc/sysctl.d/99-roce-multirail.conf with the same settings.
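A minimal sketch of that file's contents, generated rather than hand-typed. `emit_sysctl_conf` is a hypothetical helper; the values mirror step A.4 and nothing else.

```shell
# emit_sysctl_conf NIC...: print sysctl.d lines matching step A.4.
emit_sysctl_conf() {
  for nic in "$@"; do
    echo "net.ipv4.conf.$nic.arp_ignore = 1"
    echo "net.ipv4.conf.$nic.arp_announce = 2"
    echo "net.ipv4.conf.$nic.rp_filter = 2"
  done
  echo "net.ipv4.conf.all.rp_filter = 2"
  echo "net.ipv4.conf.all.accept_local = 1"
}
```

One way to use it: `emit_sysctl_conf ib0 ib1 ib2 ib3 | sudo tee /etc/sysctl.d/99-roce-multirail.conf`, then `sudo sysctl --system` to load it. Per-interface keys only apply if the interface exists when the file is read, so check the values again after a reboot.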
B. Validate — confirming the config is right
B.1 Verify the rule list
ip rule show
# Expected:
# 0: from all lookup local
# 1001: from 10.0.1.14 lookup 101
# 1002: from 10.0.2.14 lookup 102
# 1003: from 10.0.3.14 lookup 103
# 1004: from 10.0.4.14 lookup 104
# 32766: from all lookup main
# 32767: from all lookup default
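Eyeballing that list works for one host. For a fleet, a small checker helps. A hedged sketch: `missing_rules` is a hypothetical name, and the `ip:table` argument format is an assumption of this example. It prints every expected pair that has no exact `from <ip> lookup <table>` rule.

```shell
# missing_rules: stdin = `ip rule show` output; args = "ip:table" pairs.
# Prints each expected pair that has no matching rule; prints nothing if
# every pair is present.
missing_rules() {
  rules=$(cat)
  for pair in "$@"; do
    ip="${pair%%:*}"; table="${pair##*:}"
    printf '%s\n' "$rules" | awk -v ip="$ip" -v t="$table" '
      $2 == "from" && $3 == ip && $4 == "lookup" && $5 == t { found = 1 }
      END { exit !found }
    ' || echo "$pair"
  done
}
```

Usage: `ip rule show | missing_rules 10.0.1.14:101 10.0.2.14:102 10.0.3.14:103 10.0.4.14:104`. Empty output means all four rules are in place.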
B.2 Verify each custom routing table is populated
for t in 101 102 103 104; do
echo "table $t:"
ip route show table $t
done
# Each table should have:
# <subnet>/24 dev ibN src <ip>
# default via <gateway> dev ibN
B.3 Verify per-NIC sysctls
for nic in ib0 ib1 ib2 ib3; do
echo "$nic:"
sysctl net.ipv4.conf.$nic.arp_ignore
sysctl net.ipv4.conf.$nic.arp_announce
sysctl net.ipv4.conf.$nic.rp_filter
done
# Expected:
# arp_ignore = 1 (or 2, via 'all' inheritance)
# arp_announce = 2
# rp_filter = 2
C. Verify — proving traffic actually uses the right rail
C.1 ip route get — the routing decision oracle
# Test 1: traffic from ib0's IP to a peer
ip route get 10.0.99.99 from 10.0.1.14
# Should output: ... dev ib0 ... table 101
# Test 2: traffic from ib1's IP to same peer
ip route get 10.0.99.99 from 10.0.2.14
# Should output: ... dev ib1 ... table 102
# All four should pick the correct dev. If any pick the wrong NIC,
# your source-rule mapping is broken.
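To script that check instead of reading the output by eye, pull the `dev` and `table` words out of the `ip route get` line. A hedged sketch; `route_get_field` is a hypothetical helper name.

```shell
# route_get_field FIELD: read `ip route get` output on stdin and print the
# word that follows the first occurrence of FIELD (e.g. "dev" or "table").
route_get_field() {
  awk -v key="$1" '
    { for (i = 1; i < NF; i++) if ($i == key) { print $(i + 1); exit } }
  '
}
```

Usage: `ip route get 10.0.99.99 from 10.0.2.14 | route_get_field dev` should print ib1 on the four-rail layout above. Loop it over all four source IPs and compare against the NIC you expect.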
C.2 Capture ARP responses
# In one terminal, capture ARP traffic on each NIC
for nic in ib0 ib1 ib2 ib3; do
sudo tcpdump -nn -e -i $nic arp -c 10 &
done
# In another terminal, simulate ARP probes from a peer
# (or use arping from another host targeting each of your IPs)
# Expected: ib0 ONLY answers ARP for 10.0.1.14, etc.
# If multiple NICs answer for one IP, arp_ignore is wrong.
C.3 Run a multi-rail RDMA test
# Set up ib_send_bw on a peer host listening on its ib1 IP
# On peer: ib_send_bw -d ib1 -x 3
# From this host:
ib_send_bw -d ib0 -x 3 <peer_ib0_ip> # ib0 → peer's ib0
ib_send_bw -d ib1 -x 3 <peer_ib1_ip> # ib1 → peer's ib1
ib_send_bw -d ib2 -x 3 <peer_ib2_ip> # ib2 → peer's ib2
ib_send_bw -d ib3 -x 3 <peer_ib3_ip> # ib3 → peer's ib3
# All four should hit ~380-395 Gbps each on 400G NICs.
# Aggregate: 1.5+ Tbps off the host.
# If one is dramatically slower, routing is putting traffic on the wrong rail.
C.4 Counter verification per NIC
# Before tests, snapshot port_xmit_data per NIC
for nic in ib0 ib1 ib2 ib3; do
cat /sys/class/infiniband/$nic/ports/1/counters/port_xmit_data
done
# Run benchmarks against each NIC
# After tests, compute deltas
for nic in ib0 ib1 ib2 ib3; do
cat /sys/class/infiniband/$nic/ports/1/counters/port_xmit_data
done
Each NIC's port_xmit_data should increment ONLY during its own test. If ib1's counter rises during the ib0 test, ib1 is mistakenly carrying ib0's traffic — that's a smoking gun for a broken source rule.
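The delta arithmetic has one trap: port_xmit_data counts 4-byte words, not bytes. A minimal sketch of the conversion, with a hypothetical helper name and no handling for counter wrap:

```shell
# xmit_delta_bytes BEFORE AFTER: convert two port_xmit_data readings to bytes.
# The counter counts 4-byte words, so the difference is multiplied by 4.
# Assumption: the counter did not wrap between the two readings.
xmit_delta_bytes() {
  echo $(( ($2 - $1) * 4 ))
}
```

Usage: save the counter before and after the run, then `xmit_delta_bytes "$before" "$after"`. A near-zero delta on the NIC under test, or a large one on a NIC that should be idle, points at routing.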
D. Use — how applications consume multi-rail
D.1 Pin a process to a specific NIC
Two complementary ways:
# Option 1: Bind the source IP — kernel picks NIC via source rules
SRC_IP=10.0.2.14 # ib1's IP
my_app --bind-ip $SRC_IP <args>
# Option 2: Set NIC explicitly via RDMA app env vars
export NCCL_IB_HCA=mlx5_2 # or ib1, depending on naming
my_app <args>
NCCL uses both: NCCL_IB_HCA picks the RDMA device; the RDMA QP marks its source IP correctly; kernel routes via the source rule. Three layers, all aligned.
D.2 NCCL multi-rail configuration
For 4 ranks per host, each using a different rail:
# rank 0 → ib0
export NCCL_IB_HCA=ib0
export NCCL_IB_GID_INDEX=3
export NCCL_SOCKET_IFNAME=ib0
mpirun -np 1 ... my_training_rank_0
# rank 1 → ib1
NCCL_IB_HCA=ib1 NCCL_IB_GID_INDEX=3 ... my_training_rank_1
# etc.
In practice, NCCL launchers do this automatically via NUMA-aware rank-to-NIC assignment based on the nvidia-smi topo -m matrix. The launcher reads the PCIe topology, figures out which NIC is closest to which GPU, and sets NCCL_IB_HCA accordingly.
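If you don't have a topology-aware launcher, a plain round-robin mapping is a simpler stand-in. To be clear, this is not the NUMA-aware assignment described above, and it can pair a GPU with a distant NIC. `nic_for_rank` is a hypothetical helper, and the sketch assumes your launcher exports a local-rank variable such as LOCAL_RANK.

```shell
# nic_for_rank RANK NIC...: pick a NIC for a local rank by round-robin.
# This is a stand-in for topology-aware placement, not what NCCL launchers do.
nic_for_rank() {
  rank=$1; shift
  idx=$(( rank % $# ))   # $# is now the number of NICs
  shift "$idx"
  echo "$1"
}
```

Usage: `export NCCL_IB_HCA=$(nic_for_rank "$LOCAL_RANK" ib0 ib1 ib2 ib3)`.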
D.3 Test with arping (cheap connectivity check)
# Force arping to use a specific source IP
arping -I ib0 -s 10.0.1.14 -c 3 10.0.1.1
arping -I ib1 -s 10.0.2.14 -c 3 10.0.2.1
arping -I ib2 -s 10.0.3.14 -c 3 10.0.3.1
arping -I ib3 -s 10.0.4.14 -c 3 10.0.4.1
Each should reach its rail's leaf gateway successfully. If any fails or replies come back via a different NIC, your ARP/routing config is broken.
D.4 Troubleshoot when a NIC isn't carrying its traffic
# 1. Did the kernel even consider routing through this NIC?
ip route get <dst> from <nic_src_ip>
# 2. Is the rule in place?
ip rule show | grep <nic_src_ip>
# 3. Is the table populated?
ip route show table <table_id>
# 4. Are ARP settings correct?
sysctl net.ipv4.conf.<nic>.arp_ignore
sysctl net.ipv4.conf.<nic>.arp_announce
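# 5. Is rp_filter dropping it?
nstat -az | grep -i IPReversePathFilter
# A rising TcpExtIPReversePathFilter count = rp_filter dropped packets.
# For martian-source drops, set net.ipv4.conf.all.log_martians=1 and check dmesg.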
That five-step sequence resolves most "a NIC went quiet" issues. Walk it in order.
How the hyperscalers do it
| Operator | Multi-rail routing approach |
|---|---|
| Most production AI shops | Standard Linux source-based routing + per-NIC sysctls, rendered by config tools (Salt, Ansible) into NM keyfiles. |
| Meta | Same pattern, internal tooling. They contributed many of the upstream kernel improvements for multi-NIC routing. |
| Google | Multi-NIC GCP VMs use route-based isolation; bare-metal GPU pods use source routing similar to the pattern here. |
| NVIDIA DGX SuperPOD ref | Documented in the DGX networking guide — same source-based routing pattern, with mlnx_qos --trust dscp integration. |
| AWS EFA | Each EC2 instance gets ONE EFA per NIC; no in-host multi-rail routing needed because rails are isolated at the VM boundary. |
Common theme: Everyone does roughly the same thing. The Linux source-routing pattern is the de facto standard for multi-rail HPC hosts. It's not exotic — it's just unfortunately not the default. The kernel could ship with multi-NIC-aware policy routing out of the box, but it doesn't.
Self-check
- Why does default Linux destination-based routing fail for multi-rail RoCE?
- How many routing tables can Linux have, and which are the three default ones?
- What's the difference between ip rule and ip route?
- What's the ARP flux problem in one sentence? What two sysctls fix it?
- rp_filter=2 vs rp_filter=1: which mode for multi-rail, and why?
- ip route get 10.0.99.99 from 10.0.2.14 returns "dev ib0". What's wrong?
- NCCL chooses NIC via NCCL_IB_HCA=ib1. Why does the kernel ALSO need to be configured to route via ib1?
- What's accept_local=1 and when do you need it?
What you should remember
- Default Linux routing is destination-only — give it 4 NICs in overlapping subnets and it picks one, sends everything that way, and ignores the rest. Your fabric goes to ~25% utilization, silently.
- Linux has 256 routing tables. You'll use 4 of them (one per rail), each with a single default route via its NIC.
- ip rule decides which table to consult; ip route populates the table. The rule matches on source IP. Source IP comes from the NIC's QP. The match picks the right table. The table sends packets out the right NIC.
- ARP flux is the silent killer of multi-rail. Set arp_ignore=1 (or =2) and arp_announce=2 so each NIC only speaks for its own IP. Otherwise switches learn the wrong MAC↔IP mapping and your rails cross-pollinate.
- rp_filter=2 (loose) on every NIC: strict mode drops legitimate ingress packets because the reply route would use a different NIC.
- accept_local=1 keeps intra-host RDMA loopback tests from being classified as Martian.
- Your debugging cheat code is ip route get <dst> from <src>: it tells you exactly which NIC and table the kernel will pick. If it names the wrong NIC, source routing is broken.
- At 4 NICs × 400 Gbps, you should see ~1.5 Tbps off the host. Anything dramatically less than that on aggregate ib_send_bw is almost always a routing problem before it's a hardware problem.
Next: NCCL and GPUDirect → — how NCCL pins ranks to rails using the topology matrix, what GPUDirect RDMA actually does to bypass host memory, and the env vars that make or break collective throughput.