Hash Polarization & Elephant Flows

ECMP is one of the great inventions of fabric design. It takes N equal-cost paths between two endpoints and hashes flows across them — which balances the load well, provided the flows are statistically independent and short-lived.

AI training violates both assumptions. The flows are synchronized, identical in shape, and persistent for seconds. ECMP doesn't fail mathematically; it fails empirically at exactly the wrong moments.

This page explains the mechanism and what hyperscalers do about it.


The elephant flow problem

In a traditional DC:

  • Most flows are small (mice — KB to MB).
  • Flows arrive and depart randomly.
  • Hash(5-tuple) over 8 ECMP paths gives ~12.5% load per path. Mostly balanced.

In AI training:

  • Every flow is an elephant — GB to TB.
  • Every flow arrives at the same time (start of AllReduce).
  • Every flow has the same 5-tuple shape — same source IP range (GPU NICs), same dest range, same dest port (RoCE v2 = UDP 4791).

When several of those elephant flows hash to the same egress port, you can end up with 6 of 8 elephants squeezed onto a couple of paths while other paths sit idle. That's hash polarization. And it's deterministic — the same RoCE v2 traffic pattern collides the same way every step, for the entire training job.
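
To make the mechanism concrete, here is a toy model in Python. The addresses and QP port numbering are invented, and CRC32 stands in for whatever hash function the switch actually uses — the point is only that every input is fixed, so the output is too:

```python
# Toy ECMP model: hash 8 synchronized RoCE v2 elephant flows onto 8 equal-cost
# paths and count elephants per path. Addresses and QP port numbering are
# invented; CRC32 stands in for the switch's real hash function.
import zlib
from collections import Counter

N_PATHS = 8

def ecmp_path(src_ip, dst_ip, src_port, dst_port, proto, n_paths=N_PATHS):
    """Map a 5-tuple to an egress path, the way a switch's ECMP hash would."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return zlib.crc32(key) % n_paths

# One AllReduce step: GPU i on server A opens a flow to GPU i on server B.
# dst port and protocol are fixed by RoCE v2; src port is deterministic per QP.
flows = [
    (f"10.0.1.{gpu}", f"10.0.2.{gpu}", 49152 + gpu, 4791, "UDP")
    for gpu in range(8)
]

load = Counter(ecmp_path(*flow) for flow in flows)
print(dict(load))
# Every input above is fixed, so whatever imbalance this prints is the
# imbalance you get on every step of the job -- and with only 8 flows over
# 8 paths, an even spread is the exception, not the rule.
```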


Why traditional ECMP can't fix this

ECMP's hash uses the 5-tuple: <src IP, dst IP, src port, dst port, protocol>. In RoCE v2 traffic:

  • src IP — limited range (a server's NIC IPs).
  • dst IP — limited range (target server's NIC IPs).
  • src port — picked by the NIC. Sometimes randomized; sometimes deterministic per queue pair.
  • dst port — always 4791 (RoCE v2 well-known port).
  • protocol — always UDP.

So the entropy that ECMP needs comes mostly from the src port. If the NIC picks src ports deterministically (per QP), every step of training hashes to the same egress port. Persistent collisions.
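
Two effects compound here, and a quick calculation separates them. Even a perfectly random per-flow hash collides often when a handful of elephants share a handful of paths — a birthday-style effect. Deterministic src ports then make whichever collision you get permanent. The sketch below works out the first part, assuming 8 flows and 8 paths:

```python
# With an ideal random hash, P(8 flows land on 8 distinct paths) = 8!/8^8.
from math import perm

n_paths, n_flows = 8, 8
p_clean = perm(n_paths, n_flows) / n_paths ** n_flows
print(f"P(no collision)   = {p_clean:.4f}")      # ~0.0024, i.e. about 0.24%
print(f"P(some collision) = {1 - p_clean:.4f}")  # ~0.9976
# Randomness only decides *which* collision you get; deterministic src ports
# make it the same collision on every training step.
```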

You can try to fix this by tuning the hash:

  • Use UDP src port entropy — some NICs let you randomize it per packet.
  • Add the QP number into the hash — vendor-specific.
  • Increase ECMP fan-out — higher-radix switches have more paths.
  • Add a tunnel layer — VXLAN or similar with random outer port. Adds overhead.

These help but don't fully solve the problem at 100K-GPU scale.
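
How far does src-port entropy get you? Re-running the toy model above with the NIC drawing a fresh random src port per queue pair on every step (an assumed capability, not a specific NIC feature) shows the improvement — and its limit:

```python
# Re-run the toy ECMP model with randomized per-QP src ports each step, and
# track the worst per-path pile-up over many steps. (Randomized ports are an
# assumption here, not a specific vendor feature.)
import random
import zlib
from collections import Counter

N_PATHS = 8

def ecmp_path(src_ip, dst_ip, src_port, dst_port, proto, n_paths=N_PATHS):
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return zlib.crc32(key) % n_paths

worst = 0
for step in range(1000):                      # 1000 training steps
    flows = [
        (f"10.0.1.{gpu}", f"10.0.2.{gpu}",
         random.randint(49152, 65535), 4791, "UDP")
        for gpu in range(8)
    ]
    load = Counter(ecmp_path(*f) for f in flows)
    worst = max(worst, max(load.values()))

print(f"worst per-path elephant count over 1000 steps: {worst}")
# The collisions are no longer the *same* ones every step, which helps, but
# most runs still see 3-4 elephants stacked on one path at some point.
```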


What hyperscalers do

Three approaches, each going further than the last.

1. Static rail-optimized topology (mid scale)

By using a rail-optimized topology (previous page), you shrink the problem: each rail has its own leaf, and the GPUs that need to talk to each other (same index) share only that rail. ECMP collisions are confined to a single rail's links, so the blast radius is smaller.

This works for up to ~2K GPUs. Beyond that, even within a single rail you have too many elephants.
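
A rough sketch of why the radius shrinks (a hypothetical 8-rail layout, not any specific cluster): GPU i on every server attaches its NIC to rail leaf i, so same-index GPUs never compete with the other seven rails' elephants for the same links.

```python
# Rail-optimized attachment sketch: the GPU index picks the leaf, not the server.
def rail_leaf(server_id: int, gpu_index: int) -> str:
    """Every server's GPU i plugs into rail leaf i (hypothetical naming).
    server_id is unused on purpose: the leaf depends only on the GPU index."""
    return f"leaf-rail-{gpu_index}"

# GPU 3 on server 17 and GPU 3 on server 42 reach each other through one leaf:
assert rail_leaf(17, 3) == rail_leaf(42, 3) == "leaf-rail-3"
# Elephants from rail 3 can still collide on that leaf's uplinks, but only with
# other rail-3 elephants -- the smaller collision radius described above.
```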

2. Packet spraying (custom transports)

The hyperscalers built new transports specifically to solve this. From Section 1 (Transport Options):

  • AWS SRD — sprays packets across all paths to the destination, reassembles at the receiver.
  • Google Falcon — Path Load Balancing (PLB) in the NIC.
  • MRC (OpenAI/Microsoft/NVIDIA/AMD/Broadcom/Intel) — multipath transport, packet spraying with μs failover.
  • UEC (the open standard) — packet spraying built into the transport layer.

These break the "one flow = one path" assumption that ECMP needs. Instead of relying on the network to balance, the transport takes responsibility for spreading bytes across paths and reassembling out-of-order.

Trade-off: the receive side has to reassemble, and doing that at 400 Gbps requires hardware. That's why MRC, Falcon, and SRD all ship as new NIC silicon.
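
A minimal sketch of the spray-and-reassemble pattern — not the actual wire format of SRD, Falcon, or MRC: the sender stamps each packet with a sequence number and spreads packets round-robin across every available path; the receiver buffers early arrivals and releases data in order.

```python
# Spray-and-reassemble sketch: one elephant's packets spread over all paths,
# reordered by the receiver. The real transports do this in NIC hardware.
from itertools import cycle

def spray(packets, n_paths):
    """Assign each (seq, payload) packet to a path round-robin."""
    paths = cycle(range(n_paths))
    return [(next(paths), seq, payload) for seq, payload in packets]

class Reassembler:
    """Release payloads in sequence order, buffering anything that arrives early."""
    def __init__(self):
        self.next_seq, self.pending = 0, {}
    def receive(self, seq, payload):
        self.pending[seq] = payload
        out = []
        while self.next_seq in self.pending:       # drain any contiguous run
            out.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return out

# One message split into 8 packets, sprayed over 4 paths, arriving path-major
# (i.e. out of sequence order):
packets = [(seq, f"chunk-{seq}") for seq in range(8)]
arrivals = sorted(spray(packets, n_paths=4), key=lambda p: (p[0], p[1]))
rx = Reassembler()
delivered = [c for _, seq, payload in arrivals for c in rx.receive(seq, payload)]
assert delivered == [f"chunk-{seq}" for seq in range(8)]
```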

3. Adaptive routing (NVIDIA Spectrum-X, IB)

Some switches do adaptive routing in silicon — they look at per-port queue depths in real time and route around congestion per packet. This is a long-standing InfiniBand capability, and it's what NVIDIA Spectrum-X brings to Ethernet.

Adaptive routing has the same effect as packet spraying — packets arrive out of order, transport has to reassemble — but the decision is made by the switch, not the sender.
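
Conceptually it looks like the sketch below (a toy model, not Spectrum-X or InfiniBand internals): instead of hashing the flow, the switch forwards each packet out of whichever candidate port currently has the shallowest queue.

```python
# Per-packet adaptive routing sketch: forward each packet out the least-loaded
# candidate port. Queue depths here are simple counters; a real switch reads
# them from its egress buffers.
def adaptive_egress(queue_depths):
    """Return the index of the shallowest egress queue among ECMP candidates."""
    return min(range(len(queue_depths)), key=lambda port: queue_depths[port])

queues = [0, 0, 0, 0]                 # 4 candidate uplinks, initially empty
for _ in range(12):                   # one elephant's packets, back to back
    port = adaptive_egress(queues)
    queues[port] += 1                 # enqueue the packet on that port
print(queues)                         # [3, 3, 3, 3]: load spreads evenly, but
                                      # packets may now arrive out of order
```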


What this curriculum picks

You met this in Section 1: RoCEv2 + DCQCN + PFC/ECN is what most production AI fabrics run today, and rail-optimized topology is how they cope with the hash-polarization problem without going to packet-sprayed transports.

For 100K+ GPU scale, the long-term answer is UEC (Ultra Ethernet) or MRC — open / hyperscaler-grade transports that bake packet spraying in. That's where the industry is heading.

For now: rail-optimization + careful ECMP-hash tuning + per-priority QoS gets you to ~10K-GPU scale.


What you should remember

  • ECMP works because flows are independent and short. AI training flows are synchronized and persistent.
  • RoCE v2 flows have low 5-tuple entropy — same dst port (4791), same protocol (UDP), narrow src/dst ranges.
  • Hash polarization is deterministic — the same training step collides the same way, every step, for the whole job.
  • Rail-optimized topology helps by reducing the collision radius.
  • Packet-spraying transports (MRC, Falcon, SRD, UET) are the long-term industry answer. They move balance from the network to the transport.
  • Adaptive routing (IB, Spectrum-X) moves the balance into the switch silicon — same effect, different layer.

Next: more sections incoming — Building a Training Cluster (deployment), Switch QoS (PFC / ECN / DCQCN configuration), Host Networking (SR-IOV, Multus). For now, head back to the curriculum index or revisit RoCE v2.