Topology Awareness

A production communication library doesn't treat every link the same. It builds a routing table for GPU-to-GPU reachability — and uses it.

After this page, you'll be able to

Rank the interconnect hierarchy: NVLink (900 GB/s) > PCIe Gen5 (128 GB/s) > InfiniBand (200–400 Gb/s, i.e. 25–50 GB/s per port), and explain how NCCL chooses among them.
Map NCCL's preference to Administrative Distance — connected vs OSPF vs eBGP.
Explain fat-tree-aware rings — why NCCL keeps traffic rack-local, and what NCCL_TOPO_FILE overrides.

The hierarchy inside one server

Inside a single server — a DGX H100, say — GPU interconnects exist at several layers:

NVLink ≫ PCIe > InfiniBand > Ethernet. NCCL treats this like administrative distance and routes GPU-to-GPU on the lowest-distance path.
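The figures quoted on this page mix gigabytes and gigabits per second, so the gap between tiers is easier to see once everything is in one unit. The snippet below is a small arithmetic sketch, not a measurement: the `LINKS` table simply restates this page's headline numbers, and the entry names are illustrative. Vendor headline figures are also not always counted the same way (per direction versus aggregate), so treat the result as an order-of-magnitude ranking.

```python
# Headline figures quoted on this page; units differ, so normalise first.
LINKS = {
    "NVLink": (900, "GB/s"),
    "PCIe Gen5": (128, "GB/s"),
    "InfiniBand 400G": (400, "Gb/s"),
    "InfiniBand 200G": (200, "Gb/s"),
}


def to_gigabytes_per_second(value, unit):
    """Convert a bandwidth figure to GB/s (8 bits per byte)."""
    if unit == "GB/s":
        return float(value)
    if unit == "Gb/s":
        return value / 8
    raise ValueError(f"unknown unit: {unit}")


normalised = {name: to_gigabytes_per_second(*spec) for name, spec in LINKS.items()}

# Fastest first.
ranking = sorted(normalised, key=normalised.get, reverse=True)

for name in ranking:
    print(f"{name:16s} {normalised[name]:6.1f} GB/s")
```

In a common unit, a 400 Gb/s InfiniBand port is 50 GB/s, well below both of the intra-node figures above.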

NCCL detects this hierarchy at initialisation. It always prefers NVLink for intra-node traffic, routes through PCIe when two GPUs sit in different NVLink groups on the same host, and uses InfiniBand only when it has to cross hosts.
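The selection rule in the paragraph above can be modelled in a few lines. This is a toy model, not NCCL's code or API: the `Gpu` class, the `nvlink_group` field, the `PREFERENCE` ranks and both functions are hypothetical names introduced only to illustrate the decision order (NVLink if available, else PCIe on the same host, else InfiniBand across hosts).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    host: str          # hypothetical host name
    nvlink_group: int  # hypothetical ID of the NVLink domain on that host
    index: int         # GPU index within the host


# Lower value = more preferred, mirroring the order described in the text.
PREFERENCE = {"NVLink": 0, "PCIe": 1, "InfiniBand": 2}


def candidate_transports(a: Gpu, b: Gpu) -> list[str]:
    """Transports that could carry traffic between a and b in this toy model."""
    if a.host != b.host:
        return ["InfiniBand"]          # crossing hosts: only the network is left
    candidates = ["PCIe"]              # same host: PCIe is assumed to be reachable
    if a.nvlink_group == b.nvlink_group:
        candidates.append("NVLink")    # same NVLink domain: direct path exists
    return candidates


def pick_transport(a: Gpu, b: Gpu) -> str:
    """Pick the most preferred transport among the candidates."""
    return min(candidate_transports(a, b), key=PREFERENCE.__getitem__)


g0 = Gpu("node-a", nvlink_group=0, index=0)
g1 = Gpu("node-a", nvlink_group=0, index=1)
g4 = Gpu("node-a", nvlink_group=1, index=4)
h0 = Gpu("node-b", nvlink_group=0, index=0)

print(pick_transport(g0, g1))  # same host, same NVLink group
print(pick_transport(g0, g4))  # same host, different NVLink group
print(pick_transport(g0, h0))  # different hosts
```

The three calls cover the three cases from the text. The real library derives this information from the hardware it detects at initialisation rather than from hand-written fields like these.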

📡 Analogy — Administrative Distance

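NCCL's transport preference works like Cisco's Administrative Distance. NVLink is the directly connected route (AD 0): always preferred. PCIe plays the role of OSPF (AD 110): the interior fallback when no direct path is available. InfiniBand plays the role of eBGP: the protocol that carries traffic between domains, which here means between nodes. The mapping follows roles, not literal values: on a real router, eBGP's default AD of 20 would beat OSPF's 110, whereas NCCL ranks InfiniBand last. The library maintains its own routing table for GPU-to-GPU reachability and picks the lowest-distance path available.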

And across the fabric — the fat-tree

Most large GPU clusters run a fat-tree / Clos topology at the InfiniBand or Ethernet layer. NCCL knows this and shapes its ring construction to minimise cross-bisection traffic.

In a typical fat-tree deployment, the leaf-to-spine uplinks are the oversubscribed links; the edge links within a rack are not.
NCCL tries to form rings that communicate as much as possible within a rack (one leaf-switch domain) before crossing to the spine.
That's traffic engineering: prefer the low-latency, high-bandwidth intra-PoP path; touch the inter-PoP (spine) path only when necessary.
NCCL_TOPO_FILE lets you hand NCCL a custom topology description — like overriding IGP defaults with an explicit routing policy.
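To see why rack-local ordering matters, it helps to count how many ring edges cross between racks under two different rank orderings. The sketch below is an illustration of the idea only, not NCCL's ring-construction algorithm: the `RACK_OF` placement, the rack names and both helper functions are hypothetical, and `NCCL_TOPO_FILE` is not modelled.

```python
# Hypothetical placement: rank -> rack (one leaf-switch domain per rack).
# Ranks are deliberately interleaved across the two racks.
RACK_OF = {
    0: "rack-a", 1: "rack-b", 2: "rack-a", 3: "rack-b",
    4: "rack-a", 5: "rack-b", 6: "rack-a", 7: "rack-b",
}


def cross_rack_hops(ring, rack_of):
    """Count ring edges, including the wrap-around edge, that join two racks."""
    return sum(
        rack_of[ring[i]] != rack_of[ring[(i + 1) % len(ring)]]
        for i in range(len(ring))
    )


def rack_local_ring(rack_of):
    """Order ranks so that all members of a rack sit next to each other."""
    return sorted(rack_of, key=lambda rank: (rack_of[rank], rank))


naive = sorted(RACK_OF)           # plain rank order: 0, 1, 2, ...
local = rack_local_ring(RACK_OF)  # grouped by rack, then by rank

print("naive ring:", naive, "cross-rack hops:", cross_rack_hops(naive, RACK_OF))
print("local ring:", local, "cross-rack hops:", cross_rack_hops(local, RACK_OF))
```

With this placement, every edge of the naive ring crosses the spine (8 hops), while the rack-grouped ring crosses it only twice: once to leave the first rack and once to return. A ring over two racks cannot do better than two crossings, since it has to leave a rack and come back.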

The fabric architecture section covers the Clos itself; this is just how the library routes over it.

💡 What you should remember

#		Concept	Why it matters
1	🏗️	NVLink > PCIe > InfiniBand, by bandwidth	900 GB/s vs 128 GB/s vs 200–400 Gb/s (25–50 GB/s per port); NCCL prefers the fastest path available
2	📏	Transport preference = Administrative Distance	NCCL keeps a routing table for GPU reachability
3	🌲	Rings are built rack-local first	Keeps load off the oversubscribed spine; `NCCL_TOPO_FILE` overrides it

Next: Parallelism Strategies and Their Collectives → — data, tensor, pipeline, expert, sequence and ZeRO parallelism, and the specific collective each one puts on your wire.
