Topology Awareness

A production communication library doesn't treat every link the same. It builds a routing table for GPU-to-GPU reachability — and uses it.

After this page, you'll be able to
  1. Rank the interconnect hierarchy: NVLink (900 GB/s) > PCIe Gen5 ×16 (128 GB/s) > InfiniBand (200–400 Gb/s, or 25–50 GB/s per port), and explain how NCCL chooses between them.
  2. Map NCCL's preference to Administrative Distance — connected vs OSPF vs eBGP.
  3. Explain fat-tree-aware rings — why NCCL keeps traffic rack-local, and what NCCL_TOPO_FILE overrides.

The hierarchy inside one server

Inside a single server — a DGX H100, say — GPU interconnects exist at several layers:

Figure: interconnect tiers, bar length ∝ log(bandwidth). NCCL always prefers the fastest available.
  • NVLink 4 · same server, GPU↔GPU · connected route, AD 0 · 900 GB/s
  • PCIe Gen5 ×16 · same host, across NVLink domains · OSPF, AD 110 · 128 GB/s
  • InfiniBand NDR · cross-host · eBGP · 400 Gb/s
  • 100 GbE, TCP · fallback, management · last resort · 100 Gb/s
NVLink ≫ PCIe > InfiniBand > Ethernet. NCCL treats this like administrative distance and routes GPU-to-GPU on the lowest-distance path.
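The ranking mixes gigabytes per second with gigabits per second, so it helps to put every tier on one scale. The sketch below does that using only the headline figures quoted on this page; the function and dictionary names are illustrative, and the numbers are nominal link rates, not measured throughput.

```python
def to_gbytes_per_s(value, unit):
    """Convert a headline rate to GB/s (8 bits per byte, no protocol overhead)."""
    if unit == "GB/s":
        return float(value)
    if unit == "Gb/s":
        return value / 8.0
    raise ValueError(f"unknown unit: {unit}")


# Headline figures from this page; network links are quoted in gigabits per second.
LINKS = {
    "NVLink": (900, "GB/s"),
    "PCIe Gen5 x16": (128, "GB/s"),
    "InfiniBand NDR": (400, "Gb/s"),
    "100 GbE": (100, "Gb/s"),
}

normalised = {name: to_gbytes_per_s(v, u) for name, (v, u) in LINKS.items()}
ranking = sorted(normalised, key=normalised.get, reverse=True)

for name in ranking:
    print(f"{name:15s} {normalised[name]:7.1f} GB/s")
```

On a common scale, a 400 Gb/s InfiniBand port is 50 GB/s, so the NVLink figure is 18 times larger, and roughly 7 times the PCIe figure.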

NCCL detects this hierarchy at initialisation. It always prefers NVLink for intra-node traffic, routes through PCIe when two GPUs sit in different NVLink groups on the same host, and uses InfiniBand only when it has to cross hosts.
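The selection rule described here can be sketched as a lowest-rank lookup. This is a toy model, not NCCL's implementation: the `Gpu` fields, the `PREFERENCE` ranks, and both function names are assumptions made up for illustration.

```python
from dataclasses import dataclass

# Lower rank = more preferred, mirroring the order described in the text.
# These ranks are illustrative; they are not values NCCL exposes.
PREFERENCE = {"NVLink": 0, "PCIe": 1, "InfiniBand": 2}


@dataclass(frozen=True)
class Gpu:
    host: str           # hypothetical host name
    nvlink_domain: int  # hypothetical NVLink group id within the host


def candidate_transports(a, b):
    """List every transport that can connect two GPUs in this toy model."""
    if a.host != b.host:
        return ["InfiniBand"]          # crossing hosts: the fabric is the only option
    transports = ["PCIe"]              # same host: PCIe is always reachable
    if a.nvlink_domain == b.nvlink_domain:
        transports.append("NVLink")    # same NVLink group: direct link exists
    return transports


def pick_transport(a, b):
    """Choose the reachable transport with the lowest preference rank."""
    return min(candidate_transports(a, b), key=PREFERENCE.__getitem__)


print(pick_transport(Gpu("host0", 0), Gpu("host0", 0)))
print(pick_transport(Gpu("host0", 0), Gpu("host0", 1)))
print(pick_transport(Gpu("host0", 0), Gpu("host1", 0)))
```

The three calls cover the three cases in the paragraph: same NVLink group, different NVLink groups on one host, and different hosts.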

📡 Analogy — Administrative Distance

NCCL's transport preference works like Cisco's Administrative Distance. NVLink is a directly connected route (AD 0): always preferred. PCIe is OSPF (AD 110): used when direct isn't available. InfiniBand plays the role of eBGP, the path you take to leave your own domain: used for inter-node traffic. The mapping is about ordering, not literal values: on a real Cisco router eBGP has AD 20 and would win over OSPF. The library maintains its own routing table for GPU-to-GPU reachability and picks the lowest-distance path every time.


And across the fabric — the fat-tree

Most large GPU clusters run a fat-tree / Clos topology at the InfiniBand or Ethernet layer. NCCL knows this and shapes its ring construction to minimise cross-bisection traffic.

  • In a fat-tree as typically deployed, the leaf-to-spine uplinks are more oversubscribed than the edge links inside a rack.
  • NCCL tries to form rings that communicate as much as possible within a rack (one leaf-switch domain) before crossing to the spine.
  • That's traffic engineering: prefer the low-latency, high-bandwidth intra-PoP path; touch the inter-PoP (spine) path only when necessary.
  • NCCL_TOPO_FILE lets you hand NCCL a custom topology description — like overriding IGP defaults with an explicit routing policy.
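To see why ring order matters, count how many ring hops cross a rack boundary. The sketch below compares two orderings of the same eight nodes; the node names, rack names, and the `cross_rack_hops` helper are hypothetical, and the rack-sorted ordering is a simple stand-in for rack-local ring construction, not NCCL's actual algorithm.

```python
def cross_rack_hops(ring, rack_of):
    """Count ring edges, including the wrap-around, whose endpoints sit in different racks."""
    n = len(ring)
    return sum(rack_of[ring[i]] != rack_of[ring[(i + 1) % n]] for i in range(n))


# Hypothetical cluster: 8 nodes, 4 per rack (one leaf switch per rack).
rack_of = {f"node{i}": ("rackA" if i < 4 else "rackB") for i in range(8)}

# A ring that alternates racks on every hop.
interleaved = ["node0", "node4", "node1", "node5", "node2", "node6", "node3", "node7"]

# A ring that visits one rack completely before moving to the next.
rack_local = sorted(rack_of, key=lambda name: (rack_of[name], name))

print("interleaved:", cross_rack_hops(interleaved, rack_of))
print("rack-local: ", cross_rack_hops(rack_local, rack_of))
```

The interleaved ring sends all 8 hops over the spine; the rack-local ring crosses it only twice, once in each direction, which is the minimum for a ring spanning two racks.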

The fabric architecture section covers the Clos itself; this is just how the library routes over it.
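NCCL reads NCCL_TOPO_FILE from the environment, so it has to be set before the process initialises NCCL. A minimal sketch of doing that from Python follows; the helper name and the file path are hypothetical, and NCCL_TOPO_DUMP_FILE, NCCL_DEBUG, and NCCL_DEBUG_SUBSYS are added here on the assumption that you want to inspect what NCCL detected.

```python
import os


def nccl_topology_env(topo_file=None, dump_file=None):
    """Build environment overrides for NCCL topology handling.

    Apply these before the process initialises NCCL; later changes have no effect.
    """
    # Log initialisation and graph-search decisions so the chosen paths are visible.
    env = {"NCCL_DEBUG": "INFO", "NCCL_DEBUG_SUBSYS": "INIT,GRAPH"}
    if topo_file is not None:
        env["NCCL_TOPO_FILE"] = topo_file       # XML topology to load
    if dump_file is not None:
        env["NCCL_TOPO_DUMP_FILE"] = dump_file  # where to write the detected topology
    return env


overrides = nccl_topology_env(topo_file="/etc/nccl/topo.xml")  # hypothetical path
os.environ.update(overrides)
```

Dumping the detected topology first and editing that file is usually a safer starting point than writing one from scratch.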


💡 What you should remember

# | Concept | Why it matters
1 | 🏗️ NVLink > PCIe > InfiniBand, by bandwidth | 900 GB/s vs 128 GB/s vs 200–400 Gb/s; NCCL always takes the fastest
2 | 📏 Transport preference = Administrative Distance | NCCL keeps a routing table for GPU reachability
3 | 🌲 Rings are built rack-local first | Keeps load off the oversubscribed spine; NCCL_TOPO_FILE overrides it

Next: Parallelism Strategies and Their Collectives. Data, tensor, pipeline, expert, sequence and ZeRO parallelism, and the specific collective each one puts on your wire.