Topology Awareness
A production communication library doesn't treat every link the same. It builds a routing table for GPU-to-GPU reachability — and uses it.
- Rank the interconnect hierarchy, NVLink (900 GB/s) > PCIe Gen5 (128 GB/s) > InfiniBand (200–400 Gb/s, or 25–50 GB/s), and explain how NCCL chooses.
- Map NCCL's preference to Administrative Distance — connected vs OSPF vs eBGP.
- Explain fat-tree-aware rings: why NCCL keeps traffic rack-local, and what NCCL_TOPO_FILE overrides.
The hierarchy inside one server
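Inside a single server (a DGX H100, say) GPU interconnects exist at several layers: NVLink between GPUs, PCIe Gen5 between GPUs and the rest of the host, and InfiniBand out to other hosts.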
NCCL detects this hierarchy at initialisation. It always prefers NVLink for intra-node traffic, routes through PCIe when two GPUs sit in different NVLink groups on the same host, and uses InfiniBand only when it has to cross hosts.
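The headline figures for these three tiers mix units (GB/s for NVLink and PCIe, Gb/s for InfiniBand), which makes the gap easy to misjudge. The short sketch below puts them on one scale. The numbers are the nominal ones quoted in this section, not measurements, and the dictionary layout is just an illustrative choice.

```python
# Nominal link rates as quoted in this section; achieved throughput is lower.
LINKS = {
    "NVLink": (900, "GB/s"),
    "PCIe Gen5": (128, "GB/s"),
    "InfiniBand (low)": (200, "Gb/s"),
    "InfiniBand (high)": (400, "Gb/s"),
}


def to_gigabytes_per_s(value, unit):
    """Convert a nominal link rate to GB/s (8 bits per byte)."""
    if unit == "GB/s":
        return float(value)
    if unit == "Gb/s":
        return value / 8
    raise ValueError(f"unknown unit: {unit}")


# Every link expressed in GB/s, so the tiers can be compared directly.
normalised = {name: to_gigabytes_per_s(v, u) for name, (v, u) in LINKS.items()}

# Print fastest first.
for name, rate in sorted(normalised.items(), key=lambda kv: -kv[1]):
    print(f"{name:18s} {rate:6.1f} GB/s")
```

On these nominal figures a 400 Gb/s InfiniBand port carries 50 GB/s, which is 1/18 of the NVLink number. That gap is why staying on NVLink whenever possible matters so much.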
NCCL's transport preference works like Cisco's Administrative Distance. NVLink is the directly connected route (AD 0): always preferred. PCIe is OSPF (AD 110): used when a direct path isn't available. InfiniBand plays the role of eBGP: the path for leaving the local domain, used for inter-node traffic. The mapping is by role rather than by number, since eBGP's real AD of 20 would outrank OSPF's 110. The library maintains its own routing table for GPU-to-GPU reachability and picks the lowest-distance path every time.
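The preference order described above can be sketched as a lowest-distance lookup. This is a toy model, not NCCL's implementation: the `Gpu` class, the `nvlink_group` field and the numeric preference values are all assumptions made up for illustration.

```python
from dataclasses import dataclass

# Illustrative preference values only: lower wins, like a route's distance.
# These numbers are assumptions for the sketch, not NCCL internals.
PREFERENCE = {"NVLink": 0, "PCIe": 1, "InfiniBand": 2}


@dataclass(frozen=True)
class Gpu:
    host: str
    nvlink_group: int  # hypothetical: GPUs sharing an NVLink domain on a host


def candidate_transports(a, b):
    """List every transport that could carry traffic between two GPUs."""
    if a.host != b.host:
        return ["InfiniBand"]       # crossing hosts: the fabric is the only option
    candidates = ["PCIe"]           # same host: PCIe is always reachable
    if a.nvlink_group == b.nvlink_group:
        candidates.append("NVLink")  # same NVLink domain: direct link exists
    return candidates


def pick_transport(a, b):
    """Pick the candidate with the lowest preference value."""
    return min(candidate_transports(a, b), key=PREFERENCE.__getitem__)


gpus = {
    "g0": Gpu(host="node1", nvlink_group=0),
    "g1": Gpu(host="node1", nvlink_group=0),
    "g2": Gpu(host="node1", nvlink_group=1),
    "g3": Gpu(host="node2", nvlink_group=0),
}

# The "routing table": one chosen transport per ordered GPU pair.
table = {
    (src, dst): pick_transport(gpus[src], gpus[dst])
    for src in gpus
    for dst in gpus
    if src != dst
}

for pair, transport in sorted(table.items()):
    print(pair, "->", transport)
```

The point of the sketch is the shape of the decision: enumerate what is reachable, then take the minimum by a fixed preference, exactly as a router resolves competing route sources.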
And across the fabric — the fat-tree
Most large GPU clusters run a fat-tree / Clos topology at the InfiniBand or Ethernet layer. NCCL knows this and shapes its ring construction to minimise cross-bisection traffic.
- In a fat-tree as typically deployed, the leaf-to-spine links are more oversubscribed than the edge links inside a rack.
- NCCL tries to form rings that communicate as much as possible within a rack (one leaf-switch domain) before crossing to the spine.
- That's traffic engineering: prefer the low-latency, high-bandwidth intra-PoP path; touch the inter-PoP (spine) path only when necessary.
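To see why rack-local ordering matters, it helps to count how many ring edges cross a rack boundary. The sketch below compares a naive rank-order ring with one that groups ranks by rack. The rack assignment and both helper functions are hypothetical, and the sort is a simple stand-in for the idea, not NCCL's actual ring-building algorithm.

```python
def cross_rack_hops(ring, rack_of):
    """Count ring edges (including the wrap-around) that cross racks."""
    n = len(ring)
    return sum(
        rack_of[ring[i]] != rack_of[ring[(i + 1) % n]]
        for i in range(n)
    )


def rack_local_ring(ranks, rack_of):
    """Order ranks so that members of the same rack sit next to each other."""
    return sorted(ranks, key=lambda r: (rack_of[r], r))


# Hypothetical placement: the scheduler interleaved ranks across two racks.
rack_of = {0: "A", 1: "B", 2: "A", 3: "B", 4: "A", 5: "B", 6: "A", 7: "B"}

naive = list(range(8))                  # ring in plain rank order
local = rack_local_ring(naive, rack_of)  # ring grouped by rack

print("naive ring:", naive, "cross-rack hops:", cross_rack_hops(naive, rack_of))
print("local ring:", local, "cross-rack hops:", cross_rack_hops(local, rack_of))
```

With this placement the naive ring crosses the spine on all 8 edges, while the grouped ring crosses it only twice: once leaving rack A and once returning. Two crossings is the minimum for any ring that spans two racks.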
NCCL_TOPO_FILE lets you hand NCCL a custom topology description, like overriding IGP defaults with an explicit routing policy.
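In practice the override is just an environment variable that must be set before NCCL initialises. A minimal sketch follows, assuming a Python launcher. The helper `nccl_topology_env` and both file paths are hypothetical. `NCCL_TOPO_DUMP_FILE` is a companion NCCL variable that writes out the topology NCCL ends up with, which is useful for checking that the override took effect.

```python
import os


def nccl_topology_env(topo_file, dump_file=None):
    """Build the environment variables for a custom-topology run.

    topo_file: path to an XML topology description for NCCL to load.
    dump_file: optional path where NCCL should write the resulting topology.
    """
    env = {"NCCL_TOPO_FILE": topo_file}
    if dump_file is not None:
        env["NCCL_TOPO_DUMP_FILE"] = dump_file
    return env


# Hypothetical paths, shown for illustration only.
# This must run before the process creates its first NCCL communicator.
os.environ.update(
    nccl_topology_env(
        "/etc/nccl/topo.xml",
        dump_file="/tmp/nccl_topo_detected.xml",
    )
)

print(os.environ["NCCL_TOPO_FILE"])
```

Comparing the dumped file against the one supplied is a quick way to confirm which description NCCL actually used.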
The fabric architecture section covers the Clos itself; this is just how the library routes over it.
💡 What you should remember
| # | | Concept | Why it matters |
|---|---|---|---|
| 1 | 🏗️ | NVLink > PCIe > InfiniBand, by bandwidth | 900 GB/s vs 128 GB/s vs 200–400 Gb/s (25–50 GB/s); NCCL takes the fastest available path |
| 2 | 📏 | Transport preference = Administrative Distance | NCCL keeps a routing table for GPU reachability |
| 3 | 🌲 | Rings are built rack-local first | Keeps load off the oversubscribed spine; NCCL_TOPO_FILE overrides it |
Next: Parallelism Strategies and Their Collectives: data, tensor, pipeline, expert, sequence and ZeRO parallelism, and the specific collective each one puts on your wire.