Network topology for kilo-GPU training — from Fat-Tree to Dragonfly+

Why does AllReduce tail latency scale non-linearly with cluster size? With a topology-aware NCCL configuration followed by a move to Dragonfly+, we lifted a 1024-GPU cluster from 71% to 89% MFU.

Problem

A customer runs full pretraining of a 100B-parameter model on 1024 H800A GPUs using Megatron-LM with DP=8 / TP=4 / PP=32 (8 × 4 × 32 = 1024 ranks). Initial MFU was just 71%, well below the advertised ~85%. The profiler showed 25% of cycles spent in AllReduce, with P99 latency at 4× the median.

First triage

The cluster is a classical 1:1 Fat-Tree, non-blocking on paper. But NCCL's default topology discovery spread the 32-rank pipeline group across multiple spines, so AllReduce traffic crossed spines repeatedly and ran into tail latency.
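To see how a default rank layout can scatter a group across switches, the following Python sketch enumerates the TP, DP and PP groups of a rank and counts the leaf switches each one touches. It assumes a Megatron-style ordering with TP varying fastest and PP slowest, and a hypothetical 32 GPUs per leaf (4 nodes of 8 GPUs). The leaf size and the helper names are illustrative and do not come from the cluster described here.

```python
from typing import List, Tuple

TP, DP, PP = 4, 8, 32      # parallelism degrees from the article
WORLD = TP * DP * PP       # 1024 ranks
GPUS_PER_LEAF = 32         # ASSUMPTION: 4 nodes x 8 GPUs behind each leaf switch


def coords(rank: int) -> Tuple[int, int, int]:
    """Split a global rank into (tp, dp, pp) indices, TP varying fastest."""
    return rank % TP, (rank // TP) % DP, rank // (TP * DP)


def group_of(rank: int, axis: str) -> List[int]:
    """Return all ranks that share a TP, DP or PP group with `rank`."""
    tp, dp, pp = coords(rank)
    if axis == "tp":
        return [t + TP * dp + TP * DP * pp for t in range(TP)]
    if axis == "dp":
        return [tp + TP * d + TP * DP * pp for d in range(DP)]
    if axis == "pp":
        return [tp + TP * dp + TP * DP * p for p in range(PP)]
    raise ValueError(f"unknown axis: {axis!r}")


def leaves_spanned(ranks: List[int]) -> int:
    """Count distinct leaf switches, assuming ranks fill leaves in order."""
    return len({r // GPUS_PER_LEAF for r in ranks})


if __name__ == "__main__":
    for axis in ("tp", "dp", "pp"):
        print(axis, leaves_spanned(group_of(0, axis)))
```

Under these assumptions the pipeline group of rank 0 lands on a different leaf for every stage, while its TP and DP groups stay on one leaf. This is the kind of spread a topology-aware placement is meant to avoid.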

Lever 1 — topology-aware NCCL

We handcrafted an NCCL_TOPO_FILE that maps DP / TP / PP onto leaf / spine / super-spine. NCCL then picks rings that stay within a spine for intra-group traffic and crosses spines only for inter-group traffic. AllReduce latency dropped to 60% of baseline.
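A minimal sketch of how a launcher might pass such a file to NCCL. NCCL_TOPO_FILE, NCCL_DEBUG and NCCL_DEBUG_SUBSYS are real NCCL environment variables; the helper name build_nccl_env, the example path and the choice to enable graph logging are assumptions made for illustration, not the configuration used on this cluster.

```python
import os
from typing import Dict, Optional


def build_nccl_env(topo_path: str,
                   base_env: Optional[Dict[str, str]] = None,
                   check_exists: bool = True) -> Dict[str, str]:
    """Return a copy of the environment with NCCL pointed at a topology file."""
    path = os.path.abspath(topo_path)
    if check_exists and not os.path.isfile(path):
        raise FileNotFoundError(f"NCCL topology file not found: {path}")
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_TOPO_FILE"] = path         # XML topology description read by NCCL
    env["NCCL_DEBUG"] = "INFO"           # log what NCCL decides at init time
    env["NCCL_DEBUG_SUBSYS"] = "GRAPH"   # restrict logs to topology/graph search
    return env


if __name__ == "__main__":
    # Hypothetical path; existence check skipped so the demo runs anywhere.
    env = build_nccl_env("/hypothetical/topo.xml", base_env={}, check_exists=False)
    for key in sorted(env):
        print(f"{key}={env[key]}")
```

The returned dictionary can be handed to the training launcher, for example through the env argument of subprocess.run, so the setting applies to every rank on the node. Checking the graph logs after startup is a reasonable way to confirm that NCCL actually read the file.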

MFU: 71% → 82%
Tail ratio P99/P50: 4× → 1.6×
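
The P99/P50 tail ratio can be computed from per-iteration AllReduce timings. The sketch below uses the nearest-rank percentile definition on a plain list of latencies; the function names and the synthetic sample are illustrative and are not measurements from this cluster.

```python
from typing import Sequence


def percentile(samples: Sequence[float], p: int) -> float:
    """Nearest-rank percentile for an integer p in 1..100."""
    if not samples:
        raise ValueError("no samples")
    if not 1 <= p <= 100:
        raise ValueError("p must be in 1..100")
    ordered = sorted(samples)
    # ceil(p * n / 100) in integer arithmetic, converted to a 0-based index
    rank = (p * len(ordered) + 99) // 100
    return ordered[rank - 1]


def tail_ratio(samples: Sequence[float]) -> float:
    """P99 divided by P50; 1.0 means there is no tail at all."""
    return percentile(samples, 99) / percentile(samples, 50)


if __name__ == "__main__":
    # Synthetic latencies in ms: 98 fast iterations and 2 slow ones.
    latencies = [1.0] * 98 + [4.0] * 2
    print(f"P99/P50 = {tail_ratio(latencies):.2f}")
```

Tracking this ratio alongside the median helps separate a uniformly slow collective from one that is fast on average but occasionally stalls.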

Lever 2 — Dragonfly+

Pushing further to 89% needed a structural change. Fat-Tree (even multi-stage) is cabling-prohibitive past 4096 GPUs. Alaya's new IB cluster is Dragonfly+: 24 fully-connected leaves per group, with high-radix all-to-all between groups. A 32-rank pipeline now lives entirely inside one Dragonfly group:

MFU: 82% → 89%
Cable count down 40%; tangible facility savings

Takeaways

Beyond a thousand GPUs, network topology is a first-order training-efficiency lever. Before launching a training job:

Benchmark AllReduce tail; track P99/P50 ratio
Pin NCCL with an explicit NCCL_TOPO_FILE
At capacity-planning time, evaluate Dragonfly+ — don't scale Fat-Tree to its limit

Alaya HyperTrain defaults to topology-aware scheduling — same-pipeline-group ranks land on the same leaf with no operator effort.