Network topology for kilo-GPU training — from Fat-Tree to Dragonfly+
Why does AllReduce tail latency scale non-linearly with cluster size? With a topology-aware NCCL configuration followed by a move to Dragonfly+, we lifted a 1024-GPU cluster from 71% to 89% MFU.
Problem
A customer runs full pretraining of a 100B-parameter model on 1024 H100 GPUs using Megatron-LM with DP=8 / TP=4 / PP=32 (8 × 4 × 32 = 1024 ranks). Initial MFU was just 71%, well below the advertised ~85%. The profiler showed 25% of cycles spent in AllReduce, with P99 tail latency 4× the median.
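To make the parallel layout concrete, the sketch below enumerates which global ranks share a pipeline group for DP=8 / TP=4 / PP=32. It assumes a rank ordering where TP varies fastest, then DP, then PP, and 8 GPUs per node. Both are assumptions for illustration rather than details stated here, and the helper names are hypothetical.

```python
# Assumed layout (not stated in the post): TP varies fastest, then DP, then PP.
TP, DP, PP = 4, 8, 32
GPUS_PER_NODE = 8  # assumption: 8 GPUs per node
WORLD = TP * DP * PP


def coords(rank):
    """Return (pp, dp, tp) coordinates of a global rank under the assumed ordering."""
    tp = rank % TP
    dp = (rank // TP) % DP
    pp = rank // (TP * DP)
    return pp, dp, tp


def pipeline_group(rank):
    """Ranks sharing this rank's (dp, tp) coordinates, one per pipeline stage."""
    _, dp, tp = coords(rank)
    return [stage * TP * DP + dp * TP + tp for stage in range(PP)]


def nodes_spanned(ranks):
    """Number of distinct nodes hosting the given ranks."""
    return len({r // GPUS_PER_NODE for r in ranks})


group = pipeline_group(0)
print(WORLD, group[:4], nodes_spanned(group))
# prints: 1024 [0, 32, 64, 96] 32
```

Under these assumptions every 32-rank pipeline group touches 32 different nodes, so where those nodes sit in the fabric largely determines how much of the group's traffic crosses spines.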
First triage
The cluster is a classical 1:1 Fat-Tree, non-blocking on paper. But NCCL's default topology discovery spread the 32-rank pipeline group across spine boundaries, so AllReduce traffic crossed spines repeatedly and picked up tail latency.
Lever 1 — topology-aware NCCL
A handcrafted NCCL_TOPO_FILE mapped DP / TP / PP to leaf / spine / super-spine. With it, NCCL picks rings that stay within a spine for intra-group traffic and crosses spines only for inter-group traffic. AllReduce latency dropped to 60% of baseline.
- MFU: 71% → 82%
- Tail ratio P99/P50: 4× → 1.6×
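As a rough sketch of how such a file might be wired into a launch, the helper below builds an environment with NCCL_TOPO_FILE set, plus NCCL_DEBUG and NCCL_DEBUG_SUBSYS so the chosen graph shows up in the logs. The function name and the file path are hypothetical, and the contents of the topology XML itself are not shown because they are specific to the cluster.

```python
import os


def nccl_env(topo_file, base=None):
    """Return a copy of the environment with NCCL topology settings applied."""
    env = dict(os.environ if base is None else base)
    # XML topology description that NCCL loads before running its own detection
    env["NCCL_TOPO_FILE"] = topo_file
    # Log initialization and graph-search decisions so ring placement can be checked
    env["NCCL_DEBUG"] = "INFO"
    env["NCCL_DEBUG_SUBSYS"] = "INIT,GRAPH"
    return env


# Hypothetical path; pass the result as env= to whatever launches the training ranks
env = nccl_env("/etc/nccl/topo.xml", base={})
print(env["NCCL_TOPO_FILE"])
```

The variables have to be set in every rank's environment before the first communicator is created, since NCCL reads them at initialization.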
Lever 2 — Dragonfly+
Pushing further to 89% needed a structural change. Fat-Tree (even multi-stage) is cabling-prohibitive past 4096 GPUs. Alaya's new IB cluster is Dragonfly+: 24 fully-connected leaves per group, with high-radix all-to-all between groups. A 32-rank pipeline now lives entirely inside one Dragonfly group:
- MFU: 82% → 89%
- Cable count down 40%; tangible facility savings
Takeaways
Beyond a thousand GPUs, network topology is a first-order training-efficiency lever. Before launching a training job:
- Benchmark AllReduce tail; track P99/P50 ratio
- Pin NCCL with an explicit NCCL_TOPO_FILE
- At capacity-planning time, evaluate Dragonfly+; don't scale Fat-Tree to its limit
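The tail ratio in the first item can be computed from per-iteration AllReduce timings with a few lines. This is a minimal sketch using a nearest-rank percentile; the latency samples are synthetic and only illustrate the arithmetic, and the function names are hypothetical.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile: smallest value with at least q% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    k = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[k - 1]


def tail_ratio(samples):
    """P99 / P50 ratio of a list of latency samples."""
    return percentile(samples, 99) / percentile(samples, 50)


# Synthetic latencies in microseconds (made up for illustration): 98 fast, 2 slow
latencies_us = [100.0] * 98 + [400.0] * 2
print(tail_ratio(latencies_us))
# prints: 4.0
```

Because a synchronous collective finishes only when its slowest participant does, the ratio says more about step time than the mean does, so it is worth tracking across runs and cluster sizes.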
Alaya HyperTrain defaults to topology-aware scheduling — same-pipeline-group ranks land on the same leaf with no operator effort.
