NCCL-tests in practice — the mandatory pre-launch health check
A new cluster comes up. Don't run training yet. One sweep of NCCL-tests tells you whether it can host LLMs at all — here is how to read the busbw.
Why this can't be skipped
Stat from our customer base: clusters that skip NCCL-tests at handover hit a throughput regression in the first week 75% of the time. The cause is rarely a bad GPU — it's mis-detected topology, wrong IB HCA selection, GDR not enabled, bad PCIe affinity. These take days to diagnose from a training job, and 5 minutes from NCCL-tests.
1. Build
git clone https://github.com/NVIDIA/nccl-tests.git && cd nccl-tests
make MPI=1 \
MPI_HOME=/usr/local/openmpi \
NCCL_HOME=/usr/local/nccl \
CUDA_HOME=/usr/local/cudaWith NVIDIA HPCX, paths live under $HPCX_DIR:
make MPI=1 MPI_HOME=$HPCX_MPI_DIR NCCL_HOME=$HPCX_NCCL_DIROutput binaries land in build/: all_reduce_perf, all_gather_perf, reduce_scatter_perf, broadcast_perf. For training, 90% of your usage is all_reduce_perf.
2. Single-node (validate NVLink)
8×H100 single node:
./build/all_reduce_perf -b 8 -e 8G -f 2 -g 8Args: -b 8 start size 8 B, -e 8G end at 8 GB, -f 2 doubling step, -g 8 use 8 GPUs.
Read the busbw column — not algbw. Only busbw reflects actual physical-link utilization.
Reference numbers:
| Configuration | busbw at large sizes (GB/s) | Notes |
|---|---|---|
| 8×H100 SXM5 NVLink-only | 460–480 | Full-mesh NVLink-4, theoretical 450 GB/s unidirectional |
| 8×A100 SXM4 NVLink | 230–250 | NVLink-3 |
| 8×H100 PCIe (no NVSwitch) | 35–45 | PCIe Gen5 limited |
| 4×H100 PCIe + NVLink Bridge | 200–230 | Bridge is pairwise only |
If busbw is more than 20% below this, something is off — common cause is a stale NCCL_P2P_DISABLE=1 env var or a downed NVLink lane. Check nvidia-smi nvlink -s first.
3. Multi-node (validate IB)
2 nodes, 16 GPUs, hosts node1 and node2:
mpirun -np 16 -H node1:8,node2:8 \
--bind-to numa \
-x NCCL_DEBUG=INFO \
-x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7 \
-x NCCL_IB_GID_INDEX=3 \
-x NCCL_SOCKET_IFNAME=ib0 \
-x LD_LIBRARY_PATH \
./build/all_reduce_perf -b 8 -e 8G -f 2 -g 1Reference (H100 + 8× NDR-400Gbps HCAs + GDR):
| Nodes | busbw at large sizes (GB/s) |
|---|---|
| 2 | 350–380 |
| 4 | 320–360 |
| 8 | 280–320 |
| 16 | 240–280 |
Falloff with node count is expected — more hops, more convergence. But 2 nodes below 200 GB/s = something to debug.
4. Triage when busbw is low
Re-run with NCCL_DEBUG=INFO. Walk through:
- GDR enabled? Look for
[send] via NET/IB/X/GDRDMAin the log. NoGDRDMAsubstring → GDR isn't taking effect; checklsmod | grep nvidia_peermem. - Right HCAs selected?
NCCL INFO NET/IB : Using [0]mlx5_0 ...should list as many NICs as you physically have. If fewer,NCCL_IB_HCAis wrong or the NIC is on a different NUMA. - Correct GID index? RoCEv2 ~ 3, pure IB ~ 0. Wrong value: connects but is silently slow.
- PXN on? Should see
NCCL INFO PXN ENABLED. Without it, intra-NUMA cross-GPU traffic crosses host memory — costs ~30%. - Algorithm: ring vs tree? NCCL auto-picks (tree for small sizes, ring for large). Hard-coding
NCCL_ALGO=Ringregresses small-size perf.
5. Minimum acceptance script for handover
Three commands we ship to customers, each with a pass line:
# 1) Intra-node NVLink — busbw @ 8GB must be >= 460 GB/s (H100)
./build/all_reduce_perf -b 1G -e 8G -f 2 -g 8
# 2) Node pair IB — busbw @ 8GB must be >= 350 GB/s (2 nodes, NDR)
mpirun -np 16 -H n1:8,n2:8 ... ./build/all_reduce_perf -b 1G -e 8G -f 2 -g 1
# 3) Full-cluster ring — falloff vs (1) must be within 25%
mpirun -np <total> -H ... ./build/all_reduce_perf -b 1G -e 8G -f 2 -g 1All three pass → cleared for training. Any miss → fix before bringing a real job up; debugging from inside a 70B finetune is much harder.
Last updated on
InfiniBand Fabric for GPU Clusters — bandwidth, rail-aligned topology, and triage
Why LLM clusters insist on InfiniBand, how rail-aligned saves 30% training time, and the ibstat / ibdev2netdev / ib_write_bw routine that fixes most field issues.
Faster HuggingFace model downloads — practical playbook
A 70B checkpoint is 140 GB. Direct from huggingface.co takes 4+ hours; over hf-mirror at line-rate 1 GbE it's 22 minutes. A working set of options for 2026.
