NCCL-tests in practice — the mandatory pre-launch health check

A new cluster comes up. Don't run training yet. One sweep of NCCL-tests tells you whether it can host LLMs at all — here is how to read the busbw.

Why this can't be skipped

Stat from our customer base: clusters that skip NCCL-tests at handover hit a throughput regression in the first week 75% of the time. The cause is rarely a bad GPU — it's mis-detected topology, wrong IB HCA selection, GDR not enabled, bad PCIe affinity. These take days to diagnose from a training job, and 5 minutes from NCCL-tests.

1. Build

git clone https://github.com/NVIDIA/nccl-tests.git && cd nccl-tests
make MPI=1 \
     MPI_HOME=/usr/local/openmpi \
     NCCL_HOME=/usr/local/nccl \
     CUDA_HOME=/usr/local/cuda

With NVIDIA HPCX, paths live under $HPCX_DIR:

make MPI=1 MPI_HOME=$HPCX_MPI_DIR NCCL_HOME=$HPCX_NCCL_DIR

Output binaries land in build/: all_reduce_perf, all_gather_perf, reduce_scatter_perf, broadcast_perf. For training, 90% of your usage is all_reduce_perf.

2. Single-node (validate NVLink)

8×H800A single node:

./build/all_reduce_perf -b 8 -e 8G -f 2 -g 8

Args: -b 8 start size 8 B, -e 8G end at 8 GB, -f 2 doubling step, -g 8 use 8 GPUs.

Read the busbw column — not algbw. Only busbw reflects actual physical-link utilization.

Reference numbers:

Configuration	busbw at large sizes (GB/s)	Notes
8×H800A SXM5 NVLink-only	460–480	Full-mesh NVLink-4, theoretical 450 GB/s unidirectional
8×A100 SXM4 NVLink	230–250	NVLink-3
8×H800A PCIe (no NVSwitch)	35–45	PCIe Gen5 limited
4×H800A PCIe + NVLink Bridge	200–230	Bridge is pairwise only

If busbw is more than 20% below this, something is off — common cause is a stale NCCL_P2P_DISABLE=1 env var or a downed NVLink lane. Check nvidia-smi nvlink -s first.

3. Multi-node (validate IB)

2 nodes, 16 GPUs, hosts node1 and node2:

mpirun -np 16 -H node1:8,node2:8 \
  --bind-to numa \
  -x NCCL_DEBUG=INFO \
  -x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7 \
  -x NCCL_IB_GID_INDEX=3 \
  -x NCCL_SOCKET_IFNAME=ib0 \
  -x LD_LIBRARY_PATH \
  ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 1

Reference (H800A + 8× NDR-400Gbps HCAs + GDR):

Nodes	busbw at large sizes (GB/s)
2	350–380
4	320–360
8	280–320
16	240–280

Falloff with node count is expected — more hops, more convergence. But 2 nodes below 200 GB/s = something to debug.

4. Triage when busbw is low

Re-run with NCCL_DEBUG=INFO. Walk through:

GDR enabled? Look for [send] via NET/IB/X/GDRDMA in the log. No GDRDMA substring → GDR isn't taking effect; check lsmod | grep nvidia_peermem.
Right HCAs selected? NCCL INFO NET/IB : Using [0]mlx5_0 ... should list as many NICs as you physically have. If fewer, NCCL_IB_HCA is wrong or the NIC is on a different NUMA.
Correct GID index? RoCEv2 ~ 3, pure IB ~ 0. Wrong value: connects but is silently slow.
PXN on? Should see NCCL INFO PXN ENABLED. Without it, intra-NUMA cross-GPU traffic crosses host memory — costs ~30%.
Algorithm: ring vs tree? NCCL auto-picks (tree for small sizes, ring for large). Hard-coding NCCL_ALGO=Ring regresses small-size perf.

5. Minimum acceptance script for handover

Three commands we ship to customers, each with a pass line:

# 1) Intra-node NVLink — busbw @ 8GB must be >= 460 GB/s (H800A)
./build/all_reduce_perf -b 1G -e 8G -f 2 -g 8

# 2) Node pair IB — busbw @ 8GB must be >= 350 GB/s (2 nodes, NDR)
mpirun -np 16 -H n1:8,n2:8 ... ./build/all_reduce_perf -b 1G -e 8G -f 2 -g 1

# 3) Full-cluster ring — falloff vs (1) must be within 25%
mpirun -np <total> -H ... ./build/all_reduce_perf -b 1G -e 8G -f 2 -g 1

All three pass → cleared for training. Any miss → fix before bringing a real job up; debugging from inside a 70B finetune is much harder.