InfiniBand Fabric for GPU Clusters — bandwidth, rail-aligned topology, and triage
Why LLM clusters insist on InfiniBand, how rail-aligned saves 30% training time, and the ibstat / ibdev2netdev / ib_write_bw routine that fixes most field issues.
1. IB vs RoCE — pick the path before talking speed
InfiniBand (IB) is the Mellanox / NVIDIA stack with native lossless, credit-based flow control. RoCEv2 runs IB verbs over Ethernet + DCB (PFC + ECN). Bandwidth is the same; the differences:
| Axis | IB | RoCEv2 |
|---|---|---|
| Flow control | Per-link credits, lossless out of the box | Relies on DCB; misconfig drops packets |
| End-to-end latency | ~1 µs | ~1.5–2 µs |
| Operations effort | Single stack, subnet manager owns the fabric | Multi-vendor friendly, but tuning is brutal |
| Cost | Pricier switches | Cheap if reusing Ethernet fabric |
Rule of thumb: training clusters of 256 GPUs and up — go IB. Hybrid-cloud / brownfield Ethernet / inference-heavy fleets — RoCEv2 is fine, but you must get PFC + ECN right. A misconfigured RoCE is twice as slow as IB.
2. Speed generations
| Gen | Per-port rate | Year | Typical use |
|---|---|---|---|
| EDR | 100 Gbps | 2014 | Early A100 clusters |
| HDR | 200 Gbps | 2018 | Mainstream A100 |
| NDR | 400 Gbps | 2022 | Mainstream H100 |
| XDR | 800 Gbps | 2024–25 | H200 / B200 |
A standard H100 node is "8 GPUs + 8 NDR-400 HCAs" — one dedicated NIC per GPU, 3.2 Tbps node egress. Drop to 4 HCAs and you halve it to 1.6 Tbps. "Saving on NICs" is one of the biggest LLM-training perf killers we see.
3. Why rail-aligned matters for LLM workloads
Collective traffic in LLM training has a fixed rank pattern: data-parallel all-reduce always pairs rank-0 ↔ rank-0, rank-1 ↔ rank-1, …; tensor-parallel stays inside a node.
Rail-aligned wiring puts "same-index GPUs on the same leaf":
Node A: GPU0─HCA0 ┐ Node B: GPU0─HCA0 ┐
GPU1─HCA1 │ GPU1─HCA1 │
... │ Leaf #0 ... │ Leaf #0
GPU7─HCA7 ┘ GPU7─HCA7 ┘
Leaf #1 Leaf #1
... ...
Leaf #7 Leaf #7→ DP traffic stays inside one rail (one leaf), never traverses the spine. Hops drop from 5 to 3 — 30% lower latency, 30% steadier bandwidth.
On a 256-GPU cluster we measured rail-aligned + balanced delivering 28% higher training throughput than random wiring. That is real money.
4. PCIe affinity & GDR
At NDR, the PCIe topology becomes the new bottleneck. Read GPU↔NIC affinity from nvidia-smi topo -m:
GPU0 GPU1 ... mlx5_0 mlx5_1
GPU0 X NV4 ... PIX SYS
GPU1 NV4 X ... SYS PIXWhat you want: each GPU shows PIX against its dedicated NIC (same PCIe switch). SYS (cross-NUMA via host) breaks GDR direct GPU↔NIC DMA — and halves perf.
Three steps to enable GDR:
sudo modprobe nvidia_peermem # older name: nv_peer_mem
lsmod | grep nvidia_peermem
# NCCL picks GDR automatically; grep the startup log for "GDRDMA"5. Field commands you'll keep using
# 1) Is the NIC active?
ibstat # State: Active, PhysState: LinkUp
ibstatus # rate: 400 Gb/sec (4xNDR)
# 2) IB device ↔ Linux netdev mapping
ibdev2netdev -v
# mlx5_0 port 1 ==> ibp1s0 (Up)
# 3) Point-to-point bandwidth
# server
ib_write_bw -d mlx5_0 --report_gbits
# client
ib_write_bw -d mlx5_0 --report_gbits <server_ip>
# 4) Point-to-point latency
ib_write_lat -d mlx5_0 <server_ip>
# 5) Subnet Manager is alive (IB requires exactly one)
sminfo
# 6) Switch port error counters
ibqueryerrors --detailsib_write_bw should hit ~390 Gbps on NDR-400 (not the nominal 400 — encoding overhead). Lower than that = physical link or driver issue. Rising LinkErrorRecoveryCounter / SymbolErrorCounter from ibqueryerrors = bad optic or cable. Replace.
6. Recurring field issues
- "Two-node NCCL is half-speed" — check
nvidia-smi topo -mfor SYS instead of PIX. Almost always the NIC was installed in the wrong PCIe slot at rack-build time. - "Throughput dropped 30% overnight" —
ibqueryerrorsshowing growingSymbolErrorCounter. Usually a dirty or kinked fiber; SM auto-downgrades the link. - "RoCE is half the speed of IB on the same wire" — switch DCB missing PFC, or PFC priority misaligned with ECN. Tuning RoCE end-to-end is roughly two engineer-weeks; budget accordingly.
- "Node X dropped out of the fabric" —
ibhostsshows non-contiguous LIDs. The HCA reset and didn't rejoin; bouncingopensmdputs it back.
7. Network acceptance at handover
Pair this with the NCCL-tests acceptance from the previous post. Minimum five checks:
ibstat— all ports Active and at rated speednvidia-smi topo -m— every GPU↔dedicated NIC pair showsPIXnvidia_peermemloadedib_write_bwpoint-to-point >= 380 Gbps (NDR)ibqueryerrors --details— counters at 0 or not increasing
Pass all five, then run NCCL-tests. If the order is reversed, you can't tell whether a low busbw is hardware or software.
Last updated on
