InfiniBand Fabric for GPU Clusters — bandwidth, rail-aligned topology, and triage

Why LLM clusters insist on InfiniBand, how rail-aligned saves 30% training time, and the ibstat / ibdev2netdev / ib_write_bw routine that fixes most field issues.

1. IB vs RoCE — pick the path before talking speed

InfiniBand (IB) is the Mellanox / NVIDIA stack with native lossless, credit-based flow control. RoCEv2 runs IB verbs over Ethernet + DCB (PFC + ECN). Bandwidth is the same; the differences:

Axis	IB	RoCEv2
Flow control	Per-link credits, lossless out of the box	Relies on DCB; misconfig drops packets
End-to-end latency	~1 µs	~1.5–2 µs
Operations effort	Single stack, subnet manager owns the fabric	Multi-vendor friendly, but tuning is brutal
Cost	Pricier switches	Cheap if reusing Ethernet fabric

Rule of thumb: training clusters of 256 GPUs and up — go IB. Hybrid-cloud / brownfield Ethernet / inference-heavy fleets — RoCEv2 is fine, but you must get PFC + ECN right. A misconfigured RoCE is twice as slow as IB.

2. Speed generations

Gen	Per-port rate	Year	Typical use
EDR	100 Gbps	2014	Early A100 clusters
HDR	200 Gbps	2018	Mainstream A100
NDR	400 Gbps	2022	Mainstream H800A
XDR	800 Gbps	2024–25	H200 / B200

A standard H800A node is "8 GPUs + 8 NDR-400 HCAs" — one dedicated NIC per GPU, 3.2 Tbps node egress. Drop to 4 HCAs and you halve it to 1.6 Tbps. "Saving on NICs" is one of the biggest LLM-training perf killers we see.

3. Why rail-aligned matters for LLM workloads

Collective traffic in LLM training has a fixed rank pattern: data-parallel all-reduce always pairs rank-0 ↔ rank-0, rank-1 ↔ rank-1, …; tensor-parallel stays inside a node.

Rail-aligned wiring puts "same-index GPUs on the same leaf":

Node A: GPU0─HCA0 ┐    Node B: GPU0─HCA0 ┐
        GPU1─HCA1 │            GPU1─HCA1 │
        ...       │ Leaf #0    ...       │ Leaf #0
        GPU7─HCA7 ┘            GPU7─HCA7 ┘
                  Leaf #1                  Leaf #1
                  ...                      ...
                  Leaf #7                  Leaf #7

→ DP traffic stays inside one rail (one leaf), never traverses the spine. Hops drop from 5 to 3 — 30% lower latency, 30% steadier bandwidth.

On a 256-GPU cluster we measured rail-aligned + balanced delivering 28% higher training throughput than random wiring. That is real money.

4. PCIe affinity & GDR

At NDR, the PCIe topology becomes the new bottleneck. Read GPU↔NIC affinity from nvidia-smi topo -m:

        GPU0  GPU1  ...  mlx5_0  mlx5_1
GPU0     X    NV4   ...   PIX     SYS
GPU1    NV4    X    ...   SYS     PIX

What you want: each GPU shows PIX against its dedicated NIC (same PCIe switch). SYS (cross-NUMA via host) breaks GDR direct GPU↔NIC DMA — and halves perf.

Three steps to enable GDR:

sudo modprobe nvidia_peermem    # older name: nv_peer_mem
lsmod | grep nvidia_peermem
# NCCL picks GDR automatically; grep the startup log for "GDRDMA"

5. Field commands you'll keep using

# 1) Is the NIC active?
ibstat                          # State: Active, PhysState: LinkUp
ibstatus                        # rate: 400 Gb/sec (4xNDR)

# 2) IB device ↔ Linux netdev mapping
ibdev2netdev -v
# mlx5_0 port 1 ==> ibp1s0 (Up)

# 3) Point-to-point bandwidth
# server
ib_write_bw -d mlx5_0 --report_gbits
# client
ib_write_bw -d mlx5_0 --report_gbits <server_ip>

# 4) Point-to-point latency
ib_write_lat -d mlx5_0 <server_ip>

# 5) Subnet Manager is alive (IB requires exactly one)
sminfo

# 6) Switch port error counters
ibqueryerrors --details

ib_write_bw should hit ~390 Gbps on NDR-400 (not the nominal 400 — encoding overhead). Lower than that = physical link or driver issue. Rising LinkErrorRecoveryCounter / SymbolErrorCounter from ibqueryerrors = bad optic or cable. Replace.

6. Recurring field issues

"Two-node NCCL is half-speed" — check nvidia-smi topo -m for SYS instead of PIX. Almost always the NIC was installed in the wrong PCIe slot at rack-build time.
"Throughput dropped 30% overnight" — ibqueryerrors showing growing SymbolErrorCounter. Usually a dirty or kinked fiber; SM auto-downgrades the link.
"RoCE is half the speed of IB on the same wire" — switch DCB missing PFC, or PFC priority misaligned with ECN. Tuning RoCE end-to-end is roughly two engineer-weeks; budget accordingly.
"Node X dropped out of the fabric" — ibhosts shows non-contiguous LIDs. The HCA reset and didn't rejoin; bouncing opensmd puts it back.

7. Network acceptance at handover

Pair this with the NCCL-tests acceptance from the previous post. Minimum five checks:

ibstat — all ports Active and at rated speed
nvidia-smi topo -m — every GPU↔dedicated NIC pair shows PIX
nvidia_peermem loaded
ib_write_bw point-to-point >= 380 Gbps (NDR)
ibqueryerrors --details — counters at 0 or not increasing

Pass all five, then run NCCL-tests. If the order is reversed, you can't tell whether a low busbw is hardware or software.