Alaya NeW Cloud

InfiniBand Fabric for GPU Clusters — bandwidth, rail-aligned topology, and triage

Why LLM clusters insist on InfiniBand, how rail-aligned saves 30% training time, and the ibstat / ibdev2netdev / ib_write_bw routine that fixes most field issues.

1. IB vs RoCE — pick the path before talking speed

InfiniBand (IB) is the Mellanox / NVIDIA stack with native lossless, credit-based flow control. RoCEv2 runs IB verbs over Ethernet + DCB (PFC + ECN). Bandwidth is the same; the differences:

AxisIBRoCEv2
Flow controlPer-link credits, lossless out of the boxRelies on DCB; misconfig drops packets
End-to-end latency~1 µs~1.5–2 µs
Operations effortSingle stack, subnet manager owns the fabricMulti-vendor friendly, but tuning is brutal
CostPricier switchesCheap if reusing Ethernet fabric

Rule of thumb: training clusters of 256 GPUs and up — go IB. Hybrid-cloud / brownfield Ethernet / inference-heavy fleets — RoCEv2 is fine, but you must get PFC + ECN right. A misconfigured RoCE is twice as slow as IB.

2. Speed generations

GenPer-port rateYearTypical use
EDR100 Gbps2014Early A100 clusters
HDR200 Gbps2018Mainstream A100
NDR400 Gbps2022Mainstream H100
XDR800 Gbps2024–25H200 / B200

A standard H100 node is "8 GPUs + 8 NDR-400 HCAs" — one dedicated NIC per GPU, 3.2 Tbps node egress. Drop to 4 HCAs and you halve it to 1.6 Tbps. "Saving on NICs" is one of the biggest LLM-training perf killers we see.

3. Why rail-aligned matters for LLM workloads

Collective traffic in LLM training has a fixed rank pattern: data-parallel all-reduce always pairs rank-0 ↔ rank-0, rank-1 ↔ rank-1, …; tensor-parallel stays inside a node.

Rail-aligned wiring puts "same-index GPUs on the same leaf":

Node A: GPU0─HCA0 ┐    Node B: GPU0─HCA0 ┐
        GPU1─HCA1 │            GPU1─HCA1 │
        ...       │ Leaf #0    ...       │ Leaf #0
        GPU7─HCA7 ┘            GPU7─HCA7 ┘
                  Leaf #1                  Leaf #1
                  ...                      ...
                  Leaf #7                  Leaf #7

→ DP traffic stays inside one rail (one leaf), never traverses the spine. Hops drop from 5 to 3 — 30% lower latency, 30% steadier bandwidth.

On a 256-GPU cluster we measured rail-aligned + balanced delivering 28% higher training throughput than random wiring. That is real money.

4. PCIe affinity & GDR

At NDR, the PCIe topology becomes the new bottleneck. Read GPU↔NIC affinity from nvidia-smi topo -m:

        GPU0  GPU1  ...  mlx5_0  mlx5_1
GPU0     X    NV4   ...   PIX     SYS
GPU1    NV4    X    ...   SYS     PIX

What you want: each GPU shows PIX against its dedicated NIC (same PCIe switch). SYS (cross-NUMA via host) breaks GDR direct GPU↔NIC DMA — and halves perf.

Three steps to enable GDR:

sudo modprobe nvidia_peermem    # older name: nv_peer_mem
lsmod | grep nvidia_peermem
# NCCL picks GDR automatically; grep the startup log for "GDRDMA"

5. Field commands you'll keep using

# 1) Is the NIC active?
ibstat                          # State: Active, PhysState: LinkUp
ibstatus                        # rate: 400 Gb/sec (4xNDR)

# 2) IB device ↔ Linux netdev mapping
ibdev2netdev -v
# mlx5_0 port 1 ==> ibp1s0 (Up)

# 3) Point-to-point bandwidth
# server
ib_write_bw -d mlx5_0 --report_gbits
# client
ib_write_bw -d mlx5_0 --report_gbits <server_ip>

# 4) Point-to-point latency
ib_write_lat -d mlx5_0 <server_ip>

# 5) Subnet Manager is alive (IB requires exactly one)
sminfo

# 6) Switch port error counters
ibqueryerrors --details

ib_write_bw should hit ~390 Gbps on NDR-400 (not the nominal 400 — encoding overhead). Lower than that = physical link or driver issue. Rising LinkErrorRecoveryCounter / SymbolErrorCounter from ibqueryerrors = bad optic or cable. Replace.

6. Recurring field issues

  • "Two-node NCCL is half-speed" — check nvidia-smi topo -m for SYS instead of PIX. Almost always the NIC was installed in the wrong PCIe slot at rack-build time.
  • "Throughput dropped 30% overnight"ibqueryerrors showing growing SymbolErrorCounter. Usually a dirty or kinked fiber; SM auto-downgrades the link.
  • "RoCE is half the speed of IB on the same wire" — switch DCB missing PFC, or PFC priority misaligned with ECN. Tuning RoCE end-to-end is roughly two engineer-weeks; budget accordingly.
  • "Node X dropped out of the fabric"ibhosts shows non-contiguous LIDs. The HCA reset and didn't rejoin; bouncing opensmd puts it back.

7. Network acceptance at handover

Pair this with the NCCL-tests acceptance from the previous post. Minimum five checks:

  1. ibstat — all ports Active and at rated speed
  2. nvidia-smi topo -m — every GPU↔dedicated NIC pair shows PIX
  3. nvidia_peermem loaded
  4. ib_write_bw point-to-point >= 380 Gbps (NDR)
  5. ibqueryerrors --details — counters at 0 or not increasing

Pass all five, then run NCCL-tests. If the order is reversed, you can't tell whether a low busbw is hardware or software.

Last updated on

Was this page helpful?

On this page