GPU Monitoring & XID triage — six tables for on-call

What to read in nvidia-smi vs DCGM, what XID 79 / 31 / 119 actually mean, and when to RMA. A pocket reference for SREs and escalation engineers.

Context

Across customer fleets running H800A / A100 / 4090, we built a cheatsheet for what to look at first when something goes wrong. The six tables below cover the must-watch nvidia-smi fields, the DCGM metrics worth scraping, the most common XID errors, where the logs live, temperature and power thresholds, and when to RMA. Print and pin to the wall.

Table 1 — the nvidia-smi fields that matter

nvidia-smi --query-gpu=index,name,temperature.gpu,utilization.gpu,memory.used,memory.total,power.draw,ecc.errors.uncorrected.volatile.total,pstate --format=csv

Field	Healthy range	Meaning
`temperature.gpu`	H800A < 85°C / A100 < 85°C (warn temps from Table 5)	Sustained readings above this head toward thermal throttling
`utilization.gpu`	training >= 90% / inference workload-dependent	Training stuck < 70% = data or scheduling bottleneck
`memory.used`	up to 95% of total	Leave 1–2 GB for NCCL buffers
`power.draw`	H800A SXM <= 700W	At or above the cap the GPU is power-limited: clocks drop, and the platform may assert a hardware power brake
`ecc.errors.uncorrected.volatile.total`	must be 0	Non-zero = drain workloads immediately
`pstate`	P0 in training	P8 (or any higher number) means idle or held-down clocks; GeForce cards such as the 4090 often run CUDA work in P2
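
The ranges in Table 1 can be checked mechanically. The sketch below is one possible way to do it, not a reference implementation: it assumes the query is run with `--format=csv,noheader,nounits` (the text above uses plain `csv`), and the function names and default limits are illustrative choices taken from the H800A SXM rows.

```python
import csv
import io

# Columns in the order of the query above; extra trailing columns are ignored.
# Assumes --format=csv,noheader,nounits so every value is a bare number.
COLUMNS = ["index", "name", "temp_c", "util_pct", "mem_used_mib",
           "mem_total_mib", "power_w", "ecc_uncorrected"]


def to_float(value):
    """Return a float, or None for values such as "[N/A]"."""
    try:
        return float(value)
    except ValueError:
        return None


def check_row(row, temp_limit_c=85.0, power_cap_w=700.0):
    """Return the list of Table 1 checks that one GPU fails."""
    warnings = []
    temp = to_float(row["temp_c"])
    used = to_float(row["mem_used_mib"])
    total = to_float(row["mem_total_mib"])
    power = to_float(row["power_w"])
    ecc = to_float(row["ecc_uncorrected"])
    if temp is not None and temp >= temp_limit_c:
        warnings.append("temperature")
    if used is not None and total and used / total > 0.95:
        warnings.append("memory")
    if power is not None and power > power_cap_w:
        warnings.append("power")
    if ecc is not None and ecc > 0:
        warnings.append("ecc")
    return warnings


def check_output(text, **limits):
    """Map GPU index to failed checks for each CSV line of output."""
    result = {}
    for fields in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not fields:
            continue
        row = dict(zip(COLUMNS, fields))
        result[row["index"]] = check_row(row, **limits)
    return result
```

Pass per-card limits (for example `power_cap_w=400.0` for an A100 SXM4) rather than relying on the defaults.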

Table 2 — metrics only DCGM can give you

Polling nvidia-smi gives you about one sample per second and no profiling counters, which is too coarse for production. Run dcgm-exporter and scrape these into Prometheus:

Metric	Use
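`DCGM_FI_PROF_SM_ACTIVE`	True SM activity (vs `utilization.gpu`, which only says that some kernel was running during the sample window)
`DCGM_FI_PROF_PIPE_TENSOR_ACTIVE`	Tensor Core utilization. < 30% = your op isn't going through the Tensor Cores
`DCGM_FI_PROF_DRAM_ACTIVE`	HBM bandwidth utilization
`DCGM_FI_PROF_NVLINK_TX_BYTES` / `DCGM_FI_PROF_NVLINK_RX_BYTES`	NVLink traffic per direction; compare against the card's NVLink peak (900 GB/s on H100-class SXM5 is the bidirectional total, so about 450 GB/s each way)
`DCGM_FI_DEV_XID_ERRORS`	Value of the most recent XID (not a count); alert on any non-zero value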

A common bad picture: GPU util 90% with Tensor Core activity at 25%. That usually means an op is running in fp32 that should be in fp16.
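
The "busy but not on Tensor Cores" pattern is easy to turn into a rule. The following is a rough sketch under stated assumptions: `diagnose` and its labels are hypothetical, the 70% and 30% cut-offs come from Tables 1 and 2, and the Tensor Core value is assumed to be the 0.0 to 1.0 ratio that DCGM profiling fields report.

```python
def diagnose(gpu_util_pct, tensor_active):
    """Classify one training GPU sample.

    gpu_util_pct:  utilization.gpu from nvidia-smi, 0-100.
    tensor_active: DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, assumed 0.0-1.0.
    """
    if gpu_util_pct < 70:
        # Table 1: training stuck below 70% points at data or scheduling.
        return "starved"
    if tensor_active < 0.30:
        # Table 2: the GPU is busy, but the ops bypass the Tensor Cores.
        return "tensor-cores-idle"
    return "ok"
```

For the example in the text, `diagnose(90, 0.25)` lands in the `tensor-cores-idle` bucket, which is the cue to check operator dtypes.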

Table 3 — common XID errors and what to do

XID	Meaning	First-line action
13	Graphics Engine Exception	Driver bug or app OOB; restart process. Repeated → downgrade driver
31	MMU fault, address out of range	Usually user-code CUDA OOB, not hardware
43	GPU stopped processing	Process crashed or OOM; restart container
63	ECC page retirement	HBM correctable error, page now masked. Log and watch
64	ECC page retirement failure	Retirement failed → drain the node
74	NVLink error	Check `nvidia-smi nvlink -e`; multi-GPU at once → reboot node
79	GPU fallen off the bus	Hardware fault; if reboot doesn't help, RMA
119 / 120	GSP RPC timeout / GSP error	GPU System Processor firmware stopped responding or faulted; treat as unrecoverable on the live node: drain, escalate for RMA

Our policy in the field: 13/31/43 auto-retry once; 63 with running count > 5 → drain & flag; 64/79/119/120 → cordon node + escalate to hardware ops.
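
As a sketch, that policy can be written as a small lookup. The function name and action labels are hypothetical. Two branches are assumptions, because the policy above does not cover them: what happens after the single auto-retry, and what to do with XIDs outside the listed groups (such as 74).

```python
RETRY_ONCE = {13, 31, 43}
ESCALATE = {64, 79, 119, 120}


def xid_action(xid, retries_so_far=0, xid63_count=0):
    """Map an XID to a first-line action following the field policy.

    retries_so_far: auto-retries already spent on this job.
    xid63_count:    running count of XID 63 on this GPU, including this one.
    """
    if xid in ESCALATE:
        return "cordon-and-escalate"
    if xid == 63:
        return "drain-and-flag" if xid63_count > 5 else "log-and-watch"
    if xid in RETRY_ONCE:
        # Assumption: once the single retry is used up, hand to a human.
        return "retry" if retries_so_far == 0 else "hand-to-human"
    # Assumption: anything not named in the policy gets a manual look.
    return "investigate"
```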

Table 4 — where the logs live

# Kernel
dmesg -T | grep -i -E "nvrm|nvidia|xid"
journalctl -k --since "1 hour ago" | grep -i nvidia

# Userspace (persistent)
/var/log/nvidia-installer.log
/var/log/syslog or /var/log/messages

# DCGM health checks
dcgmi diag -r 3                 # ~30-min Lv3 sweep
dcgmi diag -r 1 -j > diag.json  # quick check, usually well under a minute

Always copy the timestamp of any XID into the ticket; without it, RMA threads can stall for months.
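
Pulling the timestamp, PCI address, and XID number out of `dmesg -T` output can be scripted. This is a minimal sketch that assumes the usual `NVRM: Xid (PCI:...): <number>, ...` kernel message layout and the bracketed human-readable timestamp that `-T` prints; the regex and the function name are illustrative, and other log formats (for example journalctl output) would need a different prefix pattern.

```python
import re

# Matches lines such as:
# [Mon Mar  4 10:15:32 2024] NVRM: Xid (PCI:0000:3b:00): 79, pid=..., ...
XID_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+)"
)


def parse_xid_lines(lines):
    """Return one dict per XID line: timestamp, PCI address, XID number."""
    events = []
    for line in lines:
        match = XID_RE.search(line)
        if match:
            events.append({
                "timestamp": match.group("ts"),
                "pci": match.group("pci"),
                "xid": int(match.group("xid")),
            })
    return events
```

Feed it the lines of `dmesg -T` and paste the resulting timestamp and PCI address straight into the ticket.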

Table 5 — temperature & power thresholds in practice

Card	Inlet temp	Warn temp	Hard throttle	Per-card power cap
H800A SXM5	<= 32°C	85°C	92°C	700W
H800A PCIe	<= 30°C	84°C	91°C	350W
A100 SXM4	<= 30°C	85°C	92°C	400W
4090	<= 30°C	83°C	88°C	450W

If inlet is over 35°C, talk to the data center, not NVIDIA: most "sudden performance drops" we see trace back to air-conditioning or airflow problems.
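
Table 5 and the 35°C inlet rule can be combined into a single check. The sketch below copies the thresholds from the table; the function name, the state labels, and the order in which conditions are tested (facility first) are assumptions made for illustration.

```python
# (max inlet °C, warn °C, hard throttle °C), copied from Table 5.
THRESHOLDS = {
    "H800A SXM5": (32, 85, 92),
    "H800A PCIe": (30, 84, 91),
    "A100 SXM4": (30, 85, 92),
    "4090": (30, 83, 88),
}


def thermal_state(card, gpu_temp_c, inlet_temp_c):
    """Classify a GPU temperature reading against Table 5."""
    max_inlet, warn, throttle = THRESHOLDS[card]
    if inlet_temp_c > 35:
        return "facility"      # over 35°C inlet: a data center problem
    if gpu_temp_c >= throttle:
        return "throttle"
    if gpu_temp_c >= warn:
        return "warn"
    if inlet_temp_c > max_inlet:
        return "inlet-high"    # GPU still fine, but inlet is out of spec
    return "ok"
```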

Table 6 — when to RMA

ecc.errors.uncorrected.volatile.total non-zero, ever
Any FAIL in dcgmi diag -r 3
XID 64 / 79 / 119 / 120 even once
Cumulative HBM retired pages > 64
Per-link NVLink CRC error rate > 1e-9

Hit any one → drain workloads, escalate to hardware ops. Alaya customers see these aggregated under Console → Node Health, with auto-tickets when thresholds are crossed.
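
For teams wiring these rules into their own tooling, the Table 6 criteria reduce to a handful of comparisons. This is a sketch only: the `NodeHealth` record and its field names are hypothetical, and how each value is collected (nvidia-smi, DCGM, log parsing) is left out.

```python
from dataclasses import dataclass, field

RMA_XIDS = {64, 79, 119, 120}


@dataclass
class NodeHealth:
    """Hypothetical per-GPU summary; field names are illustrative."""
    ecc_uncorrected: int = 0
    diag_r3_failed: bool = False
    xids_seen: set = field(default_factory=set)
    retired_pages: int = 0
    nvlink_crc_error_rate: float = 0.0


def rma_reasons(health):
    """Return every Table 6 criterion this GPU meets (empty list = healthy)."""
    reasons = []
    if health.ecc_uncorrected > 0:
        reasons.append("ecc")
    if health.diag_r3_failed:
        reasons.append("diag")
    if health.xids_seen & RMA_XIDS:
        reasons.append("xid")
    if health.retired_pages > 64:
        reasons.append("retired-pages")
    if health.nvlink_crc_error_rate > 1e-9:
        reasons.append("nvlink-crc")
    return reasons
```

Any non-empty result maps to the same action as above: drain workloads and escalate to hardware ops.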