GPU Monitoring & XID triage — six tables for on-call
What to read in nvidia-smi vs DCGM, what XID 79 / 31 / 119 actually mean, and when to RMA. A pocket reference for SREs and escalation engineers.
Context
Across customer fleets running H100 / A100 / 4090, we built a "what to look at first when something goes wrong" cheatsheet. The six tables below cover the must-watch metrics, the must-install tools, the most common XID errors, temperature and power thresholds, and when to RMA. Print and pin to the wall.
Table 1 — eight nvidia-smi fields that matter
nvidia-smi --query-gpu=index,name,temperature.gpu,utilization.gpu,memory.used,memory.total,power.draw,ecc.errors.uncorrected.volatile.total --format=csv

| Field | Healthy range | Meaning |
|---|---|---|
| temperature.gpu | H100 < 85°C / A100 < 88°C | Sustained breach triggers thermal throttling |
| utilization.gpu | training >= 90% / inference workload-dependent | Training stuck < 70% = data or scheduling bottleneck |
| memory.used | up to 95% of total | Leave 1–2 GB for NCCL buffers |
| power.draw | H100 SXM <= 700W | Above this and BMC will pull voltage |
| ecc.errors.uncorrected.volatile.total | must be 0 | Non-zero = drain workloads immediately |
| pstate | P0 in training | Idle cards usually drop to P8; P2 under compute load is normal on GeForce cards such as the 4090 |
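The thresholds in Table 1 can be checked mechanically. The sketch below is one possible way to do it, not a reference implementation: it assumes the query above was run with `--format=csv,noheader,nounits` so every row is eight bare values, and the function name `check_gpu_rows` is made up for illustration. The power column is parsed but not checked, because the product name does not reliably distinguish SXM from PCIe.

```python
import csv
import io

# Thresholds copied from Table 1. 70% is the "training stuck" floor.
TEMP_LIMIT_C = {"H100": 85, "A100": 88}
MEM_USED_MAX_FRACTION = 0.95
TRAINING_UTIL_FLOOR = 70


def check_gpu_rows(csv_text):
    """Return (gpu_index, problem) tuples for rows that breach Table 1.

    Assumes output of the Table 1 query with --format=csv,noheader,nounits.
    """
    findings = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) != 8:
            continue  # skip blank or malformed lines
        index, name, temp, util, mem_used, mem_total, _power, ecc = (
            cell.strip() for cell in row
        )
        for model, limit in TEMP_LIMIT_C.items():
            if model in name and float(temp) >= limit:
                findings.append((index, f"temperature {temp}C >= {limit}C"))
        if float(util) < TRAINING_UTIL_FLOOR:
            findings.append((index, f"utilization {util}% < {TRAINING_UTIL_FLOOR}%"))
        if float(mem_used) > MEM_USED_MAX_FRACTION * float(mem_total):
            findings.append((index, "memory.used above 95% of total"))
        # ECC reads "[N/A]" on cards without ECC, so only numeric values count.
        if ecc.isdigit() and int(ecc) > 0:
            findings.append((index, f"{ecc} uncorrected ECC errors: drain"))
    return findings
```

Feeding it the output of `subprocess.run([...], capture_output=True, text=True).stdout` is enough for a cron-style check; the utilization rule only makes sense on nodes that are supposed to be training.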
Table 2 — metrics only DCGM can give you
nvidia-smi samples at 1 Hz, too coarse for production. Run dcgm-exporter and scrape these into Prometheus:
| Metric | Use |
|---|---|
| DCGM_FI_PROF_SM_ACTIVE | True SM activity (utilization.gpu only reports the share of time any kernel was running) |
| DCGM_FI_PROF_PIPE_TENSOR_ACTIVE | Tensor Core utilization. < 30% = your op isn't going through TC |
| DCGM_FI_PROF_DRAM_ACTIVE | HBM bandwidth utilization |
| DCGM_FI_PROF_NVLINK_TX_BYTES / DCGM_FI_PROF_NVLINK_RX_BYTES | NVLink traffic; on H100 one direction tops out near 450 GB/s (900 GB/s is the bidirectional total) |
| DCGM_FI_DEV_XID_ERRORS | Value of the most recent XID error (an error code, not a rate) |
A common bad picture: GPU util 90% + Tensor Core 25% — you're running an op in fp32 that should be fp16.
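That rule of thumb is easy to encode as an alert helper. This is a minimal sketch under stated assumptions: `diagnose_tensor_usage` is a hypothetical name, `gpu_util_pct` is `utilization.gpu` in percent, and `tensor_active` is `DCGM_FI_PROF_PIPE_TENSOR_ACTIVE` as a 0.0 to 1.0 ratio. The 90% and 30% cut-offs come from Tables 1 and 2.

```python
def diagnose_tensor_usage(gpu_util_pct, tensor_active):
    """Classify a (utilization.gpu, tensor pipe activity) pair.

    gpu_util_pct:  utilization.gpu in percent (0-100).
    tensor_active: DCGM_FI_PROF_PIPE_TENSOR_ACTIVE as a ratio (0.0-1.0).
    """
    busy = gpu_util_pct >= 90
    tensor_low = tensor_active < 0.30
    if busy and tensor_low:
        return "busy but off tensor cores: check for fp32 ops that should be fp16"
    if not busy:
        return "GPU not saturated: look at data loading or scheduling first"
    return "ok"
```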
Table 3 — common XID errors and what to do
| XID | Meaning | First-line action |
|---|---|---|
| 13 | Graphics Engine Exception | Driver bug or app OOB; restart process. Repeated → downgrade driver |
| 31 | MMU fault, address out of range | Usually user-code CUDA OOB, not hardware |
| 43 | GPU stopped processing | Process crashed or OOM; restart container |
| 63 | ECC page retirement | HBM correctable error, page now masked. Log and watch |
| 64 | ECC page retirement failure | Retirement failed → drain the node |
| 74 | NVLink error | Check nvidia-smi nvlink -e; multi-GPU at once → reboot node |
| 79 | GPU fallen off the bus | Hardware fault; if reboot doesn't help, RMA |
| 119 / 120 | GSP RPC timeout / GSP error | Unrecoverable; drain, RMA |
Our policy in the field: 13/31/43 auto-retry once; 63 with running count > 5 → drain & flag; 64/79/119/120 → cordon node + escalate to hardware ops.
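The policy above maps directly onto a small lookup. The sketch below is illustrative only: the action strings and the name `xid_action` are invented here, and the fallback for XIDs the policy does not mention (74, for example), or for a second failure after the one retry, is an assumption that sends them to a human.

```python
RETRY_ONCE = {13, 31, 43}
ESCALATE = {64, 79, 119, 120}
PAGE_RETIREMENT_XID = 63
PAGE_RETIREMENT_LIMIT = 5  # "running count > 5" from the policy


def xid_action(xid, retries_so_far=0, xid63_count=0):
    """Return the first-line action for an XID under the field policy."""
    if xid in ESCALATE:
        return "cordon_and_escalate"
    if xid == PAGE_RETIREMENT_XID:
        if xid63_count > PAGE_RETIREMENT_LIMIT:
            return "drain_and_flag"
        return "log_and_watch"
    if xid in RETRY_ONCE and retries_so_far < 1:
        return "retry"
    # Assumption: anything else, including a failed retry, goes to a human.
    return "escalate_to_human"
```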
Table 4 — where the logs live
# Kernel
dmesg -T | grep -i -E "nvrm|nvidia|xid"
journalctl -k --since "1 hour ago" | grep -i nvidia
# Userspace (persistent)
/var/log/nvidia-installer.log
/var/log/syslog or /var/log/messages
# DCGM health checks
dcgmi diag -r 3 # ~30-min Lv3 sweep
dcgmi diag -r 1 -j > diag.json # 5-min quick check

Always copy the timestamp of any XID into the ticket; RMA chains lose months otherwise.
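Pulling the timestamp, PCI address, and XID number out of `dmesg -T` output can be scripted so nothing is retyped by hand. The sketch below assumes the usual kernel line shape `NVRM: Xid (PCI:0000:3b:00): 79, ...` preceded by the bracketed timestamp that `-T` prints; driver versions vary, so treat the regex as a starting point. `extract_xids` is a hypothetical helper name.

```python
import re

# Matches e.g. "[Mon Mar  3 14:02:11 2025] NVRM: Xid (PCI:0000:3b:00): 79, ..."
XID_LINE = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\].*?"
    r"NVRM: Xid \((?P<pci>PCI:[0-9A-Fa-f:.]+)\): (?P<xid>\d+)"
)


def extract_xids(dmesg_text):
    """Return one dict per XID line found in `dmesg -T` output."""
    events = []
    for line in dmesg_text.splitlines():
        match = XID_LINE.search(line)
        if match:
            events.append(
                {
                    "timestamp": match["timestamp"].strip(),
                    "pci": match["pci"],
                    "xid": int(match["xid"]),
                }
            )
    return events
```

The resulting dicts can be pasted into a ticket as-is or serialized with `json.dumps`.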
Table 5 — temperature & power thresholds in practice
| Card | Inlet temp | Warn temp | Hard throttle | Per-card power cap |
|---|---|---|---|---|
| H100 SXM5 | <= 32°C | 85°C | 92°C | 700W |
| H100 PCIe | <= 30°C | 84°C | 91°C | 350W |
| A100 SXM4 | <= 30°C | 85°C | 92°C | 400W |
| 4090 | <= 30°C | 83°C | 88°C | 450W |
If inlet is over 35°C, talk to the data center, not NVIDIA — most "sudden performance drops" we see resolve to AC or airflow problems.
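Table 5 and the inlet rule can be folded into one status check. The following is a sketch, not a monitoring product: the numbers are copied from the table, while the `Limits` type, the status strings, and `thermal_status` are names chosen for this example. The facility check runs first because an inlet above 35°C points at the data center rather than the card.

```python
from typing import NamedTuple


class Limits(NamedTuple):
    inlet_max_c: int
    warn_c: int
    throttle_c: int
    power_cap_w: int


# Values copied from Table 5.
LIMITS = {
    "H100 SXM5": Limits(32, 85, 92, 700),
    "H100 PCIe": Limits(30, 84, 91, 350),
    "A100 SXM4": Limits(30, 85, 92, 400),
    "4090": Limits(30, 83, 88, 450),
}

FACILITY_INLET_C = 35  # above this, talk to the data center


def thermal_status(card, gpu_temp_c, inlet_temp_c):
    """Classify one reading as facility / throttle / warn / inlet_high / ok."""
    limits = LIMITS[card]
    if inlet_temp_c > FACILITY_INLET_C:
        return "facility"
    if gpu_temp_c >= limits.throttle_c:
        return "throttle"
    if gpu_temp_c >= limits.warn_c:
        return "warn"
    if inlet_temp_c > limits.inlet_max_c:
        return "inlet_high"
    return "ok"
```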
Table 6 — when to RMA
- ecc.errors.uncorrected.volatile.total non-zero, ever
- Any FAIL in dcgmi diag -r 3
- XID 64 / 79 / 119 / 120 even once
- Cumulative HBM retired pages > 64
- Per-link NVLink CRC error rate > 1e-9
Hit any one → drain workloads, escalate to hardware ops. Alaya customers see these aggregated under Console → Node Health, with auto-tickets when thresholds are crossed.
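The Table 6 criteria reduce to a single function that returns every reason a GPU qualifies, which is handy text for the ticket body. This is a hedged sketch: the function name, argument names, and reason strings are illustrative, and collecting the inputs (ECC counters, diag result, XID history, retired-page count, CRC rate) is left to whatever telemetry is already in place.

```python
RMA_XIDS = {64, 79, 119, 120}
MAX_RETIRED_PAGES = 64
MAX_NVLINK_CRC_RATE = 1e-9


def rma_reasons(uncorrected_ecc, diag_r3_failed, xids_seen,
                retired_pages, nvlink_crc_rate):
    """Return the list of Table 6 criteria this GPU meets (empty = no RMA)."""
    reasons = []
    if uncorrected_ecc > 0:
        reasons.append("uncorrected ECC errors")
    if diag_r3_failed:
        reasons.append("dcgmi diag -r 3 FAIL")
    fatal = sorted(RMA_XIDS.intersection(xids_seen))
    if fatal:
        reasons.append("XID " + "/".join(str(x) for x in fatal))
    if retired_pages > MAX_RETIRED_PAGES:
        reasons.append("retired pages > 64")
    if nvlink_crc_rate > MAX_NVLINK_CRC_RATE:
        reasons.append("NVLink CRC error rate > 1e-9")
    return reasons
```

Any non-empty result means drain and escalate, matching the "hit any one" rule.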