Alaya NeW Cloud

GPU Monitoring & XID triage — six tables for on-call

What to read in nvidia-smi vs DCGM, what XID 79 / 31 / 119 actually mean, and when to RMA. A pocket reference for SREs and escalation engineers.

Context

Across customer fleets running H100 / A100 / 4090, we built a "what to look at first when something goes wrong" cheatsheet. The six tables below cover the must-watch metrics, the must-install tools, the most common XID errors, temperature and power thresholds, and when to RMA. Print and pin to the wall.

Table 1 — eight nvidia-smi fields that matter

nvidia-smi --query-gpu=index,name,temperature.gpu,utilization.gpu,memory.used,memory.total,power.draw,ecc.errors.uncorrected.volatile.total --format=csv
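| Field | Healthy range | Meaning |
|---|---|---|
| temperature.gpu | H100 < 85°C / A100 < 88°C | Sustained breach triggers thermal throttling |
| utilization.gpu | training >= 90% / inference workload-dependent | Training stuck < 70% = data or scheduling bottleneck |
| memory.used | up to 95% of total | Leave 1–2 GB for NCCL buffers |
| power.draw | H100 SXM <= 700W | Above this, the BMC will pull voltage down |
| ecc.errors.uncorrected.volatile.total | must be 0 | Non-zero = drain workloads immediately |
| pstate | P0 in training | P8 is the usual idle state; GeForce cards such as the 4090 often sit at P2 under CUDA load |

pstate is not part of the query above; append it to --query-gpu to read it.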
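
As a rough illustration, the checks in this table can be scripted against the CSV output. The sketch below assumes the query above is run with `--format=csv,noheader,nounits` so that values arrive without units. The function name `check_gpu_row`, the sample row, and the H100 SXM limits used as defaults are illustrative assumptions, not part of any shipped tool.

```python
import csv
import io

# Field order matches the --query-gpu list shown above.
FIELDS = ["index", "name", "temperature.gpu", "utilization.gpu",
          "memory.used", "memory.total", "power.draw",
          "ecc.errors.uncorrected.volatile.total"]

def check_gpu_row(line, temp_limit_c=85.0, power_limit_w=700.0):
    """Return a list of warnings for one CSV row (H100 SXM limits by default)."""
    values = next(csv.reader(io.StringIO(line), skipinitialspace=True))
    row = dict(zip(FIELDS, values))
    warnings = []
    if float(row["temperature.gpu"]) >= temp_limit_c:
        warnings.append("temperature")
    if float(row["memory.used"]) > 0.95 * float(row["memory.total"]):
        warnings.append("memory")
    if float(row["power.draw"]) > power_limit_w:
        warnings.append("power")
    # Cards without ECC may report a non-numeric placeholder; skip those.
    ecc = row["ecc.errors.uncorrected.volatile.total"]
    if ecc.isdigit() and int(ecc) > 0:
        warnings.append("ecc: drain workloads")
    return warnings

# Hypothetical sample row, not real fleet output.
sample = "0, NVIDIA H100 80GB HBM3, 87, 93, 79000, 81559, 640.5, 0"
print(check_gpu_row(sample))  # ['temperature', 'memory']
```

Pass the A100 or 4090 limits as arguments when checking other cards.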

Table 2 — metrics only DCGM can give you

nvidia-smi samples at 1 Hz, too coarse for production. Run dcgm-exporter and scrape these into Prometheus:

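| Metric | Use |
|---|---|
| DCGM_FI_PROF_SM_ACTIVE | True SM activity (utilization.gpu only reports the share of time any kernel was running) |
| DCGM_FI_PROF_PIPE_TENSOR_ACTIVE | Tensor Core utilization. < 30% = your op isn't going through TC |
| DCGM_FI_PROF_DRAM_ACTIVE | HBM bandwidth utilization |
| DCGM_FI_PROF_NVLINK_TX_BYTES / DCGM_FI_PROF_NVLINK_RX_BYTES | NVLink traffic per direction. On H100 the 900 GB/s NVLink figure is the bidirectional total, so one direction tops out near 450 GB/s |
| DCGM_FI_DEV_XID_ERRORS | Value of the most recent XID error (a code, not a rate); alert when it changes |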

A common bad picture: GPU util 90% + Tensor Core 25% — you're running an op in fp32 that should be fp16.
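
The rule of thumb above can be written down as a small check. This is a minimal sketch, assuming `utilization.gpu` arrives in percent and `DCGM_FI_PROF_PIPE_TENSOR_ACTIVE` arrives as a ratio between 0.0 and 1.0; the function name `diagnose` and the returned strings are hypothetical.

```python
def diagnose(gpu_util_pct, tensor_active, training=True):
    """Map two readings to a first hint, using the thresholds from this page.

    gpu_util_pct:  utilization.gpu in percent (0-100).
    tensor_active: DCGM_FI_PROF_PIPE_TENSOR_ACTIVE as a ratio (0.0-1.0).
    """
    # Table 1: training stuck under 70% points at data or scheduling.
    if training and gpu_util_pct < 70:
        return "data or scheduling bottleneck"
    # Busy GPU but idle Tensor Cores: the op is not going through TC.
    if gpu_util_pct >= 90 and tensor_active < 0.30:
        return "ops bypass Tensor Cores; check for fp32 where fp16 would do"
    return "no obvious bottleneck from these two metrics"

print(diagnose(90, 0.25))
print(diagnose(55, 0.40))
```

The same comparison could live in a Prometheus alert rule; it is shown in Python only to keep the logic easy to read.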

Table 3 — common XID errors and what to do

| XID | Meaning | First-line action |
|---|---|---|
| 13 | Graphics Engine Exception | Driver bug or out-of-bounds access in the app; restart process. Repeated → downgrade driver |
| 31 | MMU fault, address out of range | Usually user-code CUDA out-of-bounds access, not hardware |
| 43 | GPU stopped processing | Process crashed or OOM; restart container |
| 63 | ECC page retirement | HBM correctable error, page now masked. Log and watch |
| 64 | ECC page retirement failure | Retirement failed → drain the node |
| 74 | NVLink error | Check nvidia-smi nvlink -e; multi-GPU at once → reboot node |
| 79 | GPU fallen off the bus | Hardware fault; if reboot doesn't help, RMA |
| 119 / 120 | GSP RPC timeout / GSP error | Treated as unrecoverable; drain, RMA |

Our policy in the field: 13/31/43 auto-retry once; 63 with running count > 5 → drain & flag; 64/79/119/120 → cordon node + escalate to hardware ops.
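
The policy above amounts to a lookup. Here is a minimal sketch of it; the `Action` enum, the function name `action_for_xid`, and the fallback for XIDs the policy does not mention (such as 74) are assumptions added for illustration.

```python
from enum import Enum

class Action(Enum):
    RETRY_ONCE = "auto-retry once"
    WATCH = "log and watch"
    DRAIN_AND_FLAG = "drain & flag"
    CORDON_AND_ESCALATE = "cordon node + escalate to hardware ops"
    MANUAL_TRIAGE = "not covered by policy; triage manually"  # assumed fallback

RETRY_XIDS = {13, 31, 43}
ESCALATE_XIDS = {64, 79, 119, 120}

def action_for_xid(xid, xid63_running_count=0):
    """Return the field-policy action for one XID event."""
    if xid in RETRY_XIDS:
        return Action.RETRY_ONCE
    if xid == 63:
        # Page retirements are tolerated until the running count exceeds 5.
        if xid63_running_count > 5:
            return Action.DRAIN_AND_FLAG
        return Action.WATCH
    if xid in ESCALATE_XIDS:
        return Action.CORDON_AND_ESCALATE
    return Action.MANUAL_TRIAGE

print(action_for_xid(79).value)
print(action_for_xid(63, xid63_running_count=6).value)
```

Keeping the XID sets as module-level constants makes the policy easy to review and change without touching the control flow.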

Table 4 — where the logs live

# Kernel
dmesg -T | grep -i -E "nvrm|nvidia|xid"
journalctl -k --since "1 hour ago" | grep -i nvidia

# Userspace (persistent)
/var/log/nvidia-installer.log
/var/log/syslog (Debian/Ubuntu) or /var/log/messages (RHEL family)

# DCGM health checks
dcgmi diag -r 3                 # ~30-min Lv3 sweep
dcgmi diag -r 1 -j > diag.json  # 5-min quick check

Always copy the timestamp of any XID into the ticket; without it, RMA cases can stall for months.
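
To make the timestamp hard to forget, the XID line can be parsed straight out of `dmesg -T` output. The sketch below assumes the bracketed timestamp that `dmesg -T` prints and the usual `NVRM: Xid (PCI:...): <code>, ...` layout; `journalctl` output uses a different prefix and would need its own pattern. The sample line and the function name `parse_xid_line` are hypothetical.

```python
import re

# Targets dmesg -T lines of the form:
#   [Tue Mar  5 14:02:11 2024] NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, ...
XID_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+NVRM: Xid \((?P<pci>PCI:[0-9a-fA-F:.]+)\): (?P<xid>\d+)"
)

def parse_xid_line(line):
    """Return timestamp, PCI address, and XID code, or None if no match."""
    match = XID_RE.search(line)
    if match is None:
        return None
    return {
        "timestamp": match.group("ts"),
        "pci": match.group("pci"),
        "xid": int(match.group("xid")),
    }

# Illustrative line, not captured from a real node.
sample = ("[Tue Mar  5 14:02:11 2024] NVRM: Xid (PCI:0000:3b:00): 79, "
          "pid=1234, GPU has fallen off the bus.")
print(parse_xid_line(sample))
```

The returned dictionary can be pasted into the ticket as is, which keeps the timestamp and PCI address together with the XID code.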

Table 5 — temperature & power thresholds in practice

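| Card | Inlet temp | Warn temp | Hard throttle | Per-card power cap |
|---|---|---|---|---|
| H100 SXM5 | <= 32°C | 85°C | 92°C | 700W |
| H100 PCIe | <= 30°C | 84°C | 91°C | 350W |
| A100 SXM4 | <= 30°C | 85°C | 92°C | 400W |
| 4090 | <= 30°C | 83°C | 88°C | 450W |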

If inlet is over 35°C, talk to the data center, not NVIDIA — most "sudden performance drops" we see resolve to AC or airflow problems.
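
One way to apply these thresholds is a per-card lookup that checks the inlet first, since a hot inlet is a facility problem rather than a GPU problem. This is a sketch only: the dictionary copies the table above, while the function name `classify_thermal`, the returned labels, and the order of the checks are assumptions.

```python
# Values copied from Table 5 (degrees Celsius).
THRESHOLDS = {
    "H100 SXM5": {"inlet_max": 32, "warn": 85, "throttle": 92},
    "H100 PCIe": {"inlet_max": 30, "warn": 84, "throttle": 91},
    "A100 SXM4": {"inlet_max": 30, "warn": 85, "throttle": 92},
    "4090":      {"inlet_max": 30, "warn": 83, "throttle": 88},
}

def classify_thermal(card, gpu_temp_c, inlet_temp_c):
    """Return a coarse thermal label for one card."""
    limits = THRESHOLDS[card]
    # Over 35 C at the inlet is a data center issue, whatever the GPU reads.
    if inlet_temp_c > 35:
        return "facility: talk to the data center"
    if gpu_temp_c >= limits["throttle"]:
        return "hard throttle"
    if gpu_temp_c >= limits["warn"]:
        return "warn"
    if inlet_temp_c > limits["inlet_max"]:
        return "inlet above target"
    return "ok"

print(classify_thermal("H100 SXM5", 86, 28))
print(classify_thermal("4090", 70, 36))
```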

Table 6 — when to RMA

  • ecc.errors.uncorrected.volatile.total non-zero, ever
  • Any FAIL in dcgmi diag -r 3
  • XID 64 / 79 / 119 / 120 even once
  • Cumulative HBM retired pages > 64
  • Per-link NVLink CRC error rate > 1e-9

Hit any one → drain workloads, escalate to hardware ops. Alaya customers see these aggregated under Console → Node Health, with auto-tickets when thresholds are crossed.
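
The "hit any one" rule can be sketched as a function that returns every criterion a GPU has tripped. The `GpuHealth` record and its field names are hypothetical; how each value is collected (nvidia-smi, dcgmi, log parsing) is left out on purpose.

```python
from dataclasses import dataclass, field

RMA_XIDS = {64, 79, 119, 120}

@dataclass
class GpuHealth:
    """Hypothetical per-GPU summary; field names are illustrative."""
    uncorrected_ecc_total: int = 0
    diag_r3_failed: bool = False
    xids_seen: set = field(default_factory=set)
    retired_pages: int = 0
    nvlink_crc_error_rates: list = field(default_factory=list)  # one per link

def rma_reasons(health):
    """Return the list of Table 6 criteria this GPU meets (empty = no RMA)."""
    reasons = []
    if health.uncorrected_ecc_total > 0:
        reasons.append("uncorrected ECC errors")
    if health.diag_r3_failed:
        reasons.append("dcgmi diag -r 3 FAIL")
    if health.xids_seen & RMA_XIDS:
        reasons.append("XID 64/79/119/120 seen")
    if health.retired_pages > 64:
        reasons.append("retired pages > 64")
    if any(rate > 1e-9 for rate in health.nvlink_crc_error_rates):
        reasons.append("NVLink CRC error rate > 1e-9")
    return reasons

gpu = GpuHealth(xids_seen={31, 79}, retired_pages=70)
print(rma_reasons(gpu))
```

Returning the reasons rather than a bare boolean gives the hardware ops ticket something concrete to quote.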
