A practitioner's guide to AI compute selection
From workload tiering and the accelerator quadrant to a five-axis vendor scorecard: moving from "TFLOPS per dollar" to "TCO per workload". A condensed version of the ~50-page guide.
Why compute selection became a CXO problem
A decade ago buying GPUs looked like buying servers: count TFLOPS, compare prices, sign a contract. But once model sizes jumped from 7B to 671B and inference token volumes from millions to tens of billions, compute spend stopped being one line in the IT budget — it became a strategic operating variable that shapes cash flow, agility and compliance posture. This is a condensed version of the Alaya Intelligence Research "Practitioner's Guide to AI Compute Selection", with a decision framework you can lift directly.
Four guiding principles
- Demand-driven: first answer "what business is this compute serving" — training / inference / experimentation — then pick a SKU. The most common waste is provisioning everything at training-grade specs, forcing inference and experiments to carry that cost.
- Cost-effectiveness: compare cost per unit of useful compute, not price per card. An H200 is ~30% more expensive than an A100, but it delivers 2×+ the useful throughput on 70B training (peak FLOPS × MFU), so TCO for the same job ends up lower.
- Long-term fit: hardware lifecycle is 4–5 years, while the model architecture cycle has compressed to 6–12 months. Selection must leave escape windows for swapping the software stack, the scheduler and the models.
- Security & compliance: export controls, data classification, model registration, cross-border transfer. Compute is no longer "plug in and run" — it has to stay legal to run.
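The cost-effectiveness principle above can be made concrete with a small calculation. This is a minimal sketch, not the guide's own formula: the `Accelerator` fields, the function name and every number are hypothetical placeholders, and "useful compute" is assumed to mean peak FLOPS multiplied by measured MFU.

```python
from dataclasses import dataclass


@dataclass
class Accelerator:
    name: str
    hourly_cost: float   # all-in cost per card-hour (placeholder currency units)
    peak_tflops: float   # vendor peak for the precision actually used
    mfu: float           # measured model FLOPs utilization, between 0 and 1


def cost_per_useful_pflop_hour(acc: Accelerator) -> float:
    """Cost of one PFLOP-hour that actually reaches the model."""
    if not 0.0 < acc.mfu <= 1.0:
        raise ValueError("mfu must be in (0, 1]")
    useful_pflops = acc.peak_tflops * acc.mfu / 1000.0
    return acc.hourly_cost / useful_pflops


# Placeholder figures for illustration only, not benchmarks of real cards.
card_a = Accelerator("card_a", hourly_cost=2.0, peak_tflops=300.0, mfu=0.30)
card_b = Accelerator("card_b", hourly_cost=2.6, peak_tflops=900.0, mfu=0.40)

for card in (card_a, card_b):
    print(f"{card.name}: {cost_per_useful_pflop_hour(card):.2f} per useful PFLOP-hour")
```

With these made-up inputs the card that is 30% more expensive per hour still comes out cheaper per useful PFLOP-hour, which is the comparison the principle asks for. Measured MFU on your own workload should replace the placeholders.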
The four-step process
- Demand assessment: split workloads into steady-state (core model inference, production RAG), variable (seasonal training, event-driven inference spikes) and exploratory (algorithm trials, data prep). Each has different requirements for cost predictability, SLA and interruptibility.
- Resource shortlist: match accelerator + network + storage to the workload. Training pool — NVLink + IB. Inference pool — PCIe + RoCE. Experiment pool — consumer-grade GPUs on plain Ethernet are fine.
- Deployment model: self-build / long-term lease / public cloud on-demand / pooled compute cloud — these aren't alternatives, they're layers. Mix them by workload class.
- Five-axis vendor scoring: performance (measured MFU), reliability (annual failure rate / SLA refunds), cost (including hidden OPEX), service (response time, expert hours) and compliance (qualifications, data sovereignty, exit clauses).
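The five-axis scoring step could be implemented as a simple weighted scorecard. The sketch below is one possible shape, under stated assumptions: the axis names come from the list above, but the 0 to 10 scale, the weights and the vendor scores are hypothetical and would need to be set per workload class.

```python
AXES = ("performance", "reliability", "cost", "service", "compliance")


def score_vendor(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of per-axis scores; weights need not sum to 1."""
    missing = [a for a in AXES if a not in scores or a not in weights]
    if missing:
        raise ValueError(f"missing axes: {missing}")
    total_weight = sum(weights[a] for a in AXES)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[a] * weights[a] for a in AXES) / total_weight


# Hypothetical weights and scores on an assumed 0-10 scale.
weights = {"performance": 0.30, "reliability": 0.20, "cost": 0.25,
           "service": 0.10, "compliance": 0.15}
vendor_x = {"performance": 8, "reliability": 7, "cost": 6,
            "service": 9, "compliance": 5}
vendor_y = {"performance": 6, "reliability": 8, "cost": 8,
            "service": 6, "compliance": 9}

ranked = sorted({"vendor_x": vendor_x, "vendor_y": vendor_y}.items(),
                key=lambda kv: score_vendor(kv[1], weights), reverse=True)
for name, s in ranked:
    print(name, round(score_vendor(s, weights), 2))
```

A weighted mean lets a strong axis compensate for a weak one. If compliance is a hard requirement, it may be better treated as a pass/fail gate before scoring than as one weight among five.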
The accelerator quadrant
- Large-scale training: NVIDIA H200 / H800 / B200 with NVLink Switch + IB 400G. NVLink bandwidth and HBM capacity decide whether your model even fits.
- High-throughput inference: L40S / H20 / domestic inference cards. Watch memory bandwidth and tokens-per-watt, not peak TFLOPS.
- Edge / device: Jetson / Atlas / Ascend inference. Power budget, thermal envelope and ecosystem maturity matter more than raw flops.
- Domestic alternatives: production-ready in specific scenarios, but budget 1–3 engineering months for software-stack adaptation. Don't plan on a 1:1 drop-in replacement for NVIDIA.
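To illustrate why HBM capacity decides whether a model fits, here is a rough back-of-the-envelope sketch. It assumes mixed-precision training with Adam at roughly 16 bytes per parameter (fp16 weights and gradients plus fp32 master weights and two optimizer moments) and ignores activations, buffers and fragmentation, so real requirements are higher. The function names and the 80 GB card size are illustrative assumptions.

```python
import math

# Assumption: mixed-precision Adam keeps ~16 bytes of state per parameter:
# 2 (fp16 weights) + 2 (fp16 grads) + 12 (fp32 master weights, momentum, variance).
BYTES_PER_PARAM_TRAINING = 16
BYTES_PER_PARAM_FP16_INFERENCE = 2


def training_state_gb(n_params: float) -> float:
    """Weights + gradients + optimizer state in GB (activations excluded)."""
    return n_params * BYTES_PER_PARAM_TRAINING / 1e9


def inference_weights_gb(n_params: float) -> float:
    """fp16 weights only, in GB (KV cache excluded)."""
    return n_params * BYTES_PER_PARAM_FP16_INFERENCE / 1e9


def min_cards(n_params: float, hbm_gb_per_card: float,
              usable_fraction: float = 1.0) -> int:
    """Lower bound on cards needed just to hold the sharded training state."""
    usable = hbm_gb_per_card * usable_fraction
    return math.ceil(training_state_gb(n_params) / usable)


n = 70e9  # a 70B-parameter model
print(f"training state: {training_state_gb(n):.0f} GB")
print(f"fp16 inference weights: {inference_weights_gb(n):.0f} GB")
print(f"min cards at 80 GB each: {min_cards(n, 80)}")
```

This is a lower bound under perfect sharding. It shows why training and inference land in different quadrants: the same model needs roughly eight times more memory to train than to serve in fp16, before activations or KV cache.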
Don't let network or storage become the bottleneck
In training runs on thousands of GPUs, AllReduce can consume around 30% of wall-clock time. Topology (Fat-Tree / Dragonfly+ / Rail-Optimized) and RDMA flavour (IB vs RoCEv2) largely determine the MFU you actually achieve. For storage: training pools want GPUDirect Storage + parallel filesystems (Lustre / GPFS) with object-store tiering for cold data; inference pools live or die by KV-cache and weight locality on local NVMe.
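A first-order model helps show where the AllReduce share comes from. The sketch below uses the standard bandwidth term for a ring all-reduce, in which each rank moves 2(N-1)/N times the payload, and ignores latency, topology and congestion. The payload size, link speed, compute time and overlap are hypothetical inputs, not measurements.

```python
def ring_allreduce_seconds(payload_bytes: float, n_ranks: int,
                           link_bytes_per_s: float) -> float:
    """Bandwidth-only estimate: each rank sends 2*(N-1)/N of the payload."""
    if n_ranks < 2:
        return 0.0
    return 2 * (n_ranks - 1) / n_ranks * payload_bytes / link_bytes_per_s


def comm_fraction(compute_s: float, comm_s: float, overlap: float = 0.0) -> float:
    """Share of step time spent on communication that compute does not hide."""
    if not 0.0 <= overlap <= 1.0:
        raise ValueError("overlap must be in [0, 1]")
    exposed = comm_s * (1.0 - overlap)
    return exposed / (compute_s + exposed)


# Hypothetical step: 14 GB of fp16 gradients, 1024 ranks, 400 Gb/s (50 GB/s) links.
comm = ring_allreduce_seconds(14e9, 1024, 50e9)
for overlap in (0.0, 0.5, 0.9):
    share = comm_fraction(compute_s=1.5, comm_s=comm, overlap=overlap)
    print(f"overlap={overlap:.1f}: exposed comm share = {share:.1%}")
```

The point of the sketch is the sensitivity, not the absolute numbers: halving effective link bandwidth doubles the communication term, and the exposed share depends heavily on how much of it the framework can overlap with compute.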
Six trends shaping 2026
- Chiplet + CPO (co-packaged optics) doubles per-card interconnect bandwidth; 1.6T fabrics ship.
- Hardware/software co-design: model architecture pulls chip design; FP8 / FP4 go mainstream.
- Agentic AI shifts demand from batch training to continuous inference + tool calls — scheduling gets harder.
- Carbon-per-compute joins price as a procurement axis; "green compute" matters.
- Unified compilers (UAI-MLIR & co.) finally let one codebase target heterogeneous backends.
- Inference becomes commodity: token prices in fractions of a cent unlock a new application layer.
Five case studies, in one line each
- Lab pool: turning H800s into a cross-team shared pool with quotas lifted utilization 31% → 74% — same budget, twice the research groups.
- Startup app: avatar-generation SaaS on token-metered cloud killed the "standby fleet" cost — gross margin 23% → 51%.
- Student LoRA: a 4090 on spot is enough for 7B fine-tuning. Data quality and hyperparameters matter more than card class.
- Complex-valued NN: a niche research architecture shipped faster on domestic accelerators because vendor support was deeper.
- Embodied VLA: hundred-card H800 elastic pool for training, edge + domestic inference for deployment — layering is the win.
Three actions for practitioners
- Split workloads into steady / variable / exploratory first. Pick supply per class. Don't spec everything at the strictest tier.
- Upgrade vendor evaluation from "list price" to "TCO + exit clause". The ability to switch vendor, scale down, or downgrade tier matters more than unit price.
- Bring compliance, export control and asset-disposal questions into D-1 of selection. Don't discover at D+12 months that your cards can't legally run.
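The first action, splitting workloads into steady, variable and exploratory, can start as a very simple rule. The sketch below is only one way to do it: the `Workload` fields, the `classify` function and the burst threshold of 2.0 are hypothetical, chosen to reflect the interruptibility and demand-variability criteria mentioned earlier.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    STEADY = "steady"
    VARIABLE = "variable"
    EXPLORATORY = "exploratory"


@dataclass
class Workload:
    name: str
    interruptible: bool   # can it be preempted and restarted without SLA impact?
    peak_to_mean: float   # ratio of peak demand to mean demand


def classify(w: Workload, burst_threshold: float = 2.0) -> Tier:
    """Assumed rule: interruptible -> exploratory; bursty -> variable; else steady."""
    if w.interruptible:
        return Tier.EXPLORATORY
    if w.peak_to_mean >= burst_threshold:
        return Tier.VARIABLE
    return Tier.STEADY


# Hypothetical inventory for illustration.
inventory = [
    Workload("production-rag", interruptible=False, peak_to_mean=1.2),
    Workload("campaign-inference", interruptible=False, peak_to_mean=4.0),
    Workload("algorithm-trials", interruptible=True, peak_to_mean=3.0),
]
for w in inventory:
    print(f"{w.name}: {classify(w).value}")
```

Once every workload carries a tier, supply can be chosen per tier instead of provisioning the whole fleet at the strictest specification.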
The full guide runs ~50 pages and includes the complete framework, technical deep-dives and detailed case studies. This is the executive summary — refer to the full document for the metric formulas, layered model and five-axis scorecard.
