Alaya NeW Cloud

A practitioner's guide to AI compute selection

From workload tiering and the accelerator quadrant to a five-axis vendor scorecard: upgrading from "TFLOPS per dollar" to "TCO per workload". A condensed summary of the ~50-page guide.

Why compute selection became a CXO problem

A decade ago buying GPUs looked like buying servers: count TFLOPS, compare prices, sign a contract. But once model sizes jumped from 7B to 671B and inference token volumes from millions to tens of billions, compute spend stopped being one line in the IT budget — it became a strategic operating variable that shapes cash flow, agility and compliance posture. This is a condensed version of the Alaya Intelligence Research "Practitioner's Guide to AI Compute Selection", with a decision framework you can lift directly.

Four guiding principles

  • Demand-driven: first answer "what business is this compute serving" — training / inference / experimentation — then pick a SKU. The most common waste is provisioning everything at training-grade specs, forcing inference and experiments to carry that cost.
  • Cost-effectiveness: what matters is cost per unit of useful compute, not price per card. An H200 is ~30% more expensive than an A100, but its MFU (model FLOPs utilization) on 70B training is 2×+ that of the A100, so TCO ends up lower.
  • Long-term fit: hardware lives 4–5 years, while the model-architecture cycle has compressed to 6–12 months. Selection must leave room to swap the software stack, the scheduler and the models themselves.
  • Security & compliance: export controls, data classification, model registration, cross-border transfer. Compute is no longer "plug in and run" — it has to stay legal to run.
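
The cost-effectiveness principle reduces to a one-line ratio. The sketch below is a minimal illustration, not the guide's own formula: it reads the "2×+ MFU" figure as a 2× ratio of useful training throughput, which is an assumption, and the function name is hypothetical.

```python
def relative_cost_per_useful_compute(price_ratio: float,
                                     useful_throughput_ratio: float) -> float:
    """Cost per unit of useful work for card B, relative to card A.

    price_ratio: price of B / price of A (1.30 means "30% more expensive").
    useful_throughput_ratio: useful work per hour of B / that of A.
    A result below 1.0 means B is cheaper per unit of useful compute.
    """
    if price_ratio <= 0 or useful_throughput_ratio <= 0:
        raise ValueError("ratios must be positive")
    return price_ratio / useful_throughput_ratio


# Figures from the text: H200 ~30% pricier than A100.
# Assumption: "2x+ MFU" is treated as 2x useful throughput.
h200_vs_a100 = relative_cost_per_useful_compute(1.30, 2.0)
print(f"H200 cost per useful compute vs A100: {h200_vs_a100:.2f}")
```

Under these assumptions the more expensive card costs about 0.65× as much per unit of useful work. The break-even point is simply a throughput ratio equal to the price ratio.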

The four-step process

  1. Demand assessment: split workloads into steady-state (core model inference, production RAG), variable (seasonal training, event-driven inference spikes) and exploratory (algorithm trials, data prep). Each has different requirements for cost predictability, SLA and interruptibility.
  2. Resource shortlist: match accelerator + network + storage to the workload. Training pool — NVLink + IB. Inference pool — PCIe + RoCE. Experiment pool — consumer-grade GPUs on plain Ethernet are fine.
  3. Deployment model: self-build / long-term lease / public cloud on-demand / pooled compute cloud — these aren't alternatives, they're layers. Mix them by workload class.
  4. Five-axis vendor scoring: performance (measured MFU), reliability (annual failure rate / SLA refunds), cost (including hidden OPEX), service (response time, expert hours) and compliance (qualifications, data sovereignty, exit clauses).
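
Step 4 can be sketched as a weighted scorecard. The five axes come from the text; the weights, the 0–5 scale, the compliance floor and the two vendors are all hypothetical placeholders, and the full guide's scorecard may be structured differently.

```python
AXES = ("performance", "reliability", "cost", "service", "compliance")

# Hypothetical weights (must sum to 1.0); tune them per workload class.
WEIGHTS = {
    "performance": 0.30,
    "reliability": 0.20,
    "cost": 0.25,
    "service": 0.10,
    "compliance": 0.15,
}


def score_vendor(scores: dict, weights: dict = WEIGHTS,
                 compliance_floor: float = 3.0) -> float:
    """Weighted score on a 0-5 scale.

    Assumption: compliance acts as a gate, so a vendor below the floor
    scores 0 no matter how well it does on the other axes.
    """
    missing = [axis for axis in AXES if axis not in scores]
    if missing:
        raise ValueError(f"missing axes: {missing}")
    if abs(sum(weights[axis] for axis in AXES) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    if scores["compliance"] < compliance_floor:
        return 0.0
    return sum(weights[axis] * scores[axis] for axis in AXES)


# Hypothetical vendors, scored 0-5 on each axis.
vendor_a = {"performance": 4, "reliability": 4, "cost": 3,
            "service": 3, "compliance": 5}
vendor_b = {"performance": 5, "reliability": 4, "cost": 5,
            "service": 4, "compliance": 2}

score_a = score_vendor(vendor_a)
score_b = score_vendor(vendor_b)
print(f"vendor_a={score_a:.2f} vendor_b={score_b:.2f}")
```

Treating compliance as a gate rather than one more weighted term is a design choice made here for illustration: it keeps a cheap, fast vendor from winning on points when its cards may not be legal to run.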

The accelerator quadrant

  • Large-scale training: NVIDIA H200 / H800 / B200 with NVLink Switch + IB 400G. NVLink bandwidth and HBM capacity decide whether your model even fits.
  • High-throughput inference: L40S / H20 / domestic inference cards. Watch memory bandwidth and tokens-per-watt, not peak TFLOPS.
  • Edge / device: Jetson / Atlas / Ascend inference. Power budget, thermal envelope and ecosystem maturity matter more than raw flops.
  • Domestic alternatives: production-ready in specific scenarios, but budget 1–3 engineering months for software-stack adaptation. Don't plan as a 1:1 NVIDIA drop-in.
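
Two back-of-envelope checks make the quadrant concrete: whether training state fits in HBM, and how memory bandwidth caps decode speed. This is a rough sketch under stated assumptions: 16 bytes per parameter is the common rule of thumb for mixed-precision Adam, activations and KV-cache are ignored, the 80% usable-memory fraction is a guess, and all function names are hypothetical.

```python
import math


def training_state_gb(params_billion: float,
                      bytes_per_param: float = 16.0) -> float:
    """Weights + gradients + optimizer state in GB, excluding activations.

    16 bytes/param = 2 (fp16 weights) + 2 (fp16 grads)
                   + 12 (fp32 master weights, Adam momentum and variance).
    """
    return params_billion * bytes_per_param  # 1e9 params * bytes / 1e9


def min_gpus_to_hold_state(params_billion: float, hbm_gb: float,
                           usable_fraction: float = 0.8) -> int:
    """Fewest GPUs whose combined usable HBM holds the training state."""
    usable_gb = hbm_gb * usable_fraction
    return math.ceil(training_state_gb(params_billion) / usable_gb)


def decode_tokens_per_s_bound(params_billion: float, bytes_per_weight: float,
                              mem_bandwidth_gb_s: float) -> float:
    """Upper bound on batch-1 decode speed: every generated token
    streams all weights from memory once."""
    weights_gb = params_billion * bytes_per_weight
    return mem_bandwidth_gb_s / weights_gb


# HBM capacities per published specs: H800 80 GB, H200 141 GB.
gpus_h800 = min_gpus_to_hold_state(70, hbm_gb=80)
gpus_h200 = min_gpus_to_hold_state(70, hbm_gb=141)

# 7B model in fp16 on an L40S (~864 GB/s published memory bandwidth).
l40s_bound = decode_tokens_per_s_bound(7, 2, 864)

print(gpus_h800, gpus_h200, round(l40s_bound, 1))
```

The GPU counts are lower bounds for sharded training state only; real jobs need headroom for activations and communication buffers. The decode bound shows why peak TFLOPS says little about single-stream inference speed.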

Don't let network or storage become the bottleneck

In kilo-GPU training, 30% of wall time often goes to AllReduce. Topology (Fat-Tree / Dragonfly+ / Rail-Optimized) and RDMA flavour (IB vs RoCEv2) largely determine MFU. For storage: training pools want GPUDirect + parallel filesystems (Lustre / GPFS) with object-store tiering for cold data; inference pools live or die by KV-cache and weight locality on local NVMe.
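
To see how a figure like 30% can arise, the sketch below uses the standard bandwidth-only cost model for ring AllReduce, in which each GPU moves 2(N-1)/N times the message size. Latency, topology effects and hierarchical collectives are ignored, and the message size, compute time and overlap values are hypothetical.

```python
def ring_allreduce_seconds(message_bytes: float, n_gpus: int,
                           link_gbytes_per_s: float) -> float:
    """Bandwidth-only estimate of one ring AllReduce.

    Each GPU sends and receives 2 * (N - 1) / N * message_bytes in total.
    """
    if n_gpus < 2:
        return 0.0
    traffic_bytes = 2 * (n_gpus - 1) / n_gpus * message_bytes
    return traffic_bytes / (link_gbytes_per_s * 1e9)


def exposed_comm_fraction(compute_s: float, comm_s: float,
                          overlap: float = 0.0) -> float:
    """Share of step time spent waiting on communication.

    overlap is the fraction of communication hidden behind compute (0..1).
    """
    exposed_s = comm_s * (1.0 - overlap)
    return exposed_s / (compute_s + exposed_s)


# Hypothetical step: 10 GB of gradients, 1024 GPUs, 400 Gb/s IB = 50 GB/s.
comm_s = ring_allreduce_seconds(10e9, 1024, 50.0)

# Hypothetical 0.93 s of pure compute per step, with and without overlap.
frac_no_overlap = exposed_comm_fraction(0.93, comm_s)
frac_half_overlap = exposed_comm_fraction(0.93, comm_s, overlap=0.5)

print(f"{comm_s:.3f}s  {frac_no_overlap:.0%}  {frac_half_overlap:.0%}")
```

With these illustrative numbers the unoverlapped case lands near 30% of wall time, and hiding half of the communication behind compute brings it to roughly 18%. That is why link bandwidth and overlap both show up directly in MFU.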

Six trends to watch

  • Chiplet + CPO (co-packaged optics) doubles per-card interconnect bandwidth; 1.6T fabrics ship.
  • Hardware/software co-design: model architecture pulls chip design; FP8 / FP4 go mainstream.
  • Agentic AI shifts demand from batch training to continuous inference + tool calls — scheduling gets harder.
  • Carbon-per-compute joins price as a procurement axis; "green compute" matters.
  • Unified compilers (UAI-MLIR & co.) finally let one codebase target heterogeneous backends.
  • Inference becomes commodity: token prices in fractions of a cent unlock a new application layer.

Five case studies, in one line each

  • Lab pool: pooling H800s across teams under quotas lifted utilization from 31% to 74%: same budget, twice the research groups.
  • Startup app: avatar-generation SaaS on token-metered cloud killed the "standby fleet" cost — gross margin 23% → 51%.
  • Student LoRA: a single RTX 4090 on a spot instance is enough for 7B fine-tuning. Data quality and hyperparameters matter more than card class.
  • Complex-valued NN: a niche research architecture shipped faster on domestic accelerators because vendor support was deeper.
  • Embodied VLA: hundred-card H800 elastic pool for training, edge + domestic inference for deployment — layering is the win.
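
The first two cases are easy to sanity-check with arithmetic. The sketch below only restates the percentages quoted above; the normalized hourly cost and the assumption that revenue stayed constant in the startup case are illustrative, not from the source.

```python
def cost_per_utilized_gpu_hour(hourly_cost: float, utilization: float) -> float:
    """Effective cost of one GPU-hour that actually does work."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return hourly_cost / utilization


# Lab pool: utilization 31% -> 74%, hourly cost normalized to 1.0.
before = cost_per_utilized_gpu_hour(1.0, 0.31)
after = cost_per_utilized_gpu_hour(1.0, 0.74)
useful_hours_gain = before / after  # useful GPU-hours per unit of budget

# Startup app: gross margin 23% -> 51%.
# Assumption: revenue unchanged, so cost of revenue is (1 - margin).
cost_share_before = 1 - 0.23
cost_share_after = 1 - 0.51
cost_reduction = 1 - cost_share_after / cost_share_before

print(f"{useful_hours_gain:.2f}x useful hours, "
      f"{cost_reduction:.0%} lower cost of revenue")
```

The utilization jump yields roughly 2.4× the useful GPU-hours for the same spend, which is consistent with serving twice the research groups. The margin change corresponds to cutting cost of revenue by about 36%.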

Three actions for practitioners

  1. Split workloads into steady / variable / exploratory first. Pick supply per class. Don't spec everything at the strictest tier.
  2. Upgrade vendor evaluation from "list price" to "TCO + exit clause". The ability to switch vendor, scale down, or downgrade tier matters more than unit price.
  3. Bring compliance, export control and asset-disposal questions into day one of selection. Don't discover twelve months in that your cards can't legally run.
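
Action 2 can be illustrated by pricing the exit clause directly. The model below is a deliberately simple sketch: both offers, the fees, the 30% chance of needing to exit and the exit month are hypothetical, and hidden OPEX is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    name: str
    monthly_fee: float        # committed fee per month
    term_months: int          # contract length
    exit_penalty_months: int  # fee-months owed on early exit


def cost_if_exit(offer: Offer, exit_month: int) -> float:
    """Total paid when leaving after exit_month months of the term."""
    if not 0 <= exit_month <= offer.term_months:
        raise ValueError("exit_month must fall within the term")
    remaining = offer.term_months - exit_month
    penalty_months = min(remaining, offer.exit_penalty_months)
    return offer.monthly_fee * (exit_month + penalty_months)


def expected_cost(offer: Offer, p_exit: float, exit_month: int) -> float:
    """Probability-weighted cost: stay the full term, or exit early."""
    full_term = offer.monthly_fee * offer.term_months
    return (1 - p_exit) * full_term + p_exit * cost_if_exit(offer, exit_month)


# Hypothetical offers: the locked one owes the whole remaining term on exit.
locked = Offer("locked", monthly_fee=100.0, term_months=36,
               exit_penalty_months=36)
flexible = Offer("flexible", monthly_fee=115.0, term_months=36,
                 exit_penalty_months=3)

locked_cost = expected_cost(locked, p_exit=0.3, exit_month=12)
flexible_cost = expected_cost(flexible, p_exit=0.3, exit_month=12)
print(f"locked={locked_cost:.1f} flexible={flexible_cost:.1f}")
```

In this example the flexible offer is 15% more expensive per month yet cheaper in expectation once a 30% chance of exiting at month 12 is priced in. The crossover depends entirely on the assumed exit probability, so it is worth running with your own estimates.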

The full guide runs ~50 pages and includes the complete framework, technical deep-dives and detailed case studies. This is the executive summary — refer to the full document for the metric formulas, layered model and five-axis scorecard.

