Pushing vLLM to 4500 tokens/s on H800A

A single 8×H800A node serving quantized Qwen3-72B-Instruct: end-to-end notes on paged attention, continuous batching, and KV-cache hit-rate tuning across the full serving stack.

Background

A finance customer needed to host Qwen3-72B-Instruct on their own H800A nodes for an internal Q&A bot. Targets: P95 time to first token (TTFT) ≤ 800 ms at a sustained 200 RPS. We tuned a standard Alaya 8×H800A node (NVLink-connected, NVMe-cached weights).

Baseline

Stock vLLM 0.6.3 with 4-bit AWQ weights. Weights occupy ~38 GB per card, leaving ~40 GB for the KV cache. The only flag set was --tensor-parallel-size 8. Results on the ShareGPT-4k offline benchmark:

Throughput: 1820 tokens/s
P95 TTFT: 1.4s
GPU utilization: ~62%
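
For readers reproducing these numbers, the two headline metrics can be derived from per-request timing records roughly as follows. This is a minimal sketch, not the benchmark script we ran: the RequestRecord layout and the nearest-rank percentile are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    # Hypothetical record layout; all times are seconds since benchmark start.
    sent_at: float
    first_token_at: float
    finished_at: float
    output_tokens: int


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n), 1-based
    return ordered[rank - 1]


def summarize(records):
    """Aggregate output throughput and P95 time to first token."""
    ttfts = [r.first_token_at - r.sent_at for r in records]
    wall = max(r.finished_at for r in records) - min(r.sent_at for r in records)
    total_tokens = sum(r.output_tokens for r in records)
    return {"tokens_per_s": total_tokens / wall, "p95_ttft_s": p95(ttfts)}
```

Throughput here counts generated tokens over wall-clock time for the whole run. A different definition (for example, one that includes prompt tokens) would shift the absolute numbers.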

Lever 1 — continuous batching

vLLM uses PagedAttention out of the box, but in our setup the default scheduler still aligned batches to request boundaries. Switching to --scheduler-policy continuous gives per-step scheduling, so a finished request frees its slot immediately:

Throughput: 1820 → 3120 tokens/s
P95 TTFT: 1.4s → 0.9s
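
A toy model shows why per-step scheduling helps. The sketch below is not vLLM's scheduler; it only counts decode steps under two policies and ignores prefill, KV memory limits, and arrival times. The function names and request lengths are made up for illustration.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when free slots are refilled from the queue every step."""
    waiting = deque(lengths)
    running = []  # remaining output tokens per in-flight request
    steps = 0
    while waiting or running:
        # Admit queued requests into any free slots before this step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step: every running request emits one token,
        # and requests that just emitted their last token leave the batch.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


# Illustrative output lengths (tokens), mixing short and long answers.
lengths = [10, 200, 15, 180, 12, 190, 8, 20]
print(static_batching_steps(lengths, 4))      # 390
print(continuous_batching_steps(lengths, 4))  # 205
```

In the static case, short answers leave their slots idle until the longest request in the batch completes. With per-step admission those slots are reused right away, which is the effect behind the throughput and TTFT gains above.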

Lever 2 — KV cache hit rate

Q&A traffic shares a long system prompt (~800 tokens), so we enabled prefix caching:

--enable-prefix-caching --max-num-batched-tokens 16384

Cache hit rate climbed from 12% to 71%; TTFT dropped to 380ms.
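
The idea behind prefix caching can be sketched with chained block hashes: each KV block is keyed by its own tokens plus everything before it, so two prompts that share a prefix share the leading keys. The code below is a simplified illustration, not vLLM's implementation. The block size of 16, the PrefixCache class, and the token-level hit-rate definition are assumptions.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per KV block; an assumed value for illustration


def block_keys(token_ids, block_size=BLOCK_SIZE):
    """Chain-hash each full block so a key identifies the whole prefix up to it."""
    keys, parent = [], b""
    full = len(token_ids) - len(token_ids) % block_size  # drop the partial tail
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        digest = hashlib.sha256(parent + repr(block).encode()).digest()
        keys.append(digest)
        parent = digest
    return keys


class PrefixCache:
    """Toy cache with no eviction; tracks a token-level hit rate."""

    def __init__(self):
        self.blocks = set()
        self.hit_tokens = 0
        self.total_tokens = 0

    def admit(self, token_ids):
        """Return how many leading tokens were already cached, then cache the prompt."""
        keys = block_keys(token_ids)
        hits = 0
        for key in keys:
            if key not in self.blocks:
                break  # the first miss ends the reusable prefix
            hits += 1
        self.blocks.update(keys)
        self.hit_tokens += hits * BLOCK_SIZE
        self.total_tokens += len(token_ids)
        return hits * BLOCK_SIZE

    @property
    def hit_rate(self):
        return self.hit_tokens / self.total_tokens if self.total_tokens else 0.0


system_prompt = list(range(800))            # stand-in for the ~800-token system prompt
cache = PrefixCache()
cache.admit(system_prompt + [1001] * 40)    # first request: nothing cached yet
reused = cache.admit(system_prompt + [2001] * 30)
print(reused)  # 800
```

Because the keys are chained, a single differing token early in the prompt invalidates every later block. That is why keeping the shared system prompt byte-identical and at the very start of each request matters for the hit rate.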

Lever 3 — FP8 weights

AWQ 4-bit doesn't saturate the H800A Tensor Cores. We switched to FP8 (per-token + per-channel scaling) on vLLM 0.7, together with an FP8 KV cache:

Throughput: 3120 → 4520 tokens/s
Per-card weight footprint: 38 → 31 GB
BLEU / ROUGE regression vs AWQ-4: < 0.3%
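
To make the scaling scheme concrete, here is a pure-Python sketch of one common reading of "per-token + per-channel": one dynamic scale per activation row (token) and one static scale per weight output channel. It is an assumption about the scheme, not the kernel vLLM runs. The E4M3 rounding is approximated and ignores subnormals, and all function names are hypothetical.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value in the E4M3 format


def round_e4m3(x):
    """Approximate E4M3 rounding: keep 4 significant bits, clamp to the max.

    Simplification: subnormals and the minimum exponent are ignored.
    """
    if x == 0.0:
        return 0.0
    mantissa, exponent = math.frexp(x)  # x = mantissa * 2**exponent, 0.5 <= |mantissa| < 1
    rounded = math.ldexp(round(mantissa * 16) / 16, exponent)
    return max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, rounded))


def quantize_rows(matrix):
    """One scale per row: per-token for activations, per-output-channel for weights."""
    # An all-zero row gets scale 1.0 to avoid dividing by zero.
    scales = [max(abs(v) for v in row) / FP8_E4M3_MAX or 1.0 for row in matrix]
    q = [[round_e4m3(v / s) for v in row] for row, s in zip(matrix, scales)]
    return q, scales


def fp8_matmul(x, w):
    """y[t][c] = (sum_k xq[t][k] * wq[c][k]) * x_scale[t] * w_scale[c]; w is [out, in]."""
    xq, xs = quantize_rows(x)
    wq, ws = quantize_rows(w)
    return [[sum(a * b for a, b in zip(xr, wr)) * sx * sw
             for wr, sw in zip(wq, ws)]
            for xr, sx in zip(xq, xs)]


x = [[1.0, -2.0, 0.5]]                      # one token, three input features
w = [[0.25, 0.5, 1.0], [3.0, 0.0, -1.0]]    # two output channels
print(fp8_matmul(x, w))  # close to the exact result [[-0.25, 2.5]]
```

Both scales factor out of the inner sum, so the accumulation runs entirely on the quantized values and the rescale is a single multiply per output element. That property is what lets this scheme map onto FP8 Tensor Core matmuls.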

Takeaways

4500 tokens/s on a single 8×H800A box works out to 4500 × 86,400 s ≈ 390M tokens/day, so one node replaces six A100 nodes. The order matters: start with scheduling (continuous batching) and caching (prefix caching), then pick precision (FP8). Reproducible on Alaya ALab and CCI; see "AI Model Inference → Inference".