Pushing vLLM to 4500 tokens/s on H100
A single 8×H100 node serving Qwen3-72B-Instruct (quantized). End-to-end notes on paged attention, continuous batching, and KV-cache hit-rate tuning — a full-stack throughput hunt.
Background
A finance customer needed to host Qwen3-72B-Instruct on their own H100 nodes for an internal Q&A bot. Targets: P95 first-token latency ≤ 800ms, 200 RPS sustained. We tuned a standard Alaya 8×H100 node (NVLink-connected, NVMe-cached weights).
Baseline
Stock vLLM 0.6.3 with 4-bit AWQ weights. ~38 GB per card occupied by weights, leaving ~40 GB for KV. Only flag set: --tensor-parallel-size 8. ShareGPT-4k offline benchmark:
- Throughput: 1820 tokens/s
- P95 TTFT: 1.4s
- GPU utilization: ~62%
Lever 1 — continuous batching
vLLM is PagedAttention out of the box, but the default scheduler still aligns batches to request boundaries. Switching to --scheduler-policy continuous gives per-step scheduling:
- Throughput: 1820 → 3120 tokens/s
- P95 TTFT: 1.4s → 0.9s
Lever 2 — KV cache hit rate
Q&A traffic has a heavy shared system prompt (~800 tokens). Prefix caching:
--enable-prefix-caching --max-num-batched-tokens 16384Cache hit rate climbed from 12% to 71%; TTFT dropped to 380ms.
Lever 3 — FP8 weights
AWQ 4-bit doesn't saturate H100 Tensor Cores. Switching to FP8 (per-token + per-channel scaling) on vLLM 0.7 with FP8 KV cache:
- Throughput: 3120 → 4520 tokens/s
- Per-card weight footprint: 38 → 31 GB
- BLEU / ROUGE regression vs AWQ-4: < 0.3%
Takeaways
4500 tokens/s on a single 8×H100 box ≈ 390M tokens/day — one node replaces six A100 nodes. The order matters: start with scheduling (continuous batching) and caching (prefix caching), then pick precision (FP8). Reproducible on Alaya ALab and CCI; see "AI Model Inference → Inference".
Last updated on
Docker Essentials & Mirror Setup (2026)
Image pulls time out, Docker Hub rate-limits, public accelerators keep going dark — here is the config we are actually shipping to customers in 2026.
Network topology for kilo-GPU training — from Fat-Tree to Dragonfly+
Why does AllReduce tail latency scale non-linearly with cluster size? With NCCL topology-aware rewrites, we lifted a 1024-GPU cluster from 71% to 89% MFU.
