Alaya NeW Cloud

HyperTrain

Managed distributed training for large models — templated, observable, fault-tolerant

HyperTrain is the managed distributed training platform on Alaya NeW Cloud. Submit PyTorch / DeepSpeed / Megatron jobs through templates; the platform handles topology-aware scheduling, checkpoint resume, telemetry, and fault recovery.

Why HyperTrain

  • Templates: first-class support for PyTorch DDP, DeepSpeed ZeRO-3, and Megatron-LM
  • Topology-aware: nodes are packed within the same NVLink / InfiniBand (IB) domain to avoid cross-domain communication bottlenecks
  • Resume-on-failure: automatic checkpointing, with transparent restart on node loss or spot eviction
  • Observability: loss, grad-norm, throughput, and inter-GPU bandwidth metrics out of the box
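
To make the resume-on-failure idea concrete, here is a minimal, framework-free sketch of a training loop that can be restarted after an interruption. It is not HyperTrain's implementation or API: the function names (save_checkpoint, load_checkpoint, train), the JSON checkpoint format, and the fail_at parameter used to simulate a node loss are all hypothetical, and a real job would save model and optimizer state instead of a toy dictionary.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so a crash mid-write never leaves a torn file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace is an atomic rename when source and target share a filesystem.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"step": 0, "loss": None}
    with open(path) as f:
        return json.load(f)


def train(path, total_steps, save_every=10, fail_at=None):
    """Toy loop (hypothetical): resumes from the last saved step.

    fail_at simulates a node loss or spot eviction at the given step.
    """
    state = load_checkpoint(path)
    for step in range(state["step"], total_steps):
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated node loss")
        # Stand-in for a real optimizer step.
        state = {"step": step + 1, "loss": 1.0 / (step + 1)}
        if state["step"] % save_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

Calling train("ckpt.json", 100, fail_at=35) raises partway through, and a second call to train("ckpt.json", 100) picks up from the last saved step rather than from zero. The write-to-temp-then-rename pattern matters because an eviction can arrive in the middle of a checkpoint write.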

Get started

Relationship with VKS

HyperTrain runs on top of VKS. You can submit jobs through the HyperTrain UI/API (recommended), or write your own Kubernetes YAML and apply it directly to VKS for full flexibility.
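
For the direct-to-VKS path, the following sketch builds a plain Kubernetes batch/v1 Job manifest in Python and prints it as JSON, which kubectl accepts alongside YAML. Everything specific here is an assumption for illustration: the build_training_job helper, the job name, the image, the command, and the use of the nvidia.com/gpu resource name (the one exposed by the NVIDIA device plugin, which VKS may or may not use). It describes a single-pod job only; multi-node training needs additional wiring that this page does not specify.

```python
import json


def build_training_job(name, image, command, gpus_per_pod, namespace="default"):
    """Return a Kubernetes batch/v1 Job manifest as a plain dict (hypothetical helper)."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Retry the pod a few times before marking the Job as failed.
            "backoffLimit": 3,
            "template": {
                "spec": {
                    "restartPolicy": "OnFailure",
                    "containers": [
                        {
                            "name": "trainer",
                            "image": image,
                            "command": command,
                            # Assumed GPU resource name; check what the cluster exposes.
                            "resources": {"limits": {"nvidia.com/gpu": gpus_per_pod}},
                        }
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    manifest = build_training_job(
        name="example-train",                   # placeholder name
        image="example.invalid/trainer:latest",  # placeholder image
        command=["python", "train.py"],          # placeholder entry point
        gpus_per_pod=8,
    )
    print(json.dumps(manifest, indent=2))
```

Saving the printed output to a file and running kubectl apply -f on it is the usual way to submit such a manifest to a Kubernetes cluster. Jobs submitted this way bypass the HyperTrain templates, so scheduling, checkpointing, and telemetry are yours to configure.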
