HyperTrain
Managed distributed training for large models — templated, observable, fault-tolerant
HyperTrain is the managed distributed training platform on Alaya NeW Cloud. Submit PyTorch / DeepSpeed / Megatron jobs through templates; the platform handles topology-aware scheduling, checkpoint resume, telemetry, and fault recovery.
Why HyperTrain
- Templates: first-class support for PyTorch DDP, DeepSpeed ZeRO-3, and Megatron-LM
- Topology-aware: nodes are packed within the same NVLink / IB domain to avoid cross-domain communication bottlenecks
- Resume-on-failure: automatic checkpointing and transparent restart on node loss or spot eviction
- Observability: loss, grad-norm, throughput, and inter-GPU bandwidth metrics out of the box
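Resume-on-failure works best when the training script itself can pick up from the latest checkpoint. This page does not describe how HyperTrain stores or restores checkpoints, so the following is only a minimal sketch of a resume-friendly loop. The function names, the JSON checkpoint format, and the `fail_at` parameter are all hypothetical and exist purely for illustration.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so a crash never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # atomic rename within one filesystem


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "loss": None}


def train(path, total_steps, checkpoint_every=10, fail_at=None):
    """Toy loop (hypothetical): resume from the last checkpoint, save periodically."""
    state = load_checkpoint(path)
    for step in range(state["step"], total_steps):
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated node loss")
        # Placeholder for a real optimizer step; the loss value is made up.
        state = {"step": step + 1, "loss": 1.0 / (step + 1)}
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

If a run is interrupted between checkpoints, calling `train` again with the same path repeats only the steps taken since the last saved checkpoint. A real job would save model and optimizer state rather than a JSON dictionary.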
Get started
Create a training job
Pick a framework, resources, storage, and an image, then submit
Job detail
Basics, config, pods, monitoring, logs
Pod detail
Access, containers, volumes, scheduling, events
Job management
Pause, restart, copy, delete
Templates
Create and reuse templates, or save a job as a template
Relationship with VKS
HyperTrain runs on top of VKS. You can submit jobs through the HyperTrain UI/API (recommended), or write your own Kubernetes YAML and apply it directly to VKS for full flexibility.
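For the direct-to-VKS path, the sketch below shows roughly what a hand-written manifest might contain. It assumes VKS accepts standard Kubernetes `batch/v1` Job objects and that GPUs are exposed under the common `nvidia.com/gpu` resource name; neither is confirmed on this page. The job name, image, and command are hypothetical placeholders. The manifest is built as a Python dictionary and printed as JSON, which `kubectl apply -f` accepts alongside YAML.

```python
import json


def build_training_job(name, image, command, gpus_per_pod=8, num_pods=2):
    """Return a standard Kubernetes batch/v1 Indexed Job as a plain dict.

    Omits the headless Service and rendezvous settings that a real
    multi-node run would also need.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "completionMode": "Indexed",  # each pod gets a stable index
            "completions": num_pods,
            "parallelism": num_pods,
            "backoffLimit": 0,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "trainer",
                        "image": image,
                        "command": command,
                        "resources": {
                            "limits": {"nvidia.com/gpu": gpus_per_pod},
                        },
                    }],
                }
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical values; replace with your own image and entry point.
    manifest = build_training_job(
        name="ddp-demo",
        image="registry.example.com/train:latest",
        command=["torchrun", "train.py"],
    )
    print(json.dumps(manifest, indent=2))
```

Writing the manifest yourself means you also take on the scheduling, checkpoint resume, and telemetry setup that the HyperTrain templates otherwise handle, which is why the UI/API path is recommended.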
