HyperTrain

Managed distributed training for large models — templated, observable, fault-tolerant

HyperTrain is the managed distributed training platform on Alaya NeW Cloud. Submit PyTorch / DeepSpeed / Megatron jobs through templates; the platform handles topology-aware scheduling, checkpoint resume, telemetry, and fault recovery.

Why HyperTrain

Templates — first-class support for PyTorch DDP, DeepSpeed ZeRO-3, Megatron-LM
Topology-aware — nodes are packed within the same NVLink / IB domain to avoid cross-domain communication bottlenecks
Resume-on-failure — automatic checkpointing, transparent restart on node loss or spot eviction
Observability — loss, grad-norm, throughput, inter-GPU bandwidth out of the box
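
For resume-on-failure to pay off, the training script itself has to pick up from the last checkpoint when it is restarted. The sketch below illustrates that pattern using only the Python standard library. It is not HyperTrain's actual mechanism, which this page does not describe: the file name `checkpoint.json`, the helper names, and the toy "loss" are all hypothetical, and a real job would save model and optimizer state through its framework's own checkpoint API.

```python
import json
import os
import tempfile

CKPT_NAME = "checkpoint.json"  # hypothetical file name, chosen for this sketch


def save_checkpoint(ckpt_dir, state):
    """Write the checkpoint atomically so a crash never leaves a partial file."""
    os.makedirs(ckpt_dir, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=ckpt_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace is atomic when source and target are on the same filesystem.
    os.replace(tmp_path, os.path.join(ckpt_dir, CKPT_NAME))


def load_checkpoint(ckpt_dir):
    """Return the last saved state, or a fresh state if none exists."""
    path = os.path.join(ckpt_dir, CKPT_NAME)
    if not os.path.exists(path):
        return {"step": 0, "loss": None}
    with open(path) as f:
        return json.load(f)


def train(ckpt_dir, total_steps, save_every=10, fail_at=None):
    """Toy training loop that resumes from the last checkpoint.

    fail_at simulates a node loss or eviction at the given step.
    """
    state = load_checkpoint(ckpt_dir)
    for step in range(state["step"], total_steps):
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated node loss")
        # Stand-in for a real forward/backward pass.
        state = {"step": step + 1, "loss": 1.0 / (step + 1)}
        if state["step"] % save_every == 0:
            save_checkpoint(ckpt_dir, state)
    return state
```

If the first run fails partway, calling `train` again with the same `ckpt_dir` continues from the last saved step instead of step 0. The work lost to a failure is bounded by `save_every`, so the checkpoint interval is a trade-off between I/O overhead and recomputation after a restart.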

Get started

Create a training job

Pick a framework, resources, storage, and an image, then submit

Job detail

Basics, config, pods, monitoring, logs

Pod detail

Access, containers, volumes, scheduling, events

Job management

Pause, restart, copy, delete

Templates

Create and reuse templates, or save an existing job as a template

Relationship with VKS

HyperTrain runs on top of VKS. You can submit jobs through the HyperTrain UI/API (recommended), or apply your own Kubernetes YAML directly to VKS for full flexibility.
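
As a rough illustration of the direct route, the sketch below builds a plain Kubernetes `batch/v1` Job manifest in Python and serializes it to JSON, which `kubectl apply -f` accepts alongside YAML. It assumes VKS accepts standard Kubernetes Job objects and exposes GPUs under the `nvidia.com/gpu` resource name; neither is confirmed on this page. The function name, job name, image, and command are hypothetical placeholders.

```python
import json


def build_training_job(name, image, command, gpus_per_pod, pods):
    """Build a minimal Indexed Job manifest with one trainer container per pod."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # Indexed mode gives each pod a stable index via the
            # JOB_COMPLETION_INDEX environment variable, usable as a node rank.
            "completionMode": "Indexed",
            "completions": pods,
            "parallelism": pods,
            "backoffLimit": 3,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "trainer",
                            "image": image,
                            "command": command,
                            "resources": {
                                # Assumed GPU resource name (NVIDIA device plugin).
                                "limits": {"nvidia.com/gpu": gpus_per_pod}
                            },
                        }
                    ],
                }
            },
        },
    }


manifest = build_training_job(
    name="demo-train",                          # hypothetical job name
    image="registry.example.com/train:latest",  # hypothetical image
    command=["python", "train.py"],             # hypothetical entry point
    gpus_per_pod=8,
    pods=2,
)
manifest_json = json.dumps(manifest, indent=2)
```

A manifest like this leaves rendezvous between pods, checkpoint storage, and restart handling to you, which is the flexibility-versus-convenience trade-off against submitting through HyperTrain.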