Alaya NeW Cloud

Kubernetes for AI Engineers — concepts and a working minimum

Pod, Deployment, Job, PVC, StatefulSet — only the parts you actually use in GPU workflows. Skip the rest until you hit it.

Why this exists

About 70% of the engineers at GPU-cloud customers have never written any Kubernetes YAML, but the moment you orchestrate training, ship inference, or run batch jobs, you can't avoid it. Below is the minimum I keep teaching in customer onboarding sessions: memorize this and you can ship, then learn the rest when you hit a wall.

1. The mental model that fits on a napkin

Pod         → smallest scheduling unit, one or more containers sharing net/storage
Deployment  → pool of stateless Pods with rolling updates (use this for inference)
StatefulSet → stable names + stable volumes (distributed-training workers)
Job         → run-once-and-finish (one-off finetune, batch inference)
CronJob     → a Job on a schedule
Service     → stable IP / DNS in front of a group of Pods
Ingress     → external HTTP routing
PVC + PV    → the storage request (PVC) and the actual persistent volume bound to it (PV)
ConfigMap   → inject configuration
Secret      → inject credentials (base64, not encryption)
Namespace   → soft tenancy + quota boundary

One-liner to keep: K8s scales "things one machine can do" into "a fleet declaratively doing the same thing" — you write YAML for the desired state, controllers reconcile reality.
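
As a small illustration of "desired state" and of the Namespace as a quota boundary, here is a minimal sketch of a namespace capped at 8 GPUs. The names team-a and gpu-quota and the number 8 are assumptions chosen for the example, not values from any real cluster.

```yaml
# Sketch: a namespace whose Pods may request at most 8 GPUs in total.
apiVersion: v1
kind: Namespace
metadata:
  name: team-a                      # hypothetical tenant name
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota                   # hypothetical quota name
  namespace: team-a
spec:
  hard:
    # Extended resources such as GPUs are quota'd through the "requests." prefix.
    requests.nvidia.com/gpu: "8"    # assumption: 8 GPUs for this tenant
```

Once this is applied, a Pod in team-a that would push the namespace past 8 requested GPUs is rejected at creation time rather than left Pending.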

2. The eight kubectl commands

kubectl get pods -A                  # list pods in all namespaces
kubectl describe pod <p>             # why is it Pending / CrashLooping
kubectl logs -f <p> [-c <container>] # tail logs
kubectl exec -it <p> -- bash         # shell in
kubectl apply -f xxx.yaml            # declarative submit
kubectl delete -f xxx.yaml           # remove
kubectl port-forward <p> 8080:8080   # local tunnel
kubectl top pod                      # resource usage (needs metrics-server)

90% of on-site triage: describe for Events → logs for stderr → exec to poke at the runtime.

3. Minimal training Job

Single-GPU finetune:

apiVersion: batch/v1
kind: Job
metadata:
  name: lora-sft-qwen
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        nvidia.com/gpu.product: H100-SXM5   # label value is cluster-specific; check: kubectl get nodes -L nvidia.com/gpu.product
      containers:
      - name: trainer
        image: registry.alayanew.com/llamafactory:0.9
        command: ["bash", "-lc", "llamafactory-cli train cfg.yaml"]
        resources:
          limits:
            nvidia.com/gpu: 1
            cpu: "16"
            memory: 128Gi
        volumeMounts:
        - { name: data,  mountPath: /data }
        - { name: model, mountPath: /model }
      volumes:
      - name: data
        persistentVolumeClaim: { claimName: pvc-data }
      - name: model
        persistentVolumeClaim: { claimName: pvc-model }

Notes:

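  • nvidia.com/gpu goes in limits; requests defaults to the same value. You may also set requests, but only together with limits and with an equal value: a GPU request without a limit, or one that differs from the limit, fails API validation.
  • Use a Job for training, not a Deployment: a Deployment's Pods always use restartPolicy: Always, so the trainer is restarted forever, on failure and even after it finishes successfully.
  • backoffLimit: 0 stops after the first failure, which makes triage easier.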

4. Minimal inference Deployment

apiVersion: apps/v1
kind: Deployment
metadata: { name: vllm-qwen }
spec:
  replicas: 2
  selector: { matchLabels: { app: vllm-qwen } }
  template:
    metadata: { labels: { app: vllm-qwen } }
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:v0.7.2
        args:
        - --model=/model/qwen3-72b
        - --tensor-parallel-size=4
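        resources:
          limits: { nvidia.com/gpu: 4 }
        ports: [{ containerPort: 8000 }]
        volumeMounts:
        - { name: model, mountPath: /model, readOnly: true }   # --model points under /model
        readinessProbe:
          httpGet: { path: /health, port: 8000 }
          initialDelaySeconds: 60
          periodSeconds: 5
      volumes:
      - name: model
        # Assumes the pvc-model claim from section 3. If the replicas land on
        # different nodes, the claim must be ReadWriteMany or ReadOnlyMany.
        persistentVolumeClaim: { claimName: pvc-model, readOnly: true }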
---
apiVersion: v1
kind: Service
metadata: { name: vllm-qwen }
spec:
  selector: { app: vllm-qwen }
  ports: [{ port: 80, targetPort: 8000 }]

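Always set a readinessProbe. Without it, a Pod counts as Ready as soon as its container starts, so the Service and rolling updates send traffic to a Pod that hasn't finished loading the model, and clients get connection errors or 503s.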
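
The readinessProbe only gates traffic. If you also want the kubelet to restart a hung server, a livenessProbe does that, and a startupProbe keeps it from firing while the model is still loading. The fragment below is a sketch of that combination for the vllm container; it reuses the /health path and port 8000 from the manifest above, and the periods and thresholds are illustrative assumptions to tune against your actual model load time.

```yaml
# Fragment: probes for the vllm container (goes under the container entry, not a full manifest).
startupProbe:                 # runs first; the other probes wait until it succeeds
  httpGet: { path: /health, port: 8000 }
  periodSeconds: 10
  failureThreshold: 60        # assumption: allow up to 60 x 10s = 10 min for model load
readinessProbe:               # gates Service traffic
  httpGet: { path: /health, port: 8000 }
  periodSeconds: 5
livenessProbe:                # restarts the container if the server stops responding
  httpGet: { path: /health, port: 8000 }
  periodSeconds: 10
  failureThreshold: 3
```

If the startupProbe never succeeds within its budget, the kubelet kills and restarts the container, so set the budget generously for large checkpoints.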

5. Pitfalls

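  1. Bad GPU placement: the stock scheduler only counts nvidia.com/gpu; it knows nothing about NVLink or NUMA topology. For LLM training, set the kubelet's topologyManagerPolicy to single-numa-node (this aligns CPUs and GPUs on one NUMA node, it does not model NVLink), or use the GPU-aware scheduler that ships with Alaya.
  2. PVC contention: a ReadWriteOnce volume can be mounted by only one node at a time, so Pods scheduled on other nodes get stuck and never start. Datasets shared by multiple workers need ReadWriteMany (CephFS / NAS).
  3. Silent OOMKilled: the container just exits with code 137; the only hint is "Reason: OOMKilled" in the container state shown by describe. Always check the describe output (State, Last State, Events) first.
  4. HostPort vs GPU: two Pods that use the same hostPort cannot share a node, which can leave half of a node's GPUs idle. Prefer a Service.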
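
For pitfall 2, a shared dataset claim might look like the sketch below. The claim name, the storage class name cephfs, and the size are assumptions; storage classes are cluster-specific, so list yours with kubectl get storageclass and pick one whose backend supports ReadWriteMany.

```yaml
# Sketch: a dataset volume that Pods on several nodes can mount at the same time.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-data-shared          # hypothetical name
spec:
  accessModes:
  - ReadWriteMany                # read-write from many nodes at once
  storageClassName: cephfs       # assumption: replace with an RWX-capable class in your cluster
  resources:
    requests:
      storage: 1Ti               # assumption: size to fit your dataset
```

Workers then reference it through persistentVolumeClaim.claimName exactly as the training Job in section 3 references pvc-data.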
