Kubernetes for AI Engineers — concepts and a working minimum
Pod, Deployment, Job, PVC, StatefulSet — only the parts you actually use in GPU workflows. Skip the rest until you hit it.
Why this exists
About 70% of GPU-cloud customers' engineers have never written K8s, but the moment you orchestrate training, ship inference, or run batch jobs you can't avoid it. Below is the minimum I keep teaching in customer onboarding sessions — memorize this and you can ship; learn the rest after you hit a wall.
1. The mental model that fits on a napkin
Pod → smallest scheduling unit, one or more containers sharing net/storage
Deployment → rolling-managed pool of stateless Pods (use this for inference)
StatefulSet → stable names + stable volumes (distributed-training workers)
Job → run-once-and-finish (one-off finetune, batch inference)
CronJob → a Job on a schedule
Service → stable IP / DNS in front of a group of Pods
Ingress → external HTTP routing
PVC + PV → request and the actual persistent volume
ConfigMap → inject configuration
Secret → inject credentials (base64, not encryption)
Namespace → soft tenancy + quota boundary
One-liner to keep: K8s scales "things one machine can do" into "a fleet declaratively doing the same thing". You write YAML for the desired state, and controllers reconcile reality to match it.
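The "base64, not encryption" remark on Secret is easy to check locally. A minimal sketch, assuming a made-up credential value (`hunter2` is hypothetical, not from any real setup):

```shell
# Hypothetical credential, for illustration only.
token='hunter2'

# base64 is the encoding used for values under a Secret's `data:` field.
encoded=$(printf '%s' "$token" | base64)
echo "stored in the manifest: $encoded"

# Anyone who can read the Secret can reverse it; no key is involved.
decoded=$(printf '%s' "$encoded" | base64 --decode)
echo "recovered: $decoded"
```

The practical consequence: treat read access to Secrets as access to the credentials themselves.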
2. The eight kubectl commands
kubectl get pods -A # list all pods
kubectl describe pod <p> # why is it Pending / CrashLooping
kubectl logs -f <p> [-c <container>] # tail logs
kubectl exec -it <p> -- bash # shell in
kubectl apply -f xxx.yaml # declarative submit
kubectl delete -f xxx.yaml # remove
kubectl port-forward <p> 8080:8080 # local tunnel
kubectl top pod # resource usage (needs metrics-server)
90% of on-site triage: describe for Events → logs for stderr → exec to poke at the runtime.
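The describe → logs part of that sequence can be wrapped in a small helper. This is only a sketch: `triage_pod` is a hypothetical function name, and it assumes `kubectl` already points at the right cluster.

```shell
# triage_pod POD [NAMESPACE]: run the describe -> logs sequence in one go.
triage_pod() {
  pod="$1"
  ns="${2:-default}"
  # 1. Events and container state: scheduling failures, image pull errors, OOMKilled.
  kubectl describe pod "$pod" -n "$ns"
  # 2. Output of the previous, crashed container instance (errors if there was none).
  kubectl logs "$pod" -n "$ns" --previous --tail=50 || true
  # 3. Output of the current container instance.
  kubectl logs "$pod" -n "$ns" --tail=50
}
```

Usage would look like `triage_pod my-pod my-namespace`; the interactive `exec` step is left out on purpose, since it only makes sense once the first two have narrowed things down.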
3. Minimal training Job
Single-GPU finetune:
apiVersion: batch/v1
kind: Job
metadata:
  name: lora-sft-qwen
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        nvidia.com/gpu.product: H100-SXM5
      containers:
        - name: trainer
          image: registry.alayanew.com/llamafactory:0.9
          command: ["bash", "-lc", "llamafactory-cli train cfg.yaml"]
          resources:
            limits:
              nvidia.com/gpu: 1
              cpu: "16"
              memory: 128Gi
          volumeMounts:
            - { name: data, mountPath: /data }
            - { name: model, mountPath: /model }
      volumes:
        - name: data
          persistentVolumeClaim: { claimName: pvc-data }
        - name: model
          persistentVolumeClaim: { claimName: pvc-model }
Notes:
- nvidia.com/gpu must go in limits; requests defaults to the same value. Setting it only in requests, or setting requests and limits to different values, is rejected by API validation.
- Use a Job for training, not a Deployment: a Deployment restarts the Pod forever on failure. backoffLimit: 0 stops after the first failure, which is easier to triage.
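Submitting the Job and waiting on it could look like the sketch below. `run_job` is a hypothetical helper, and the 6h default is an arbitrary assumption; pick a timeout that matches your run.

```shell
# run_job MANIFEST JOB_NAME [TIMEOUT]: submit a Job, wait for it, then show its logs.
run_job() {
  manifest="$1"
  job="$2"
  wait_for="${3:-6h}"
  status=0
  kubectl apply -f "$manifest" || return 1
  # A failed Job never reaches condition=complete, so the timeout is what ends the wait.
  kubectl wait --for=condition=complete "job/$job" --timeout="$wait_for" || status=1
  # On failure or timeout, the Job's events usually say why.
  [ "$status" -eq 0 ] || kubectl describe "job/$job"
  kubectl logs "job/$job" --tail=100
  return "$status"
}
```

For the manifest above, that would be `run_job job.yaml lora-sft-qwen`, assuming the YAML was saved as `job.yaml`.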
4. Minimal inference Deployment
apiVersion: apps/v1
kind: Deployment
metadata: { name: vllm-qwen }
spec:
  replicas: 2
  selector: { matchLabels: { app: vllm-qwen } }
  template:
    metadata: { labels: { app: vllm-qwen } }
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.7.2
          args:
            # /model must be mounted into the container (volume omitted here),
            # e.g. from the pvc-model claim used by the training Job.
            - --model=/model/qwen3-72b
            - --tensor-parallel-size=4
          resources:
            limits: { nvidia.com/gpu: 4 }
          ports: [{ containerPort: 8000 }]
          readinessProbe:
            httpGet: { path: /health, port: 8000 }
            initialDelaySeconds: 60
            periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata: { name: vllm-qwen }
spec:
  selector: { app: vllm-qwen }
  ports: [{ port: 80, targetPort: 8000 }]
Always set readinessProbe. Without it, rolling updates send traffic to a Pod that hasn't loaded the model yet, and clients get instant 503s.
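One way to confirm the probe is doing its job after an update is to wait for the rollout and then look at which Pods the Service actually targets. A sketch, where `check_rollout` is a hypothetical helper and the 15m timeout is an assumption sized for slow model loads:

```shell
# check_rollout DEPLOYMENT SERVICE: wait for a rollout, then list the Service's backends.
check_rollout() {
  deploy="$1"
  svc="$2"
  # Blocks until every new replica passes its readinessProbe, or the timeout expires.
  kubectl rollout status "deployment/$deploy" --timeout=15m || return 1
  # Only Ready Pods are listed as endpoints; an empty list means no backends.
  kubectl get endpoints "$svc"
}
```

For the manifests above: `check_rollout vllm-qwen vllm-qwen`.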
5. Pitfalls
- Bad GPU placement: the stock scheduler only counts nvidia.com/gpu, not NVLink topology. For LLM training, set the kubelet's topologyManagerPolicy to single-numa-node, or use the GPU-aware scheduler that ships with Alaya.
- PVC contention: ReadWriteOnce volumes bind to one node, so Pods on other nodes can't start. Multi-worker datasets need ReadWriteMany (CephFS / NAS).
- Silent OOMKilled: a Pod just exits 137; you only see "Reason: OOMKilled" in describe. Always check describe first.
- HostPort vs GPU: two Pods with the same hostPort can't share a node, which can leave half a node's GPUs idle. Prefer a Service.
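The 137 in the OOMKilled bullet is not arbitrary: exit codes above 128 mean "killed by signal (code minus 128)", and 137 - 128 = 9 is SIGKILL. A small sketch to decode that and to read the termination reason straight from the Pod status; both function names are hypothetical:

```shell
# signal_for_exit_code CODE: name the signal behind a >128 exit code, or "none".
signal_for_exit_code() {
  code="$1"
  if [ "$code" -gt 128 ]; then
    kill -l "$((code - 128))"
  else
    echo "none"
  fi
}

# termination_reason POD: print why the Pod's containers terminated (e.g. OOMKilled).
# For containers that were restarted, the reason sits under lastState instead of state.
termination_reason() {
  kubectl get pod "$1" \
    -o jsonpath='{.status.containerStatuses[*].state.terminated.reason}'
}
```

SIGKILL alone does not prove an OOM kill, which is why the reason field from the Pod status is still worth checking.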
