Tech Blog
Field notes from the engineering team — inference tuning, kilo-GPU training, network topology, compliance, compute strategy, and cost case studies. Updated monthly.
May 2026 · 7 posts
InfiniBand Fabric for GPU Clusters — bandwidth, rail-aligned topology, and triage
Why LLM clusters insist on InfiniBand, how a rail-aligned topology cuts training time by 30%, and the ibstat / ibdev2netdev / ib_write_bw routine that fixes most field issues.
NCCL-tests in practice — the mandatory pre-launch health check
A new cluster comes up. Don't run training yet. One sweep of NCCL-tests tells you whether it can host LLMs at all — here is how to read the busbw.
Faster HuggingFace model downloads — practical playbook
A 70B checkpoint is 140 GB. Direct from huggingface.co takes 4+ hours; over hf-mirror at line-rate 1 GbE it's 22 minutes. A working set of options for 2026.
GPU Monitoring & XID triage — six tables for on-call
What to read in nvidia-smi vs DCGM, what XID 79 / 31 / 119 actually mean, and when to RMA. A pocket reference for SREs and escalation engineers.
Replacing pip / poetry with uv — Python packaging + index setup for AI projects
pip takes 80s to install a vLLM stack; uv does it in 8s, and the lockfile is clean and reproducible. Here is the setup we ship to GPU customers.
Kubernetes for AI Engineers — concepts and a working minimum
Pod, Deployment, Job, PVC, StatefulSet — only the parts you actually use in GPU workflows. Skip the rest until you hit it.
Docker Essentials & Mirror Setup (2026)
Image pulls time out, Docker Hub rate-limits, public accelerators keep going dark — here is the config we are actually shipping to customers in 2026.
