Lifting multi-tenant isolation to Confidential Compute grade
MIG + Confidential VM, GID-pinned RDMA, NVMe cryptographic erase — the isolation stack we shipped so finance and government audits pass clean.
Context
Regulators keep tightening the screws, and our finance and government customers ask increasingly granular questions about how isolation actually works under the hood. "We use K8s namespaces" no longer cuts it — they want to see the full chain from silicon to NCCL collectives. Here's how we lifted Alaya NeW Cloud's multi-tenant isolation to Confidential Compute grade.
Problem: gaps in stock K8s multi-tenancy
- Pods from different namespaces on the same node share the GPU driver — VRAM residue can leak across tenants
- NCCL collectives go over RDMA without GID verification by default; a malicious tenant can spoof PD/QP and read the traffic
- Node-local NVMe caches aren't wiped, so the next tenant can recover the previous tenant's training data
Lever 1 — MIG + Confidential VM
H100 SXM5 MIG slices give you hardware-enforced VRAM/SM partitioning. We wrap each MIG instance in an NVIDIA Confidential VM: VRAM contents are encrypted the moment they leave the chip, so the hypervisor only ever sees ciphertext. Cross-tenant memory snooping is blocked at the silicon level.
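As a rough illustration of the MIG half of this lever, the sketch below builds the nvidia-smi invocations that enable MIG mode on one GPU and carve it into instances. The function name, the GPU index, and the 3g.40gb profile choice are assumptions for illustration, not the production configuration, and the Confidential VM wrapping is outside the scope of this sketch.

```python
from typing import List, Sequence


def mig_commands(gpu_index: int, profiles: Sequence[str]) -> List[List[str]]:
    """Return nvidia-smi argument lists that partition one GPU into MIG instances.

    `profiles` holds MIG profile names such as "3g.40gb" (illustrative choice).
    The commands are only built here, not executed.
    """
    if not profiles:
        raise ValueError("at least one MIG profile is required")
    gpu = str(gpu_index)
    return [
        # Enable MIG mode on the GPU (may need a GPU reset to take effect).
        ["nvidia-smi", "-i", gpu, "-mig", "1"],
        # Create the GPU instances and, with -C, their default compute instances.
        ["nvidia-smi", "mig", "-i", gpu, "-cgi", ",".join(profiles), "-C"],
    ]


if __name__ == "__main__":
    for cmd in mig_commands(0, ["3g.40gb", "3g.40gb"]):
        print(" ".join(cmd))
```

Building the argument lists separately from running them keeps the partitioning plan easy to review and log before anything touches the hardware.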
Lever 2 — GID-based RDMA spoof prevention
For NCCL over RDMA we bind each tenant's QPs to a dedicated GID subnet and enable SR-IOV PF isolation on ConnectX-7. The NCCL topology file is pushed by the control plane; tenant processes never see another tenant's LID/GID:
NCCL_IB_GID_INDEX=3 NCCL_IB_HCA=mlx5_2:1,mlx5_3:1

Bonus: cross-tenant AllReduce traffic no longer interferes; tail latency variance drops from ±18% to ±3%.
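One way a control plane could derive those variables per tenant is sketched below. The TenantFabric record, the function names, and the rule that no two tenants may share a device/port pair are assumptions made for illustration; only the two NCCL variable names and their value formats come from the settings above.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class TenantFabric:
    """Hypothetical control-plane record of one tenant's RDMA assignment."""
    tenant: str
    gid_index: int
    hcas: Tuple[Tuple[str, int], ...]  # (device name, port) pairs


def nccl_env(fabric: TenantFabric) -> Dict[str, str]:
    """Render the NCCL environment variables for one tenant's job."""
    hca_list = ",".join(f"{dev}:{port}" for dev, port in fabric.hcas)
    return {
        "NCCL_IB_GID_INDEX": str(fabric.gid_index),
        "NCCL_IB_HCA": hca_list,
    }


def check_no_shared_ports(fabrics: Iterable[TenantFabric]) -> None:
    """Raise ValueError if two different tenants are assigned the same device/port."""
    owner: Dict[Tuple[str, int], str] = {}
    for fabric in fabrics:
        for hca in fabric.hcas:
            previous = owner.setdefault(hca, fabric.tenant)
            if previous != fabric.tenant:
                raise ValueError(
                    f"{hca[0]}:{hca[1]} assigned to both {previous} and {fabric.tenant}"
                )


if __name__ == "__main__":
    tenant_a = TenantFabric("tenant-a", 3, (("mlx5_2", 1), ("mlx5_3", 1)))
    check_no_shared_ports([tenant_a])
    print(nccl_env(tenant_a))
```

Running the disjointness check before any environment is rendered means a bad assignment fails in the control plane rather than showing up later as cross-tenant traffic.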
Lever 3 — secure NVMe wipe on node reclaim
When an instance is released, the control plane runs nvme format --ses=2 (cryptographic erase) on every node-local drive. This averages 11 seconds for a 7.68 TB drive, two orders of magnitude faster than zeroing with dd. The node is only marked schedulable after the wipe finishes.
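The ordering guarantee (wipe first, schedule second) can be sketched as follows. The function name, the device paths, and the use of kubectl uncordon to mark the node schedulable are assumptions for illustration; the only command taken from the text is nvme format with --ses=2.

```python
import subprocess
from typing import Callable, Sequence


def reclaim_node(
    node: str,
    devices: Sequence[str],
    run: Callable[..., object] = subprocess.run,
) -> None:
    """Cryptographically erase every local NVMe device, then re-enable scheduling.

    With check=True, subprocess.run raises on a non-zero exit code, so a failed
    wipe aborts before the node is uncordoned. `run` is injectable for testing.
    """
    for dev in devices:
        # Secure Erase Setting 2 = cryptographic erase.
        run(["nvme", "format", dev, "--ses=2"], check=True)
    # Assumed mechanism for marking the node schedulable again.
    run(["kubectl", "uncordon", node], check=True)


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    reclaim_node(
        "gpu-node-01",
        ["/dev/nvme0n1", "/dev/nvme1n1"],
        run=lambda cmd, check: print(" ".join(cmd)),
    )
```

For a sense of scale, zeroing 7.68 TB at an assumed 2 GB/s sustained write rate would take about 3,840 seconds (roughly 64 minutes), which is about 350 times the 11-second cryptographic erase.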
Takeaways
The whole stack passed China's MLPS 2.0 Level 3 and a finance customer's pen test. In product terms: customers flip a "Security & Compliance Mode" toggle in the console; the surcharge is roughly +12% over standard instances, but it lets compliance audits pass cleanly — a hard requirement for finance and government workloads.
