Lifting multi-tenant isolation to Confidential Compute grade

MIG + Confidential VM, GID-pinned RDMA, NVMe cryptographic erase — the isolation stack we shipped so finance and government audits pass clean.

Context

Regulators keep tightening the screws, and our finance and government customers ask increasingly granular questions about how isolation actually works under the hood. "We use K8s namespaces" no longer cuts it — they want to see the full chain from silicon to NCCL collectives. Here's how we lifted Alaya NeW Cloud's multi-tenant isolation to Confidential Compute grade.

Problem: gaps in stock K8s multi-tenancy

Pods from different namespaces on the same node share the GPU driver, so VRAM residue can leak across tenants.
NCCL collectives run over RDMA with no GID verification by default, so a malicious tenant can spoof protection domain (PD) and queue pair (QP) identifiers and read the traffic.
Node-local NVMe caches aren't wiped on release, so the next tenant can recover the previous tenant's training data.

Lever 1 — MIG + Confidential VM

H100 SXM5 MIG slices give you hardware-enforced partitioning of VRAM and SMs. We wrap each MIG instance in an NVIDIA Confidential VM: VRAM contents are encrypted the moment they leave the chip, so the hypervisor only ever sees ciphertext. Cross-tenant memory snooping is blocked at the silicon level.
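To make the MIG half of this lever concrete, here is a minimal sketch of how a control plane might assemble the nvidia-smi calls for one tenant's slices. The MigRequest record and the mig_commands helper are hypothetical names invented for illustration, the profile names assume an 80 GB H100, and the Confidential VM launch is left out entirely. The function only builds argument lists; it does not execute anything.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MigRequest:
    """Hypothetical control-plane record: one tenant's slice request on one GPU."""
    gpu_index: int
    profiles: tuple[str, ...]  # MIG profile names, e.g. "3g.40gb"


def mig_commands(req: MigRequest) -> list[list[str]]:
    """Return the nvidia-smi invocations (as argv lists) for this request.

    Nothing is executed here; a caller would pass each list to subprocess.run.
    """
    if not req.profiles:
        raise ValueError("at least one MIG profile is required")
    gpu = str(req.gpu_index)
    return [
        # Enable MIG mode on the GPU (a GPU reset may be needed to apply it).
        ["nvidia-smi", "-i", gpu, "-mig", "1"],
        # Create the GPU instances; -C also creates their compute instances.
        ["nvidia-smi", "mig", "-i", gpu, "-cgi", ",".join(req.profiles), "-C"],
        # List the resulting GPU instances, e.g. for an audit log.
        ["nvidia-smi", "mig", "-i", gpu, "-lgi"],
    ]


if __name__ == "__main__":
    for argv in mig_commands(MigRequest(gpu_index=0, profiles=("3g.40gb", "3g.40gb"))):
        print(" ".join(argv))
```

Returning argv lists rather than shelling out keeps the layout decision reviewable and testable before anything touches the GPU.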

Lever 2 — GID-based RDMA spoof prevention

For NCCL over RDMA we bind each tenant's QPs to a dedicated GID subnet and enable SR-IOV PF isolation on ConnectX-7. The NCCL topology file is pushed by the control plane; tenant processes never see another tenant's LID/GID:

NCCL_IB_GID_INDEX=3 NCCL_IB_HCA=mlx5_2:1,mlx5_3:1

Bonus: cross-tenant AllReduce no longer interferes; tail latency variance drops from ±18% to ±3%.
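As an illustration of the control-plane push, the following sketch renders a tenant's NCCL environment from a per-tenant record. TenantFabric and nccl_env are hypothetical names, not part of NCCL or of our control plane's actual API. Only the two environment variables come from the text above. These variables select which GID index and HCA ports NCCL uses; the enforcement itself is assumed to live in the NIC and SR-IOV configuration, since a process can always override its own environment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantFabric:
    """Hypothetical record of the RDMA resources pinned to one tenant."""
    gid_index: int
    hca_ports: tuple[tuple[str, int], ...]  # (device name, port number)


def nccl_env(fabric: TenantFabric) -> dict[str, str]:
    """Build the NCCL environment variables for a tenant's jobs."""
    if not fabric.hca_ports:
        raise ValueError("tenant has no HCA ports assigned")
    # NCCL_IB_HCA takes a comma-separated list of device:port entries.
    hca = ",".join(f"{dev}:{port}" for dev, port in fabric.hca_ports)
    return {
        "NCCL_IB_GID_INDEX": str(fabric.gid_index),
        "NCCL_IB_HCA": hca,
    }


if __name__ == "__main__":
    tenant_a = TenantFabric(gid_index=3, hca_ports=(("mlx5_2", 1), ("mlx5_3", 1)))
    print(" ".join(f"{k}={v}" for k, v in nccl_env(tenant_a).items()))
```

With the example record, the printed line matches the setting shown earlier in this section.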

Lever 3 — secure NVMe wipe on node reclaim

When an instance is released, the control plane forces an nvme format --ses=2 (cryptographic erase) on the node's drives. This averages 11 seconds for a 7.68 TB drive, two orders of magnitude faster than zeroing with dd. The node is only marked schedulable again after the wipe finishes.
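The wipe-before-reschedule ordering can be sketched as follows. This is a minimal illustration under stated assumptions, not our production reclaim code: reclaim_node and run_cmd are hypothetical helpers, and using kubectl cordon and uncordon as the scheduling gate is an assumption about how "schedulable" is toggled. The runner is injectable so the flow can be exercised without touching real hardware.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], int]  # takes argv, returns the exit code


def run_cmd(argv: Sequence[str]) -> int:
    """Default runner: execute the command and return its exit code."""
    return subprocess.run(list(argv), check=False).returncode


def reclaim_node(node: str, devices: Sequence[str], run: Runner = run_cmd) -> bool:
    """Cordon the node, crypto-erase each device, uncordon only if all succeed."""
    if run(["kubectl", "cordon", node]) != 0:
        return False
    for dev in devices:
        # --ses=2 requests a cryptographic erase, as described above.
        if run(["nvme", "format", dev, "--ses=2"]) != 0:
            # Leave the node cordoned so no tenant lands on an unwiped drive.
            return False
    return run(["kubectl", "uncordon", node]) == 0
```

The key design choice is that failure is the safe state: if any erase fails, the node stays unschedulable until an operator intervenes. For scale, zero-filling 7.68 TB at an assumed 2 GB/s sustained write would take about 3,840 seconds, roughly 350 times the 11-second figure.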

Takeaways

The whole stack passed China's MLPS 2.0 Level 3 assessment and a finance customer's pen test. In product terms: customers flip a "Security & Compliance Mode" toggle in the console. The surcharge is roughly 12% over standard instances, but it lets compliance audits pass cleanly, which is a hard requirement for finance and government workloads.