1. Hardware Isolation of Nvidia MIG: From Architectural Design to Engineering Compromise
MIG partitions an A100/H100 into up to 7 independent instances, each exclusively occupying 32-512 SM units, 1.8-24GB of VRAM, and its own L2 cache slices. Hardware-level isolation is achieved through **instance ID tags**: the instruction queues, register files, and memory-controller accesses of each SM carry the instance ID, and out-of-bounds access directly triggers a hardware exception. This design reduces the memory out-of-bounds failure rate by two orders of magnitude, but at the cost of **fixed instance ratios** (e.g., 1 large plus 6 small), which cannot be adjusted dynamically. In one autonomous-driving training scenario, the memory fragmentation rate of MIG instances reached 15-20%, versus < 5% for container virtualization solutions.
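To see why fixed ratios fragment memory, here is a minimal Python sketch: profile sizes follow the public A100-40GB MIG geometry, the job sizes are made-up examples, and each job must round up to the smallest profile that fits, stranding the headroom.

```python
# Sketch: why fixed MIG profiles strand VRAM. Profile sizes follow the
# public A100-40GB MIG geometry; job sizes are illustrative examples.
MIG_PROFILES_GB = [5, 10, 20, 40]  # 1g.5gb, 2g.10gb, 3g.20gb, 7g.40gb

def smallest_fitting_profile(job_gb: float) -> int:
    """Pick the smallest fixed profile that can hold the job."""
    for size in MIG_PROFILES_GB:
        if job_gb <= size:
            return size
    raise ValueError(f"job of {job_gb} GB exceeds the largest profile")

def fragmentation_rate(jobs_gb: list[float]) -> float:
    """Fraction of allocated VRAM that sits idle inside oversized slices."""
    allocated = sum(smallest_fitting_profile(j) for j in jobs_gb)
    used = sum(jobs_gb)
    return (allocated - used) / allocated

# Example: jobs that almost, but not quite, fit the next-smaller profile
jobs = [6.0, 11.5, 3.8, 21.0]   # GB actually needed by each job
print(f"fragmentation: {fragmentation_rate(jobs):.0%}")
```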
2. Kernel-Level Breakthroughs in Cloud Vendor Container Solutions: From API Hijacking to Context Management
- **Tencent Cloud's qGPU context sandbox** maintains an independent **GPU context** for each container at the kernel-driver level, including register state, CUDA stream queues, and the VRAM mapping table. When container A's kernel times out, qGPU uses **preemptive interrupts** (based on NVLink hardware signals) to forcibly pause it, saving the context to shared memory (with a latency of < 10μs) and switching to container B's context. This mechanism achieves QoS accuracy at the 1ms level, roughly 100 times finer than Alibaba Cloud's cGPU non-preemptive scheduling (which relies on tasks ending naturally); the first sketch after this list walks through the contrast.
- **Alibaba Cloud's cGPU memory water level control** intercepts `cudaMalloc` through a kernel module, setting **soft memory quotas** for each container (e.g., limiting inference containers to 2GB). When actual usage exceeds the 80% threshold, it triggers **lazy reclamation**: marking unused page tables for priority reuse during the next allocation. In one e-commerce recommendation scenario, this strategy improved VRAM utilization by 35%, but without hardware support it carries a 10-15% risk of over-commitment; the second sketch below models the quota accounting.
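To make the QoS difference concrete, here is a minimal Python sketch (illustrative numbers, not measured qGPU/cGPU data) of how a preemptive time slice bounds the wait of a latency-sensitive container, while non-preemptive scheduling leaves it at the mercy of the running kernel:

```python
# Sketch: why preemption bounds QoS latency. Kernel durations are
# illustrative, not measured qGPU/cGPU numbers.
def wait_before_b_runs(a_kernel_ms: float, slice_ms: float | None) -> float:
    """Time container B waits while container A's kernel holds the GPU.

    slice_ms=None models non-preemptive scheduling (cGPU-style): B waits
    until A's kernel finishes on its own. A finite slice models preemptive
    scheduling (qGPU-style): A is forcibly paused at the slice boundary
    and its context is saved.
    """
    if slice_ms is None:
        return a_kernel_ms            # wait for natural completion
    context_switch_ms = 0.01          # ~10 us context save, per the text
    return min(a_kernel_ms, slice_ms) + context_switch_ms

long_kernel = 120.0                   # a 120 ms training kernel
print(f"non-preemptive wait: {wait_before_b_runs(long_kernel, None):.2f} ms")
print(f"preemptive wait:     {wait_before_b_runs(long_kernel, 1.0):.2f} ms")
```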
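And a sketch of the water-level mechanism: a toy user-space allocator with a soft quota, an 80% watermark, and lazy reclamation. The class and its methods are hypothetical stand-ins; the real cGPU interception happens in a kernel module around `cudaMalloc`.

```python
# Sketch: soft-quota accounting with lazy reclamation, modelling in user
# space the behaviour described for cGPU. All names are hypothetical.
class SoftQuotaAllocator:
    def __init__(self, quota_bytes: int, watermark: float = 0.8):
        self.quota = quota_bytes
        self.watermark = watermark
        self.live: dict[int, int] = {}        # handle -> size
        self.lazy_freed: dict[int, int] = {}  # freed, not yet reclaimed
        self.next_handle = 0

    def _used(self) -> int:
        return sum(self.live.values()) + sum(self.lazy_freed.values())

    def malloc(self, size: int) -> int:
        # Above the watermark, reuse lazily-freed pages before new ones.
        if self._used() + size > self.watermark * self.quota:
            self.lazy_freed.clear()           # reclaim marked pages now
        if self._used() + size > self.quota:
            raise MemoryError("container exceeded its VRAM quota")
        self.next_handle += 1
        self.live[self.next_handle] = size
        return self.next_handle

    def free(self, handle: int) -> None:
        # Don't return pages immediately; mark them for priority reuse.
        self.lazy_freed[handle] = self.live.pop(handle)

alloc = SoftQuotaAllocator(quota_bytes=2 * 1024**3)  # 2 GB inference quota
h = alloc.malloc(1536 * 1024**2)   # 1.5 GB
alloc.free(h)                      # marked, not reclaimed
alloc.malloc(1024 * 1024**2)       # crosses 80% -> triggers reclamation
```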
3. Diverging Routes for Hardware Vendors and Cloud Vendors: Scenario-Defined Technology
| Dimension | MIG (Hardware Partitioning) | qGPU/cGPU (Software Time Division) |
|---|---|---|
| Isolation Granularity | Hardware resource slices (SM / VRAM / controllers) | Process-level context + VRAM quotas |
| Failure Impact | Single instance crash does not affect the entire card | Malicious programs may exhaust shared caches |
| Scheduling Flexibility | Fixed number of instances, requires restart for adjustment | Dynamic scaling, effective in seconds |
| Performance Loss | No API hijacking, computation delay < 1% | Kernel hijacking introduces an additional overhead of 5-8% |
| Applicable Scenarios | Strong-isolation training in finance / healthcare | High-density inference / mixed Internet workloads |
Data from a leading cloud vendor: in inference scenarios, container solutions support 3-5 times as many instances per card as MIG, but they require the additional deployment of **cache monitoring systems** (e.g., PMU-based L2 cache occupancy statistics) and scheduling strategies that curb performance interference (e.g., preferentially placing similar containers on the same card).
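A minimal sketch of that packing heuristic follows; workload classes and card capacity are illustrative, and in production the class label would come from PMU counters such as L2 cache occupancy.

```python
# Sketch: "pack similar containers on the same card" to curb cache
# interference. Classes and capacities are illustrative.
from collections import defaultdict

def place(containers: list[tuple[str, str]],
          per_card: int) -> dict[int, list[str]]:
    """containers: (name, workload_class); returns card_id -> names."""
    by_class = defaultdict(list)
    for name, wclass in containers:
        by_class[wclass].append(name)

    cards, card_id = {}, 0
    for wclass in sorted(by_class):           # keep each class together
        batch = by_class[wclass]
        for i in range(0, len(batch), per_card):
            cards[card_id] = batch[i:i + per_card]
            card_id += 1
    return cards

demo = [("rec-1", "bandwidth"), ("rec-2", "bandwidth"),
        ("cv-1", "compute"), ("cv-2", "compute"), ("rec-3", "bandwidth")]
print(place(demo, per_card=2))
```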
4. Virtualization Adaptation of Heterogeneous Architectures: Underlying Differences Between x86 and ARM
- **Limitations of AMD Instinct's SR-IOV:** Although it supports PCIe VFs, the **shared L3 cache** inside the GPU (e.g., 128MB in the MI250) is not partitioned, so the computational tasks of multiple VFs can suffer performance degradation of over 40% from cache contention. One supercomputing center implemented **cache affinity scheduling** (binding VFs running similar tasks to the same CCX), keeping performance fluctuations within 10% (see the placement sketch after this list).
- **Interrupt virtualization optimization on ARM:** Huawei's Ascend AI chips, based on ARMv9, implement fast container-level interrupt switching using **virtualized context registers** in the SVE2 instruction-set extension. Compared with x86's vAPIC, the ARM approach cuts register context-save time from 8μs to 3μs, making it better suited to the rapid response needs of high-frequency inference tasks.
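Here is a sketch of the cache-affinity policy from the first item, assuming a hypothetical domain layout: prefer a cache domain already running the same task type, and fall back to an empty one.

```python
# Sketch: cache-affinity placement for SR-IOV VFs, as in the
# supercomputing-center workaround above. The domain layout is
# hypothetical; the point is to co-locate VFs with similar access
# patterns so dissimilar ones do not thrash a shared L3 slice.
def pick_domain(domains: dict[str, list[str]], task_type: str,
                capacity: int) -> str:
    # Prefer a non-full domain already running this task type...
    for name, tasks in domains.items():
        if tasks and tasks[0] == task_type and len(tasks) < capacity:
            return name
    # ...otherwise fall back to an empty domain.
    for name, tasks in domains.items():
        if not tasks:
            return name
    raise RuntimeError("no cache domain can take this VF")

domains = {"ccd0": ["hpc-stencil"], "ccd1": []}
print(pick_domain(domains, "hpc-stencil", capacity=4))  # -> ccd0
print(pick_domain(domains, "ml-train", capacity=4))     # -> ccd1
```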
5. Evolution Directions of GPU Virtualization
- **Hardware-level time division: time-slice scheduling units.** Nvidia's H200 introduces **MIG 3.0**, which adds **time-slice quota registers**, allowing a computing-power percentage to be assigned to each instance (e.g., instance A 30%, instance B 70%) and achieving microsecond-level task switching through hardware counters. This blurs the line between partitioning and time division, increasing the number of instances supported per card by 50% while preserving hardware isolation (the first sketch after this list models the idea).
- **Software-defined hardware: open virtualization interfaces.** The domestic Biren BR100 chip opens its **VRAM partitioning registers**, allowing cloud vendors to allocate VRAM blocks dynamically through APIs (e.g., 2GB+4GB+8GB) and achieve finer-grained resource allocation in conjunction with container scheduling. One AI company found VRAM allocation efficiency improved by 40% over fixed MIG modes.
- **QoS enhancement for mixed-workload scheduling.** ByteDance's Volcano Engine uses a **GPU water level prediction model** that, based on the CUDA stream features of historical tasks (e.g., parallelism, VRAM access patterns), predicts resource conflicts 50ms in advance and dynamically adjusts container priorities. In mixed training-and-inference scenarios, this reduced inference latency fluctuation by 70% (the second sketch below is a minimal stand-in).
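A software model of what time-slice quota registers would compute, written as classic stride scheduling; the 30%/70% split comes from the example above, and a hardware counter would play the role of the `passes` table.

```python
# Sketch: proportional-share time slicing (stride scheduling), a
# software model of the "time-slice quota register" idea.
def schedule(quotas: dict[str, float], slices: int) -> list[str]:
    """Return the instance chosen for each scheduling slice."""
    passes = {name: 0.0 for name in quotas}
    order = []
    for _ in range(slices):
        # Pick the instance furthest behind on its quota.
        nxt = min(passes, key=lambda n: passes[n])
        order.append(nxt)
        passes[nxt] += 1.0 / quotas[nxt]     # big quota -> small stride
    return order

timeline = schedule({"A": 0.3, "B": 0.7}, slices=10)
print(timeline, "-> A ran", timeline.count("A"), "of 10 slices")
```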
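The Volcano Engine model itself is not public; as a minimal stand-in, this sketch extrapolates recent utilization and demotes training streams before the predicted conflict. Features and thresholds are hypothetical.

```python
# Sketch: predict-then-deprioritise. A linear trend over recent GPU
# utilisation stands in for the production model, which reportedly uses
# CUDA stream features such as parallelism and VRAM access patterns.
from collections import deque

class WaterLevelPredictor:
    def __init__(self, horizon: int = 5, threshold: float = 0.9):
        self.history = deque(maxlen=horizon)  # recent utilisation samples
        self.threshold = threshold

    def observe(self, gpu_util: float) -> None:
        self.history.append(gpu_util)

    def conflict_ahead(self) -> bool:
        """Linear extrapolation of the recent trend, one step ahead."""
        if len(self.history) < 2:
            return False
        trend = self.history[-1] - self.history[0]
        predicted = self.history[-1] + trend / (len(self.history) - 1)
        return predicted > self.threshold

pred = WaterLevelPredictor()
for sample in (0.60, 0.70, 0.78, 0.85, 0.92):  # utilisation ramping up
    pred.observe(sample)
    if pred.conflict_ahead():
        print(f"at {sample:.2f}: demote training streams ahead of time")
```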
6. Practical Implementation of Computing Power Centers: ROI-Oriented Solution Selection
- **Cost-sensitive scenarios: prioritize container virtualization.** One Internet company's inference cluster uses qGPU, deploying 30 container instances per card and saving 60% in hardware costs compared with an MIG solution. This is paired with **self-healing strategies** (e.g., automatic migration when a container OOMs, recovering within 10 seconds) to keep the SLA compliance rate above 99.9% (the first sketch after this list shows such a watcher).
- **Safety-critical scenarios: prioritize hardware isolation.** One bank's model-training platform uses MIG, with each tenant exclusively occupying an MIG instance, plus **memory encryption** (e.g., AES-256) and **operation auditing** (recording the instruction flow of each SM) to meet MLPS 2.0 compliance requirements. Although costs rise by 30%, this mitigates the risk of data leakage.
- **Mixed scenarios: a hierarchical scheduling architecture.** Tencent Cloud's TKE features **three-tier scheduling**: the top layer divides training/inference zones using MIG, the middle layer multiplexes instances through container virtualization, and the bottom layer controls CUDA stream priorities (e.g., giving training tasks low priority to protect inference latency; the second sketch below shows this layer). After adoption, one automotive company's intelligent computing center raised overall computing-power utilization from 45% to 75%.
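A sketch of the self-healing loop from the first item, with a stub in place of the real container runtime; production systems would watch the Kubernetes API or a vendor agent for OOM-killed containers.

```python
# Sketch: migrate OOM-killed GPU containers within a recovery budget.
# The Container type and restart callback are hypothetical stubs.
import time
from dataclasses import dataclass

@dataclass
class Container:
    name: str
    card: int
    exit_reason: str = ""

def self_heal(containers: list[Container], cards: list[int],
              restart, budget_s: float = 10.0) -> None:
    """Move OOM-killed containers to another card within the budget."""
    for c in containers:
        if c.exit_reason != "OOMKilled":
            continue
        started = time.monotonic()
        target = next(card for card in cards if card != c.card)
        restart(c.name, target)               # re-launch off the hot card
        if time.monotonic() - started > budget_s:
            print(f"WARN: {c.name} recovery blew the {budget_s}s budget")

fleet = [Container("rec-7", card=0, exit_reason="OOMKilled")]
self_heal(fleet, cards=[0, 1],
          restart=lambda n, card: print(f"restart {n} on card {card}"))
```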
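The bottom layer can be expressed directly against the CUDA runtime: `cudaDeviceGetStreamPriorityRange` and `cudaStreamCreateWithPriority` are real CUDA runtime APIs, called here via `ctypes` as a sketch. It requires an NVIDIA driver and CUDA runtime to actually run, and error handling is trimmed.

```python
# Sketch: give training streams lower CUDA priority than inference
# streams, using the CUDA runtime via ctypes. The library name may need
# a version suffix (e.g. libcudart.so.12) depending on the install.
import ctypes

cudart = ctypes.CDLL("libcudart.so")

least, greatest = ctypes.c_int(), ctypes.c_int()
# In CUDA, numerically lower values mean higher priority.
cudart.cudaDeviceGetStreamPriorityRange(ctypes.byref(least),
                                        ctypes.byref(greatest))

def make_stream(priority: int) -> ctypes.c_void_p:
    stream = ctypes.c_void_p()
    err = cudart.cudaStreamCreateWithPriority(ctypes.byref(stream),
                                              0, priority)  # flags=default
    assert err == 0, f"cudaStreamCreateWithPriority failed: {err}"
    return stream

inference_stream = make_stream(greatest.value)  # highest priority
training_stream = make_stream(least.value)      # lowest priority: yields
```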
Conclusion: The Essence of GPU Virtualization is “Deconstruction and Reconstruction of Capabilities”
From MIG's rigid hardware partitioning to qGPU's flexible software scheduling, the evolution of GPU virtualization has always revolved around one question: how to maximize the value of every unit of computing power. The future computing power center will deeply integrate hardware slicing (MIG), software time division (containers), and intelligent scheduling (AI strategies). This is not only a technical problem but also a precise matching of business scenarios and cost models. When a cloud vendor doubles the daily revenue per card through a hybrid solution, the commercial value of the technology truly materializes.