GPU Virtualization Solutions and Implementation

This article summarizes the current implementation of the GPU passthrough solution on the 360 cloud platform and the verification of the container + MIG solution.

1. Background

Large AI models are a key strategic goal of 360 Company, and they rely heavily on GPU cards, which are themselves treated as strategic resources. A single physical machine typically carries 8 GPU cards; if whole machines are allocated to users directly, users may not use all of the cards, and the idle ones are wasted. The GPU cards provided to users therefore need to be divided at a finer granularity, which requires per-card allocation and an isolation mechanism.

KVM virtual machines and containers inherently possess the characteristics of resource subdivision and isolation. Thus, the team adopted virtual machines and containers as a means to provide GPU resources to meet user demands.

2. Solution Research


3. Solution Verification and Implementation

This section documents the principles and implementation methods of the GPU passthrough and Docker + MIG solutions.

3.1 GPU Passthrough Solution

Principle Analysis

The GPU passthrough solution offers good GPU compatibility and low performance overhead, and GPU manufacturers charge no additional licensing fees for it, so it is widely used by major cloud providers.

(Figure: performance overhead chart)

IOMMU

The main function of the IOMMU is address translation, which it performs through page tables. The page table records the mapping between guest physical addresses (GPA) and host physical addresses (HPA). The guest sees only GPAs and writes data to GPAs; the translation to HPAs is carried out automatically by the hardware.

Page table translation works as follows: when a device initiates a DMA request, the request carries the device's Source Identifier (Bus, Device, Function). Starting from the table base address held in RTADDR_REG, the IOMMU uses the Bus, Device, and Function numbers to locate the corresponding Context Entry in the Context Table; that entry holds the starting address of the device's page table, which the IOMMU then walks to translate the virtual address in the DMA request into a physical address.


The role of IOMMU:

  1. Establishes the mapping relationship from GPA to HPA, thereby shielding the guest from direct access to physical addresses, achieving physical address isolation.

  2. The IOMMU can map a contiguous virtual address range onto multiple non-contiguous physical memory segments. Without an IOMMU, the physical memory a device accesses via DMA has to be contiguous; the IOMMU removes this restriction.
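On a running host, the group structure the IOMMU imposes can be inspected through sysfs. A minimal sketch, assuming the GPU sits at PCI address 3e:00.0 as in the host-side verification step later in this article:

# Show which IOMMU group the GPU was placed in
readlink /sys/bus/pci/devices/0000:3e:00.0/iommu_group
# List every device in that group (all of them must be passed through together)
ls /sys/bus/pci/devices/0000:3e:00.0/iommu_group/devices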

VFIO Driver

Virtual Function I/O (VFIO) is a modern device passthrough solution that fully utilizes the DMA Remapping and Interrupt Remapping features provided by VT-d/AMD-Vi technology, ensuring the DMA security of passthrough devices while achieving I/O performance close to that of physical devices. User-space processes can directly access hardware using the VFIO driver, and since the entire process is conducted under the protection of IOMMU, it is very secure, allowing even non-privileged users to use it directly. In other words, VFIO is a complete userspace driver solution because it can safely present device I/O, interrupts, DMA, and other capabilities to user space.

To achieve the highest I/O performance, virtual machines require the VFIO passthrough method, since it features low latency and high bandwidth and guests can use the devices' native drivers directly. These characteristics come from VFIO's use of the DMA Remapping and Interrupt Remapping mechanisms provided by VT-d/AMD-Vi: VFIO uses DMA Remapping to establish an independent IOMMU page table for each Domain, limiting a passthrough device's DMA to the address space of its Domain and thereby keeping user-space DMA safe, and it uses Interrupt Remapping to isolate interrupts and Interrupt Posting to deliver interrupts directly to the guest.
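Besides the boot-time configuration shown in the implementation section below, a device can also be handed to vfio-pci at runtime through sysfs. A minimal sketch, assuming the GPU's vendor:device ID is 10de:xxxx (as in the grub configuration below) and its PCI address is 0000:3e:00.0:

# Detach the GPU from whatever driver currently owns it
echo 0000:3e:00.0 > /sys/bus/pci/devices/0000:3e:00.0/driver/unbind
# Ask vfio-pci to claim all devices with this vendor:device ID
echo 10de xxxx > /sys/bus/pci/drivers/vfio-pci/new_id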

Steps to Implement GPU Device Passthrough

1. The physical machine starts, and the BIOS completes the configuration of PCI devices, including initializing the config space and allocating BAR address space;

2. Since the kernel has IOMMU support enabled, it will allocate an IOMMU group for the current device;

3. Load the VFIO driver and associate it with the GPU card;

4. QEMU starts the virtual machine and passes the GPU card device to the virtual machine (see the command-line sketch after this list);

5. The virtual machine starts and initializes its PCI devices according to the PCI topology constructed by QEMU, including configuring the config space and allocating BAR address space;

6. Most accesses are translated through the page table established by the IOMMU (converting GPA to HPA), while accesses to certain special config registers, such as Max Payload, and the virtual machine's in/out instructions are handled through VM exits.
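Step 4 above is what QEMU ultimately does with the vfio-pci device backend. Nova and libvirt generate the full command line, but a stripped-down sketch looks like the following; the memory size and disk image name here are illustrative only, and the GPU is again assumed to sit at 3e:00.0:

qemu-system-x86_64 -enable-kvm -machine q35 -m 8G \
    -device vfio-pci,host=3e:00.0 \
    -drive file=guest.qcow2,format=qcow2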

Solution Implementation

Physical Machine Startup Configuration Adjustments

  • Ubuntu System

Enable IOMMU and bind the GPU card to the VFIO driver

vim /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on vfio-pci.ids=10de:xxxx quiet"

Update grub:

update-grub
  • Longxin System

Enable IOMMU

grubby --update-kernel="/boot/vmlinuz-`uname -r`" --args="intel_iommu=on"

Bind the GPU card to the VFIO driver

Add configuration file:

[root@hpclov2020 ~]# cat /etc/modules-load.d/openstack-gpu.conf
vfio_pci

Add configuration file:

[root@hpclov2020 ~]# cat /etc/modprobe.d/vfio.conf
options vfio-pci ids=10de:xxxx
  • Set Kernel Module Blacklist

Disable NVIDIA and Nouveau drivers

vim /etc/modprobe.d/blacklist.conf
blacklist nouveau
options nouveau modeset=0
blacklist xhci_hcd
blacklist nvidia
blacklist nvidia_modeset
blacklist nvidia_drm
blacklist snd_hda_intel
blacklist nvidiafb
blacklist ast
blacklist drm_kms_helper
blacklist drm_vram_helper
blacklist ttm
blacklist drm
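Because the blacklisted modules may already be packed into the initramfs, the initramfs normally has to be regenerated (followed by a reboot) for the blacklist to take effect; the exact command depends on the distribution:

# Ubuntu
update-initramfs -u
# CentOS / RHEL-style systems
dracut -f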
  • Verify Configuration Effectiveness

Verify IOMMU: dmesg | grep iommu


Verify GPU using VFIO driver:

lspci -nn | grep NVIDIA

lspci -s 3e:00.0 -k | grep driver


OpenStack Side Adjustments

  • nova-api configuration adjustments

Add configuration:

[pci]
alias = {"vendor_id":"10de", "product_id":"xxx", "device_type":"type-PCI", "name":"nvidia-xxx"}
  • nova-compute configuration adjustments

Add configuration:

[pci]
passthrough_whitelist = [{"vendor_id":"10de", "product_id":"xxx"}]
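After adjusting the [pci] alias and passthrough whitelist, the nova services need to be restarted to pick up the new configuration. A sketch assuming systemd-managed services; the actual unit names depend on the distribution and deployment tooling:

# Control node
systemctl restart openstack-nova-api openstack-nova-scheduler
# GPU compute node
systemctl restart openstack-nova-compute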
  • Create Flavors and Verify

Create trait:

openstack trait create CUSTOM_SHARE_GPU_XXX_HOST_LEVEL3

Set trait:

openstack resource provider trait set --trait CUSTOM_SHARE_GPU_XXX_HOST_LEVEL3 2a5769e2-78fa-4a15-8f36-9b82407c4b56

Create flavors:

# Four flavors for XXX
openstack flavor create --vcpus 10 --ram 102400 --ephemeral 800 --property trait:CUSTOM_SHARE_GPU_XXX_HOST_LEVEL3='required' --property pci_passthrough:alias='nvidia-xxx:1' v.xxxgn3i-1x.c10g100-1i
openstack flavor create --vcpus 20 --ram 204800 --ephemeral 1600 --property trait:CUSTOM_SHARE_GPU_XXX_HOST_LEVEL3='required' --property pci_passthrough:alias='nvidia-xxx:2' v.xxxgn3i-2x.c20g200-2i
openstack flavor create --vcpus 40 --ram 409600 --ephemeral 3200 --property trait:CUSTOM_SHARE_GPU_XXX_HOST_LEVEL3='required' --property pci_passthrough:alias='nvidia-xxx:4' v.xxxgn3i-4x.c40g400-4i
openstack flavor create --vcpus 80 --ram 819200 --ephemeral 6500 --property trait:CUSTOM_SHARE_GPU_XXX_HOST_LEVEL3='required' --property pci_passthrough:alias='nvidia-xxx:8' v.xxxgn3i-8x.c80g800-8i

Create virtual machines:

# zzzc2
nova boot --availability-zone nova:hpctur02.aitc.xxx.xxx.net --flavor gpu-flavor --security-groups f9f068b4-f247-4e29-a21d-fc98da18e99f --nic net-id=f0dd296e-dee9-422b-b538-3560fbe145f9 --block-device id=edc2a95c-6e73-4bf8-8891-4bb19d23ca94,source=image,dest=volume,bus=scsi,type=disk,size=50,bootindex=0,shutdown=remove,volume_type=sata test_gpu_08
# zzdt
nova boot --availability-zone nova:hpclov2017.aitc.xxx.xxx.net --flavor v.s2.80c800G.gpu --security-groups a37b6670-0fb6-48ba-90cd-468f0532bbe8 --nic net-id=0d683c7b-6275-4f1a-80da-28a52498d14a --block-device id=f592985e-5b34-4dbf-93ff-c9f86be7596a,source=image,dest=volume,bus=scsi,type=disk,size=50,bootindex=0,shutdown=remove,volume_type=sata gpu_test_01
# shyc2
nova boot --availability-zone nova:hpclov2016.aitc.xxx.xxx.net --flavor v.xxxgn3i-1x.c10g100-1i --security-groups 7730b14e-4f25-4828-939f-f2f094cd2ee9 --nic net-id=6d2395f0-2f95-48c1-ae5d-073b95637e0e --block-device id=2f9e989d-de4e-4e43-a56d-28105a4fd088,source=image,dest=volume,bus=scsi,type=disk,size=50,bootindex=0,shutdown=remove,volume_type=sata hpclov2016v.aitc.xxx.xxx.net

Bind a floating IP:

neutron floatingip-associate ffc8c74b-299d-45e7-9780-19d90e703e69 74a6c1c9-9bff-43cc-8341-572276090fa2

Log in to the virtual machine to verify:

lspci -nn | grep -i nvidia


3.2 Docker + MIG Solution

Solution Implementation

Create MIG Instance

Enable MIG functionality on GPU device 0: nvidia-smi -i 0 -mig 1

Check if MIG is enabled on GPU device 0: nvidia-smi -i 0 --query-gpu=pci.bus_id,mig.mode.current --format=csv

View MIG profiles: nvidia-smi mig -lgip

Create MIG: nvidia-smi mig -cgi 14,14,14,14 -C

View allocated instances: nvidia-smi -L
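The created GPU instances and their compute instances can also be listed individually with the corresponding nvidia-smi mig subcommands:

# List GPU instances
nvidia-smi mig -lgi
# List compute instances
nvidia-smi mig -lci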

Install NVIDIA Driver

APQA Cloud Disk – /foxi/Virtualization/vGPU/ (buduanwang.vip)

yum install dkms
curl -SL https://foxi.buduanwang.vip/pan/foxi/Virtualization/vGPU/NVIDIA-Linux-x86_64-510.85.03-vgpu-kvm.run -O
sh NVIDIA-Linux-x86_64-510.85.03-vgpu-kvm.run --dkms

Install NVIDIA Container Toolkit

# vim /etc/yum.repos.d/nvidia-container-toolkit.repo
[nvidia-container-toolkit]
name=nvidia-container-toolkit
baseurl=https://nvidia.github.io/libnvidia-container/stable/rpm/$basearch
repo_gpgcheck=1
gpgcheck=0
enabled=1
gpgkey=https://nvidia.github.io/libnvidia-container/gpgkey
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt

[nvidia-container-toolkit-experimental]
name=nvidia-container-toolkit-experimental
baseurl=https://nvidia.github.io/libnvidia-container/experimental/rpm/$basearch
repo_gpgcheck=1
gpgcheck=0
enabled=0
gpgkey=https://nvidia.github.io/libnvidia-container/gpgkey
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt

# Install
yum install -y nvidia-container-toolkit

Configure Docker Runtime

nvidia-ctk runtime configure --runtime=docker
systemctl restart docker
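The nvidia-ctk command registers the NVIDIA runtime in Docker's daemon configuration; the generated /etc/docker/daemon.json ends up looking roughly like the snippet below (shown only for reference, the file is written automatically):

{
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}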

Pull CUDA Image

docker pull harbor.qihoo.net/nvidia/cuda:11.4.1-base-centos8

Run Container and View MIG Devices

sudo docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=MIG-a75f333a-f233-5ba4-b4c5-65598abb8f33 harbor.qihoo.net/nvidia/cuda:11.4.1-base-centos8 nvidia-smi
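Because each MIG instance has its own UUID in the nvidia-smi -L output, several containers can share the same physical GPU while staying isolated from one another. The UUID below is a placeholder standing in for another instance listed on this host:

# A second container bound to a different MIG instance (placeholder UUID taken from nvidia-smi -L)
sudo docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx harbor.qihoo.net/nvidia/cuda:11.4.1-base-centos8 nvidia-smi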

4. Summary and Outlook

The 360 Zhihui Cloud platform has successfully implemented the GPU passthrough and Docker + MIG solutions, but the vGPU solution still requires further exploration. Once the vGPU solution is completed, the platform will be able to offer users comprehensive GPU virtualization capabilities.
