This article summarizes the current implementation of the GPU passthrough solution on the 360 cloud platform and the verification of the container + MIG solution.
1. Background
Large AI models are a key strategic focus of 360, and they rely heavily on GPU cards, which are therefore treated as strategic resources. If physical machines are allocated to users directly, a single machine typically carries 8 GPU cards, and users may not use all of them, leading to waste. The GPU cards provided to users therefore need to be divided at a finer granularity: allocation by individual card, with an isolation mechanism.
KVM virtual machines and containers inherently provide resource subdivision and isolation, so the team adopted both as the vehicles for delivering GPU resources to users.
2. Solution Research
3. Solution Verification and Implementation
This section documents the principles and implementation of the GPU passthrough and Docker + MIG solutions.
3.1 GPU Passthrough Solution
Principle Analysis
The GPU passthrough solution offers good GPU compatibility and low performance overhead, and GPU vendors charge no additional licensing fees for it, so it is widely used by major cloud providers.
(Figure: performance overhead comparison)
IOMMU
The main function of the IOMMU is address mapping, which it performs through page tables. A page table records the mapping between guest physical addresses (GPA) and host physical addresses (HPA); the guest only sees and writes to GPAs, and the translation is carried out automatically in hardware.
Page table translation works as follows: when a device initiates a DMA request, the request carries the device's Source Identifier (Bus, Device, Function). The IOMMU uses RTADDR_REG as the base address of the Root Table, locates the corresponding Context Entry in the Context Table via the Bus, Device, and Function numbers, takes that entry as the starting address of the page table, and then walks the page table to translate the address requested by the device into a physical address.
The role of the IOMMU:
- It establishes the GPA-to-HPA mapping, which shields the guest from direct access to physical addresses and achieves physical address isolation.
- It can map contiguous virtual addresses onto multiple non-contiguous segments of physical memory. Without an IOMMU, the physical memory a device accesses via DMA must be contiguous; the IOMMU removes this constraint.
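Once a host boots with IOMMU enabled, the isolation domains (IOMMU groups) it creates can be inspected directly from sysfs; a quick way to list every group and the PCI devices it contains:
# Each symlink under an IOMMU group's devices/ directory is one PCI function
find /sys/kernel/iommu_groups/ -type l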
VFIO Driver
Virtual Function I/O (VFIO) is a modern device passthrough solution that fully utilizes the DMA Remapping and Interrupt Remapping features provided by VT-d/AMD-Vi technology, ensuring the DMA security of passthrough devices while achieving I/O performance close to that of physical devices. User-space processes can directly access hardware using the VFIO driver, and since the entire process is conducted under the protection of IOMMU, it is very secure, allowing even non-privileged users to use it directly. In other words, VFIO is a complete userspace driver solution because it can safely present device I/O, interrupts, DMA, and other capabilities to user space.
To achieve the highest I/O performance, virtual machines require the VFIO passthrough method, as it features low latency and high bandwidth, and guests can also directly use the native drivers of the devices. These excellent characteristics are attributed to VFIO’s application of the DMA Remapping and Interrupt Remapping mechanisms provided by VT-d/AMD-Vi. VFIO uses DMA Remapping to establish independent IOMMU Page Tables for each Domain, limiting the DMA access of passthrough devices to the address space of the Domain, ensuring the security of user-space DMA, and uses Interrupt Remapping to complete interrupt remapping and Interrupt Posting to achieve interrupt isolation and direct interrupt delivery.
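In the implementation below, binding to vfio-pci is configured via kernel parameters and modprobe options, but the same result can also be reached by hand through sysfs. A minimal sketch, assuming the GPU sits at PCI address 0000:3e:00.0 (the address used later in the verification step) and is currently bound to another driver:
# Detach the GPU from its current driver and hand it to vfio-pci
echo 0000:3e:00.0 > /sys/bus/pci/devices/0000:3e:00.0/driver/unbind
echo vfio-pci > /sys/bus/pci/devices/0000:3e:00.0/driver_override
echo 0000:3e:00.0 > /sys/bus/pci/drivers_probe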
Steps to Implement GPU Device Passthrough
1. The physical machine starts, and the BIOS completes the configuration of PCI devices, including initializing the config space and allocating BAR address space;
2. Since the kernel has IOMMU support enabled, it will allocate an IOMMU group for the current device;
3. Load the VFIO driver and associate it with the GPU card;
4. QEMU starts the virtual machine and passes the GPU card through to it (see the simplified QEMU option after this list);
5. The virtual machine boots and initializes the PCI device according to the PCI topology constructed by QEMU, including configuring the config space and allocating BAR address space;
6. Most device accesses are translated through the page tables established by the IOMMU (GPA to HPA), while accesses to a few special config registers, such as Max Payload, and the guest's in/out instructions are handled through VM exits.
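Under the hood, step 4 amounts to libvirt/QEMU attaching the host PCI function to the guest; the QEMU command line that Nova and libvirt ultimately generate contains a fragment along these lines (a simplified sketch, with 3e:00.0 standing in for the real address):
# Relevant fragment of the QEMU command line for passing through host function 3e:00.0
-device vfio-pci,host=3e:00.0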
Solution Implementation
Physical Machine Startup Configuration Adjustments
- Ubuntu System
Enable IOMMU and bind the GPU card to the VFIO driver:
vim /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on vfio-pci.ids=10de:xxxx quiet"
Update the grub configuration:
update-grub
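The new kernel parameters only take effect after a reboot; once the host is back up, they should appear on the kernel command line:
cat /proc/cmdline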
- Longxin System
Enable IOMMU
grubby --update-kernel="/boot/vmlinuz-`uname -r`" --args="intel_iommu=on"
Bind the GPU card to the VFIO driver
Add configuration file:
[root@hpclov2020 ~]# cat /etc/modules-load.d/openstack-gpu.conf
vfio_pci
Add configuration file:
[root@hpclov2020 ~]# cat /etc/modprobe.d/vfio.conf
options vfio-pci ids=10de:xxxx
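These two files take effect at the next boot; to pick up the module immediately without rebooting (assuming vfio_pci is built as a module on this kernel), it can also be loaded by hand:
modprobe vfio_pci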
- Set Kernel Module Blacklist
Disable NVIDIA and Nouveau drivers
vim /etc/modprobe.d/blacklist.conf
blacklist nouveau
options nouveau modeset=0
blacklist xhci_hcd
blacklist nvidia
blacklist nvidia_modeset
blacklist nvidia_drm
blacklist snd_hda_intel
blacklist nvidiafb
blacklist ast
blacklist drm_kms_helper
blacklist drm_vram_helper
blacklist ttm
blacklist drm
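Depending on the distribution, some of these modules may also live in the initramfs, so it is usually worth regenerating it after editing the blacklist (pick the command that matches the system):
# Ubuntu
update-initramfs -u
# RHEL/CentOS-style systems
dracut -f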
- Verify Configuration Effectiveness
Verify IOMMU: dmesg | grep iommu
Verify GPU using VFIO driver:
lspci -nn | grep NVIDIA
lspci -s 3e:00.0 -k | grep driver
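As an extra sanity check that the blacklist above worked, confirm that no host GPU driver was loaded:
# Should print nothing if nouveau/nvidia are properly blacklisted
lsmod | grep -E "nouveau|^nvidia"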
OpenStack Side Adjustments
- nova-api configuration adjustments
Add configuration:
[pci]
alias = {"vendor_id":"10de", "product_id":"xxx", "device_type":"type-PCI", "name":"nvidia-xxx"}
- nova-compute configuration adjustments
Add configuration:
[pci]
passthrough_whitelist = [{"vendor_id":"10de", "product_id":"xxx"}]
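For the new [pci] options to be picked up, the corresponding Nova services have to be restarted; the exact unit names depend on how Nova is deployed, for example:
# Unit names are deployment-specific; adjust as needed
systemctl restart openstack-nova-api
systemctl restart openstack-nova-compute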
- Create Flavors and Verify
Create trait:
openstack trait create CUSTOM_SHARE_GPU_XXX_HOST_LEVEL3
Set trait:
openstack resource provider trait set --trait CUSTOM_SHARE_GPU_XXX_HOST_LEVEL3 2a5769e2-78fa-4a15-8f36-9b82407c4b56
Create flavors:
# Four flavors for XXX
openstack flavor create --vcpus 10 --ram 102400 --ephemeral 800 --property trait:CUSTOM_SHARE_GPU_XXX_HOST_LEVEL3='required' --property pci_passthrough:alias='nvidia-xxx:1' v.xxxgn3i-1x.c10g100-1i
openstack flavor create --vcpus 20 --ram 204800 --ephemeral 1600 --property trait:CUSTOM_SHARE_GPU_XXX_HOST_LEVEL3='required' --property pci_passthrough:alias='nvidia-xxx:2' v.xxxgn3i-2x.c20g200-2i
openstack flavor create --vcpus 40 --ram 409600 --ephemeral 3200 --property trait:CUSTOM_SHARE_GPU_XXX_HOST_LEVEL3='required' --property pci_passthrough:alias='nvidia-xxx:4' v.xxxgn3i-4x.c40g400-4i
openstack flavor create --vcpus 80 --ram 819200 --ephemeral 6500 --property trait:CUSTOM_SHARE_GPU_XXX_HOST_LEVEL3='required' --property pci_passthrough:alias='nvidia-xxx:8' v.xxxgn3i-8x.c80g800-8i
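The resulting flavors can be double-checked to confirm the trait and PCI alias properties were applied, for example:
openstack flavor show v.xxxgn3i-1x.c10g100-1i -c properties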
Create virtual machines:
# zzzc2
nova boot --availability-zone nova:hpctur02.aitc.xxx.xxx.net --flavor gpu-flavor --security-groups f9f068b4-f247-4e29-a21d-fc98da18e99f --nic net-id=f0dd296e-dee9-422b-b538-3560fbe145f9 --block-device id=edc2a95c-6e73-4bf8-8891-4bb19d23ca94,source=image,dest=volume,bus=scsi,type=disk,size=50,bootindex=0,shutdown=remove,volume_type=sata test_gpu_08
# zzdt
nova boot --availability-zone nova:hpclov2017.aitc.xxx.xxx.net --flavor v.s2.80c800G.gpu --security-groups a37b6670-0fb6-48ba-90cd-468f0532bbe8 --nic net-id=0d683c7b-6275-4f1a-80da-28a52498d14a --block-device id=f592985e-5b34-4dbf-93ff-c9f86be7596a,source=image,dest=volume,bus=scsi,type=disk,size=50,bootindex=0,shutdown=remove,volume_type=sata gpu_test_01
# shyc2
nova boot --availability-zone nova:hpclov2016.aitc.xxx.xxx.net --flavor v.xxxgn3i-1x.c10g100-1i --security-groups 7730b14e-4f25-4828-939f-f2f094cd2ee9 --nic net-id=6d2395f0-2f95-48c1-ae5d-073b95637e0e --block-device id=2f9e989d-de4e-4e43-a56d-28105a4fd088,source=image,dest=volume,bus=scsi,type=disk,size=50,bootindex=0,shutdown=remove,volume_type=sata hpclov2016v.aitc.xxx.xxx.net
Bind a floating IP (FIP):
neutron floatingip-associate ffc8c74b-299d-45e7-9780-19d90e703e69 74a6c1c9-9bff-43cc-8341-572276090fa2
Log in to the virtual machine to verify:
lspci -nn | grep -i nvidia
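lspci only shows that the PCI function is visible to the guest; after installing the native NVIDIA driver inside the virtual machine (passthrough lets the guest use the vendor driver directly), the card should also be listed by nvidia-smi:
nvidia-smi -L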
3.2 Docker + MIG Solution
Solution Implementation
Create MIG Instance
Enable MIG functionality on GPU device 0: nvidia-smi -i 0 -mig 1
Check if MIG is enabled on GPU device 0: nvidia-smi -i 0 --query-gpu=pci.bus_id,mig.mode.current --format=csv
View MIG profiles: nvidia-smi mig -lgip
Create MIG: nvidia-smi mig -cgi 14,14,14,14 -C
View allocated instances: nvidia-smi -L
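For reference, when reconfiguring, MIG instances can be torn down in the reverse order, compute instances first and then GPU instances:
# Destroy compute instances, then GPU instances, on GPU 0
nvidia-smi mig -dci -i 0
nvidia-smi mig -dgi -i 0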
Install NVIDIA Driver
Driver package source: APQA Cloud Disk – /foxi/Virtualization/vGPU/ (buduanwang.vip)
yum install dkms
curl -SL https://foxi.buduanwang.vip/pan/foxi/Virtualization/vGPU/NVIDIA-Linux-x86_64-510.85.03-vgpu-kvm.run -O
sh NVIDIA-Linux-x86_64-510.85.03-vgpu-kvm.run --dkms
Install NVIDIA Container Toolkit
# vim /etc/yum.repos.d/nvidia-container-toolkit.repo
[nvidia-container-toolkit]
name=nvidia-container-toolkit
baseurl=https://nvidia.github.io/libnvidia-container/stable/rpm/$basearch
repo_gpgcheck=1
gpgcheck=0
enabled=1
gpgkey=https://nvidia.github.io/libnvidia-container/gpgkey
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt
[nvidia-container-toolkit-experimental]
name=nvidia-container-toolkit-experimental
baseurl=https://nvidia.github.io/libnvidia-container/experimental/rpm/$basearch
repo_gpgcheck=1
gpgcheck=0
enabled=0
gpgkey=https://nvidia.github.io/libnvidia-container/gpgkey
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt
# Install
yum install -y nvidia-container-toolkit
Configure Docker Runtime
nvidia-ctk runtime configure --runtime=docker
systemctl restart docker
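nvidia-ctk writes the nvidia runtime entry into Docker's daemon configuration, which can be checked before (or after) the restart:
# The "nvidia" runtime should now appear under "runtimes"
cat /etc/docker/daemon.json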
Pull CUDA Image
docker pull harbor.qihoo.net/nvidia/cuda:11.4.1-base-centos8
Run Container and View MIG Devices
sudo docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=MIG-a75f333a-f233-5ba4-b4c5-65598abb8f33 harbor.qihoo.net/nvidia/cuda:11.4.1-base-centos8 nvidia-smi
4. Summary and Outlook
The 360 Zhihui Cloud platform has successfully implemented the GPU passthrough and Docker + MIG solutions, but the vGPU solution still requires further exploration. Once the vGPU work is complete, the platform will be able to offer users a full set of GPU virtualization capabilities.