CPU Virtualization Series: VM Entry and Exit

Wang Baisheng, Xie Guangjun

This article is excerpted from the book “In-Depth Exploration of Linux System Virtualization: Principles and Implementation” written by Wang Baisheng and Xie Guangjun, focusing on how the virtual CPU switches between Host mode and Guest mode, and how KVM and the physical CPU save the context of the virtual CPU during this switching.

1

GCC Inline Assembly

The code for entering Guest mode in the KVM module is written using GCC’s inline assembly. To understand this code, we need to briefly introduce the syntax involved in this inline assembly, which has the following basic syntax template:

asm volatile ( assembler template
    : output operands                  /* optional */
    : input operands                   /* optional */
    : list of clobbered registers      /* optional */
    );

(1) Keywords asm and volatile

asm is a GCC keyword indicating that the following code is embedded assembly. If asm conflicts with other names in the program, __asm__ can be used.

volatile is an optional keyword telling GCC not to optimize the assembly code that follows, for example by deleting it or moving it. GCC also accepts the spelling __volatile__.

(2) Assembly Instructions (assembler template)

This part contains the assembly instructions to be embedded. Since this is inline assembly code in C, the commands must be enclosed in double quotes. If multiple lines of assembly instructions are embedded, each instruction occupies one line, and each line of instruction is enclosed in double quotes, ending with the suffix \n\t, where \n is an abbreviation for newline and \t is an abbreviation for tab. Since GCC passes each instruction as a string to the assembler AS, we use \n\t as a separator for each instruction. An example code is as follows:

__asm__ ("movl %eax, %ebx \n\t"
         "movl $56, %esi \n\t"
         "movl %ecx, label(%edx,%ebx,4) \n\t"
         "movb %ah, (%ebx) \n\t");

When using the extended mode, which includes output, input, and clobber list parts, two “%” are needed to reference registers in the assembly instructions, such as %%rax; one “%” is used to reference input and output operands, such as %1, to help GCC distinguish between registers and operands provided by C.

(3) Output Operands

Inline assembly can have zero or more output operands, which indicate that the inline assembly instruction modifies variables in C code. If there are multiple output operands, they are separated by commas. The format for each output operand is:

[[asmSymbolicName]] constraint (cvariablename)

We can specify a name for the output operand asmSymbolicName, which can be referenced in the assembly instruction.

In addition to using names to reference operands, we can also use indices to reference operands. For example, if there are two output operands, %0 can reference the first output operand, %1 can reference the second operand, and so on.

The constraint part of the output operand must be prefixed with “=” or “+”; “=” indicates write-only, while “+” indicates read-write. After the prefix, various constraints can follow, such as “=a” indicating that the result is first output to the rax/eax register, and then the corresponding output variable is updated from the rax/eax register.

cvariablename is the name of the C variable in the code, which needs to be enclosed in parentheses.
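
The output operand forms described above can be sketched as follows. This is a minimal illustration, assuming an x86 or x86-64 target with GCC or Clang (other targets fall back to plain C); the function names increment and load_answer are made up for this example.

```c
/* Assumption: x86/x86-64 with GCC or Clang; other targets use plain C. */
#if defined(__x86_64__) || defined(__i386__)
#define HAVE_X86_ASM 1
#else
#define HAVE_X86_ASM 0
#endif

/* "+r": a read-write operand held in any general-purpose register,
 * referenced in the template by its symbolic name %[counter]. */
int increment(int counter)
{
#if HAVE_X86_ASM
    __asm__("incl %[counter]"
            : [counter] "+r"(counter)
            :                       /* no input operands */
            : "cc");                /* incl modifies the flags */
#else
    counter++;
#endif
    return counter;
}

/* "=a": a write-only operand pinned to eax. The instruction writes
 * %%eax directly and GCC copies eax into `out` afterwards. */
int load_answer(void)
{
    int out;
#if HAVE_X86_ASM
    __asm__("movl $42, %%eax" : "=a"(out));
#else
    out = 42;
#endif
    return out;
}
```

Calling increment(41) returns 42, and load_answer() returns 42.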

(4) Input Operands

Inline assembly can have zero or more input operands, which come from variables or expressions in C code and serve as input to the assembly instruction. The format for each input operand is as follows:

[[asmSymbolicName]] constraint (cexpression)

Similar to output operands, we can specify a name for each input operand asmSymbolicName, which can be referenced in the assembly instruction.

In addition to using names to reference input operands, we can also use indices to reference input operands. The indices for input operands start from the last output operand’s index plus one. For example, if there are two output operands and three input operands, %2 can reference the first input operand, %3 can reference the second operand, and so on.

Input operand constraints must not start with “=” or “+”; apart from that, the constraints are basically the same as for output operands. In addition to register constraints, we will also see the “i” constraint later in the code, indicating that this input operand is an immediate integer.

cexpression is the C variable or expression in the code, which needs to be enclosed in parentheses.
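
The index numbering described above can be sketched with two output operands and three input operands. This is a minimal illustration, assuming an x86 or x86-64 target with GCC or Clang; the function name sum_and_diff_plus_ten is made up, and the “&” (early-clobber) modifier is an addition not covered in the text: it tells GCC that an output is written before all inputs have been read, so the output must not share a register with an input.

```c
/* Assumption: x86/x86-64 with GCC or Clang; other targets use plain C. */
void sum_and_diff_plus_ten(int a, int b, int *sum, int *diff)
{
    int s, d;
#if defined(__x86_64__) || defined(__i386__)
    __asm__("movl %2, %0\n\t"   /* %2 is the first input (a): s = a      */
            "addl %3, %0\n\t"   /* %3 is the second input (b): s = a + b */
            "movl %2, %1\n\t"   /* d = a                                 */
            "subl %3, %1\n\t"   /* d = a - b                             */
            "addl %4, %1\n\t"   /* %4 is the immediate input: d += 10    */
            : "=&r"(s), "=&r"(d)          /* outputs: %0, %1    */
            : "r"(a), "r"(b), "i"(10)     /* inputs: %2, %3, %4 */
            : "cc");                      /* add/sub modify the flags */
#else
    s = a + b;
    d = a - b + 10;
#endif
    *sum = s;
    *diff = d;
}
```

With a = 7 and b = 3, the function stores 10 in *sum and 14 in *diff.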

(5) Clobber List

Some assembly instructions may have side effects that can implicitly affect certain registers or memory values. If the affected registers or memory are not listed in the input or output operands, they need to be included in the clobber list. This way, inline assembly informs GCC that it needs to take care of these affected registers or memory, such as saving the registers before executing the inline assembly instruction and restoring their values afterward if necessary.

Next, let’s look at a specific example. This example is an addition operation, where one operand is val with a value of 100, and the other operand is an immediate value of 400, with the result stored in the variable sum. Because val and sum are 32-bit int variables, the instructions use the l suffix and the 32-bit register names:

int val = 100, sum = 0;

asm ("movl %1, %%eax; \n\t"
     "movl $%c[addend], %%ebx; \n\t"
     "addl %%ebx, %%eax; \n\t"
     "movl %%eax, %0; \n\t"

     : "=r"(sum)
     : "c"(val), [addend]"i"(400)
     : "eax", "ebx", "cc"
     );

Let’s first look at the assembly instruction on line 3. Since the template mixes register references with operand references by index, two “%” are used to reference registers. %1 references the input operand val, and the constraint “c” places val in the rcx/ecx register: before the assembly code executes, the value of val is loaded into ecx, and the instruction then copies ecx into eax.

The assembly instruction on line 4 references addend, the symbolic name of the second input operand, which the “i” constraint declares as an immediate value. The “c” in %c[addend] is a GCC operand modifier that prints the constant without the “$” prefix GCC would normally emit for an immediate, so the template writes the “$” itself. The bare form matters more when the constant serves as the displacement of a memory operand, as in the KVM code later, where %c[rax](%0) expands to an offset followed by a base register.

The instruction on line 5 computes the sum of the ebx and eax registers and stores the result in eax.

On line 6, %0 references the output operand sum, a variable in C code. Since sum is a write-only output operand, its constraint starts with “=”; “r” lets GCC choose a register for it. The instruction therefore stores the computed result into the variable sum.

The assembly code uses the eax and ebx registers (the low halves of rax and rbx), which appear in neither the output nor the input operands, so they must be listed in the clobber list on line 10, together with “cc” because addl modifies the flags. This informs GCC that the assembly code has polluted these registers and that, if necessary, it must save them before executing the inline assembly and restore them afterward.
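
The c operand modifier is most useful when a constant acts as the displacement of a memory operand, which is the form the KVM code later in this article relies on (for example %c[rax](%0)). The following is a minimal sketch, assuming an x86 or x86-64 target with GCC or Clang; struct saved_regs and load_rbx are hypothetical stand-ins and do not reflect KVM's real layout.

```c
#include <stddef.h>

/* Hypothetical register save area, not KVM's real structure. */
struct saved_regs {
    unsigned long rax;
    unsigned long rbx;
};

unsigned long load_rbx(const struct saved_regs *regs)
{
    unsigned long value;
#if defined(__x86_64__) || defined(__i386__)
    /* %c[rbx_off] prints the bare offset with no "$", so the operand
     * becomes offset(base), i.e. base register plus displacement.
     * The "m"(*regs) input only tells GCC that the asm reads the struct. */
    __asm__("mov %c[rbx_off](%[base]), %[value]"
            : [value] "=r"(value)
            : [base] "r"(regs),
              [rbx_off] "i"(offsetof(struct saved_regs, rbx)),
              "m"(*regs));
#else
    value = regs->rbx;   /* plain C fallback on other targets */
#endif
    return value;
}
```

Passing the offset as an “i” operand built from offsetof keeps the assembly independent of the structure layout: if fields are added or reordered, the compiler recomputes the displacement.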

2

VM Entry and Exit and Related Context Saving

Having understood the syntax of inline assembly, we will now explore the inline assembly instructions for VM entry and exit:

static void vmx_vcpu_run(struct kvm_vcpu *vcpu)
{
    struct vcpu_vmx *vmx = to_vmx(vcpu);
    …
    asm(
        /* Store host registers */
        "push %%"R"dx; push %%"R"bp;"
        "push %%"R"cx \n\t"
        "cmp %%"R"sp, %c[host_rsp](%0) \n\t"
        "je 1f \n\t"
        "mov %%"R"sp, %c[host_rsp](%0) \n\t"
        __ex(ASM_VMX_VMWRITE_RSP_RDX) "\n\t"
        "1: \n\t"
        /* Reload cr2 if changed */
        "mov %c[cr2](%0), %%"R"ax \n\t"
        "mov %%cr2, %%"R"dx \n\t"
        "cmp %%"R"ax, %%"R"dx \n\t"
        "je 2f \n\t"
        "mov %%"R"ax, %%cr2 \n\t"
        "2: \n\t"
        /* Check if vmlaunch of vmresume is needed */
        "cmpl $0, %c[launched](%0) \n\t"
        /* Load guest registers.  Don't clobber flags. */
        "mov %c[rax](%0), %%"R"ax \n\t"
        "mov %c[rbx](%0), %%"R"bx \n\t"
        …
        "mov %c[rcx](%0), %%"R"cx \n\t" /* kills %0 (ecx) */

        /* Enter guest mode */
        "jne .Llaunched \n\t"
        __ex(ASM_VMX_VMLAUNCH) "\n\t"
        "jmp .Lkvm_vmx_return \n\t"
        ".Llaunched: " __ex(ASM_VMX_VMRESUME) "\n\t"
        ".Lkvm_vmx_return: "
        /* Save guest registers, load host registers, keep …*/
        "xchg %0,     (%%"R"sp) \n\t"
        "mov %%"R"ax, %c[rax](%0) \n\t"
        "mov %%"R"bx, %c[rbx](%0) \n\t"
        "pop"Q" %c[rcx](%0) \n\t"
        "mov %%"R"dx, %c[rdx](%0) \n\t"
        …
        "mov %%cr2, %%"R"ax   \n\t"
        "mov %%"R"ax, %c[cr2](%0) \n\t"

        "pop  %%"R"bp; pop  %%"R"dx \n\t"
        "setbe %c[fail](%0) \n\t"
          : : "c"(vmx), "d"((unsigned long)HOST_RSP),
        [launched]"i"(offsetof(struct vcpu_vmx, launched)),
        [fail]"i"(offsetof(struct vcpu_vmx, fail)),
        [host_rsp]"i"(offsetof(struct vcpu_vmx, host_rsp)),
        [rax]"i"(offsetof(struct vcpu_vmx,
                    vcpu.arch.regs[VCPU_REGS_RAX])),
        [rbx]"i"(offsetof(struct vcpu_vmx,
                    vcpu.arch.regs[VCPU_REGS_RBX])),
        …
        [cr2]"i"(offsetof(struct vcpu_vmx, vcpu.arch.cr2))
          : "cc", "memory"
        , R"ax", R"bx", R"di", R"si"
#ifdef CONFIG_X86_64
        , "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"
#endif
          );
    …
}

When the CPU switches from Host mode to Guest mode, it does not automatically save certain registers, typically the general-purpose registers. Therefore, line 7 of the code (push rdx; push rbp) saves Host general-purpose registers onto the stack. When a VM exit occurs, KVM restores these saved Host registers from the stack to the physical CPU registers. The macro R expands to r on 64-bit and e on 32-bit, so a single piece of code concisely supports both.

Readers may wonder why only these two registers are saved. In fact, the initial implementation of KVM pushed all general-purpose registers onto the stack. Later, KVM switched to the clobber list of GCC inline assembly: every register that the inline assembly may affect is listed there, and GCC itself takes responsibility for saving and restoring those registers. Lines 57-61 of the code are the clobber list. Two registers are special: rdx/edx and rbp/ebp. The rdx/edx register is reserved by GCC’s regparm feature and cannot be placed in the clobber list (in this listing it also serves as the "d" input operand on line 47), and listing rbp/ebp in the clobber list has no effect, so KVM saves these two registers manually.

Additionally, KVM saves the rcx/ecx register on line 8 of the code. This register has a special mission. When exiting from Guest to Host, the CPU does not automatically save some of the Guest’s registers, typically the general-purpose registers, so KVM manually saves them into a substructure of the vcpu_vmx structure. At the moment of exiting the Guest, KVM must therefore first obtain the vcpu_vmx instance, the variable vmx on line 3 of the code, and save the CPU register state into it. Only after the Guest’s state has been saved can other operations run; otherwise they could corrupt that state.

Thus, just before each entry into the Guest, KVM pushes the address of vmx onto the stack, and when exiting the Guest it immediately retrieves vmx from the top of the stack. How is vmx pushed onto the stack? Refer to line 47 of the code, where the input constraint "c"(vmx) tells the compiler to load the variable vmx into the rcx/ecx register before executing the assembly code. Therefore line 8, which pushes the contents of rcx/ecx, actually pushes vmx onto the top of the stack.

When exiting the Guest, the CPU automatically restores the Host rsp/esp stored in the VMCS to the physical rsp/esp register, so the VCPU thread’s Host-mode stack is accessible. The first line of code after exiting the Guest, line 36, uses the xchg instruction to exchange the value at the top of the stack with the operand %0. According to line 47, %0 is the variable vmx, held in rcx/ecx. This line therefore restores the address of vmx, saved at the top of the stack before entering the Guest, to rcx/ecx, and since %0 refers to that register, the Guest’s registers can be saved through %0.

Readers may ask: since the Guest does not use the variable vmx and cannot corrupt it, can the Host not simply use this variable directly? From a low-level perspective, GCC typically references a variable such as vmx through the stack frame base pointer rbp/ebp or through a register. However, at the first moment after exiting the Guest, apart from a few special registers, the general-purpose registers still hold Guest state, so vmx cannot be reached through rbp/ebp plus an offset. Since the CPU automatically restores the Host’s stack pointer when exiting the Guest, KVM cleverly exploits this and uses the top of the stack to save vmx. Then, by exchanging the top of the stack with rcx/ecx, it both recovers vmx in rcx/ecx and saves the Guest’s rcx/ecx onto the stack.

Having obtained the address where the Guest state is saved, KVM now saves the Guest state, as seen in lines 37-43 of the code. Because line 36 left the Guest’s rcx/ecx at the top of the stack, line 39 pops it from the stack into the corresponding slot of the Guest state save area.

The Host’s stack does not necessarily change between a Guest exit and the next entry, so the Host rsp/esp field in the VMCS does not need to be rewritten every time. It is updated only when rsp/esp has changed, which avoids unnecessary VMCS writes. KVM therefore records host_rsp in the VCPU and compares it with the current rsp/esp, as seen in lines 9-13 of the code. The instruction to write the Host’s rsp/esp into the VMCS is:

ASM_VMX_VMWRITE_RSP_RDX

The instruction to write to the VMCS has two parameters: one indicates which field in the VMCS to write to, and the other is the value to be written. rsp/esp is easy to understand, indicating that the value to be written is in the rsp/esp register. So what is rdx? Refer to line 47 of the code for the constraint on the rdx/edx registers:

"d"((unsigned long)HOST_RSP)

Combining with the definition of the macro HOST_RSP:

/* VMCS Encodings */
enum vmcs_field {
    …
    HOST_RSP                        = 0x00006c14,
    …
};

It can be seen that ASM_VMX_VMWRITE_RSP_RDX writes the value of rsp/esp into the VMCS’s Host rsp field.
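
The compare-then-write logic around ASM_VMX_VMWRITE_RSP_RDX can be modeled in plain C. This is only a sketch: a real VMWRITE is a privileged instruction, so the hypothetical model_vmwrite below merely records the value and counts calls, and struct vcpu_model is an illustrative stand-in rather than KVM's vcpu_vmx.

```c
#include <stdint.h>

#define HOST_RSP 0x00006c14u   /* VMCS field encoding quoted in the text */

/* Hypothetical model, not KVM's real data structure. */
struct vcpu_model {
    uintptr_t host_rsp;        /* cached copy of the last value written */
    uintptr_t vmcs_host_rsp;   /* simulated VMCS Host rsp field         */
    unsigned  vmwrite_count;   /* how many simulated VMWRITEs happened  */
};

/* Stand-in for VMWRITE: `field` selects the VMCS field, `value` is stored. */
static void model_vmwrite(struct vcpu_model *v, uint32_t field, uintptr_t value)
{
    if (field == HOST_RSP)
        v->vmcs_host_rsp = value;
    v->vmwrite_count++;
}

/* Mirrors: cmp rsp, host_rsp; je 1f; mov rsp, host_rsp; vmwrite; 1: */
void sync_host_rsp(struct vcpu_model *v, uintptr_t current_rsp)
{
    if (v->host_rsp == current_rsp)
        return;                       /* unchanged: skip the VMCS write */
    v->host_rsp = current_rsp;
    model_vmwrite(v, HOST_RSP, current_rsp);
}
```

Calling sync_host_rsp twice with the same stack pointer performs only one simulated write, which is the saving the cached host_rsp field is there to achieve.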

The VMX architecture does not have the CPU save and restore the cr2 register automatically, yet the Host may change the Guest’s cr2 value, as shown in the following code:

commit 1c696d0e1b7c10e1e8b34cb6c797329e3c33f262
KVM: VMX: Simplify saving guest rcx in vmx_vcpu_run
linux.git/arch/x86/kvm/x86.c

void kvm_inject_page_fault(struct kvm_vcpu *vcpu, …)
{
    ++vcpu->stat.pf_guest;
    vcpu->arch.cr2 = fault->address;
    kvm_queue_exception_e(vcpu, PF_VECTOR, fault->error_code);
}

Therefore, before entering the Guest, KVM checks whether the physical CPU’s cr2 register is the same as the Guest’s cr2 register saved in the VCPU. If they are different, it needs to use the Guest’s cr2 register to update the physical CPU’s cr2 register, as seen in lines 14-20 of the code. However, in most cases, the value of the cr2 register does not change from the Guest exit to the next entry into the Guest. On the other hand, loading the cr2 register is costly, so it only needs to be reloaded when the cr2 register changes.

Some Guest exits are caused by page faults, such as accessing I/O devices via MMIO, and the address of the page fault is recorded in the cr2 register. Therefore, when the Guest exits, KVM needs to save the Guest’s cr2, as seen in lines 42-43 of the code. Due to instruction format limitations, the mov instruction does not support copying from control registers to memory addresses, so it needs to be done via the rax/eax registers.

Before entering the Guest, in addition to loading the cr2 register, it is also necessary to load those general-purpose registers that the physical CPU does not automatically load, as seen in lines 24-27 of the code.

Considering that xchg is an atomic operation that locks the address bus, KVM later abandoned this instruction to design a new scheme. KVM allocates a position on the VCPU’s stack for the Guest’s rcx/ecx registers. Thus, when the Guest exits, before using the rcx/ecx registers to reference the variable vmx, it can temporarily save the Guest’s rcx/ecx registers to the reserved position on the VCPU’s stack:

commit 40712faeb84dacfcb3925a88231daa08b3624d34
KVM: VMX: Avoid atomic operation in vmx_vcpu_run
linux.git/arch/x86/kvm/vmx.c

static void vmx_vcpu_run(struct kvm_vcpu *vcpu)
{
    …
    asm(
        /* Store host registers */
        "push %%"R"dx; push %%"R"bp;"
        "push %%"R"cx \n\t" /* placeholder for guest rcx */
        "push %%"R"cx \n\t"
        …
        ".Lkvm_vmx_return: "
        /* Save guest registers, load host registers, …*/
        "mov %0, %c[wordsize](%%"R"sp) \n\t"
        "pop %0 \n\t"
        "mov %%"R"ax, %c[rax](%0) \n\t"
        "mov %%"R"bx, %c[rbx](%0) \n\t"
        "pop"Q" %c[rcx](%0) \n\t"
    …
        [wordsize]"i"(sizeof(ulong))
    …
}

Line 7 of the code is where KVM reserves space for the Guest’s rcx/ecx registers on the stack, and line 8 of the code pushes the variable vmx onto the stack.

At the moment of exiting the Guest, the CPU’s rcx/ecx registers store the state of the Guest, so before using the rcx/ecx registers, the Guest’s state needs to be saved. The saving position is the one reserved for it on the stack, which is the next position after the top of the stack, as seen in line 12 of the code, which is the top of the stack plus an offset of one word.

After saving the Guest’s value, the rcx/ecx registers can be used. Line 13 of the code pops the value at the top of the stack, which is vmx, into the rcx/ecx registers. After popping vmx from the top of the stack, the next value is the Guest’s rcx/ecx registers, so line 16 of the code saves the Guest’s rcx/ecx registers into the corresponding register array in the VCPU structure.
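
Both exit sequences, the earlier xchg-based one and the later mov/pop one, can be traced with a small plain C model of the stack and the rcx register. This is a sketch under stated assumptions: struct model, push, pop, and the two function names are hypothetical, and the model tracks only rcx, not real CPU state.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the host stack and the one register that matters. */
struct model {
    uintptr_t stack[8];
    int       sp;          /* index of the top of stack; grows downward */
    uintptr_t rcx;         /* the physical rcx register                 */
    uintptr_t saved_rcx;   /* stands in for the Guest rcx save slot     */
};

static void push(struct model *m, uintptr_t v)
{
    assert(m->sp > 0);
    m->stack[--m->sp] = v;
}

static uintptr_t pop(struct model *m)
{
    assert(m->sp < 8);
    return m->stack[m->sp++];
}

/* Earlier scheme: push vmx; guest runs; xchg rcx, (rsp); pop save slot. */
uintptr_t exit_with_xchg(uintptr_t vmx, uintptr_t guest_rcx)
{
    struct model m = { .sp = 8, .rcx = vmx };
    push(&m, m.rcx);                 /* push rcx, which holds vmx       */
    m.rcx = guest_rcx;               /* the guest ran and changed rcx   */
    uintptr_t tmp = m.stack[m.sp];   /* xchg %0, (rsp)                  */
    m.stack[m.sp] = m.rcx;
    m.rcx = tmp;
    assert(m.rcx == vmx);            /* rcx references vmx again        */
    m.saved_rcx = pop(&m);           /* pop guest rcx into save slot    */
    return m.saved_rcx;
}

/* Later scheme: reserve a slot, push vmx; mov rcx, wordsize(rsp); pop; pop. */
uintptr_t exit_with_mov_pop(uintptr_t vmx, uintptr_t guest_rcx)
{
    struct model m = { .sp = 8, .rcx = vmx };
    push(&m, m.rcx);                 /* placeholder for guest rcx       */
    push(&m, m.rcx);                 /* vmx                             */
    m.rcx = guest_rcx;               /* the guest ran and changed rcx   */
    m.stack[m.sp + 1] = m.rcx;       /* mov %0, wordsize(rsp)           */
    m.rcx = pop(&m);                 /* pop %0: rcx holds vmx again     */
    assert(m.rcx == vmx);
    m.saved_rcx = pop(&m);           /* pop guest rcx into save slot    */
    return m.saved_rcx;
}
```

Both functions return the Guest's rcx value and leave vmx in rcx; the second spends one extra stack slot to avoid the implicitly locked xchg with a memory operand.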

Author Biography:

Wang Baisheng

Senior technical expert, previously worked at the Institute of Software, Chinese Academy of Sciences, Red Flag Linux, and Baidu, currently serves as Chief Architect at Baidu. He has many years of rich practical experience in operating systems, virtualization technology, distributed systems, cloud computing, and autonomous driving.

Author of the bestselling book “In-Depth Exploration of the Linux Operating System” (published in 2013).

Xie Guangjun

PhD in Computer Science, graduated from the Computer Department of Nankai University.

Senior technical expert with many years of experience in the IT industry. Currently serves as Vice President of Baidu Intelligent Cloud, responsible for the R&D of cloud computing-related products. He has been engaged in R&D work in operating systems, virtualization technology, distributed systems, big data, and cloud computing for many years, with rich practical experience.

*This article is published with the authorization of the publisher. For more content on virtualization technology, we recommend reading “In-Depth Exploration of Linux System Virtualization: Principles and Implementation”.
