How C Code Maps to Assembly: Revealing the Essence of Function Calls

Recently, NetEase Cloud Classroom opened a course called Linux Kernel Analysis. I have always been interested in operating systems and in how computers really work, so I took a look. In the first class, the teacher asked students to write a blog post about the lesson as an assignment, a novel format that surprised me. It is also a good opportunity to digest the material. This article compiles a piece of C code into assembly and presents my analysis and thoughts.

First, let’s list some CPU registers and basic assembly knowledge that will be involved:

● The names of CPU registers differ for 16-bit, 32-bit, and 64-bit architectures. For example, the instruction pointer register is called ip in 16-bit, eip in 32-bit, and rip in 64-bit.

● In the AT&T syntax that gcc emits, an instruction suffix gives the operand size. The suffix ‘l’ (long) means a 32-bit operand, so movl is simply mov applied to 32-bit values.

● ebp: Stack base pointer register. It holds the base address of the stack frame of the function that is currently executing.

● esp: Stack top pointer register. It holds the address of the current top of the stack.

● eip: Instruction pointer register. It holds the address of the next instruction to execute. The CPU keeps fetching and executing the instruction that eip points to, advancing eip to the following instruction each time. eip cannot be assigned a value directly; only instructions such as call, ret, and jmp change it.

● % prefixes a register name, and $ prefixes an immediate value. movl $8, %eax means to store the immediate value 8 into eax.

● () is used for indirect memory addressing. For example, movl $10, (%esp) means to save the immediate value 10 at the memory address held in esp.

● 8(%ebp) means the memory at the address obtained by adding 8 to the value held in ebp.

● The stack grows downward, meaning the stack top moves from high addresses toward low addresses.
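To make the addressing notation concrete, here is a minimal C sketch that models memory as a byte array and registers as plain variables. The names mem, store32, load32, and addressing_demo are hypothetical helpers invented for this illustration, and the starting register values are made up.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model: 64 bytes of memory; registers hold byte offsets into it. */
static uint8_t mem[64];

static void store32(uint32_t addr, uint32_t value)
{
    assert(addr + 4 <= sizeof mem);
    memcpy(&mem[addr], &value, 4);
}

static uint32_t load32(uint32_t addr)
{
    uint32_t value;
    assert(addr + 4 <= sizeof mem);
    memcpy(&value, &mem[addr], 4);
    return value;
}

/* Returns the value that ends up in the modeled eax. */
uint32_t addressing_demo(void)
{
    uint32_t eax;
    uint32_t esp = 32, ebp = 24;   /* made-up starting values */

    eax = 8;                 /* movl $8, %eax      : immediate into a register    */
    store32(esp, 10);        /* movl $10, (%esp)   : immediate into memory at esp */
    eax = load32(ebp + 8);   /* movl 8(%ebp), %eax : load from address ebp + 8    */
    return eax;              /* ebp + 8 equals esp here, so this is the 10 stored above */
}
```

Calling addressing_demo() returns 10, because the made-up values were chosen so that ebp + 8 and esp name the same address.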

Preparation Work

Prepare a piece of C code:

int g(int x)
{
    return x + 5;
}

int f(int x)
{
    return g(x);
}

int main(void)
{
    return f(10) + 1;
}


Compile into Assembly Code

Use the following command to compile the above C code (-S stops after generating assembly, and -m32 produces 32-bit x86 code):

gcc -S -o main.s main.c -m32

After removing the unnecessary parts, such as assembler directives, we get the following assembly code:

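g:
    pushl %ebp              # save the caller's stack base address
    movl %esp, %ebp         # start g's own stack frame
    movl 8(%ebp), %eax      # eax = x
    addl $5, %eax           # eax = x + 5 (the return value)
    popl %ebp               # restore the caller's stack base address
    ret

f:
    pushl %ebp
    movl %esp, %ebp
    subl $4, %esp           # make room for one parameter
    movl 8(%ebp), %eax      # eax = x
    movl %eax, (%esp)       # store x as the parameter for g
    call g
    leave
    ret

main:
    pushl %ebp
    movl %esp, %ebp
    subl $4, %esp           # make room for one parameter
    movl $10, (%esp)        # store 10 as the parameter for f
    call f
    addl $1, %eax           # eax = f(10) + 1
    leave
    ret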

Analysis

I will skip the detailed step-by-step walkthrough here, since the teacher explained it very thoroughly in class. The main goal is to reflect and summarize.

First, we see that three C functions correspond to three parts of assembly code, separated by function names as labels:

int g(int x) -> g:

int f(int x) -> f:

int main(void) -> main:

We know that a C program starts executing from the main function. When the program is loaded and run, the assembly code above is placed in some area of memory and the CPU registers are initialized. The most important of these is eip, which holds the address of the next instruction to execute. When main starts, eip points to the first instruction under the main label:

main:

eip -> pushl %ebp

The program begins execution…

Let’s take a look at these two instructions first:

pushl %ebp

movl %esp, %ebp

Looking at the entire listing, you may notice that not only main but also f and g begin with these two instructions. They save the current stack base address onto the stack and then set the base address to the current top of the stack. In effect, they preserve the old base address and start a new stack frame. Since functions can call other functions, the "current" base address at this point is actually the stack base address of the calling function. For example, in f these two instructions save the stack base address of main.

Next, analyze these two instructions:

subl $4, %esp

movl $10, (%esp)

Comparing with the C code, it is easy to see that this pushes a parameter onto the stack: esp is lowered by 4 bytes to make room, and the immediate value 10 is stored at the new top of the stack (the memory address held in esp). The effect is the same as pushl $10. The f function contains similar statements:

subl $4, %esp

movl 8(%ebp), %eax

movl %eax, (%esp)

Thus, we can conclude that before a function is called, its parameters are pushed onto the stack one by one. Based on my testing, they are pushed from right to left, which matches the cdecl calling convention that gcc uses by default on 32-bit x86.
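The example program only has single-parameter functions, so the right-to-left order is not visible in it. As a rough sketch, the following C code models a caller preparing a hypothetical two-parameter call h(a, b). The stack array, the push helper, and slot_above_esp are invented for this illustration; each array element stands for one 4-byte stack slot.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t stack[16];   /* one element per 4-byte stack slot */
static uint32_t esp;         /* index of the top slot; 16 means empty */

static void push(uint32_t value)
{
    assert(esp > 0);
    stack[--esp] = value;    /* the stack grows toward lower indices */
}

/* Pushes the parameters of a hypothetical h(a, b) from right to left and
   returns the slot that lies n slots (n * 4 bytes) above esp. */
uint32_t slot_above_esp(uint32_t a, uint32_t b, uint32_t n)
{
    esp = 16;                /* start with an empty stack */
    push(b);                 /* rightmost parameter first */
    push(a);                 /* leftmost parameter last, so it ends up on top */
    assert(esp + n < 16);
    return stack[esp + n];
}
```

With this order, slot_above_esp(1, 2, 0) yields the first parameter and slot_above_esp(1, 2, 1) yields the second, so the called function finds its parameters at increasing addresses in the order they are declared.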

Then the call instruction is executed, jumping to the f function. The call instruction is roughly equivalent to the following pseudocode (these are not real instructions, because eip cannot be named as an operand):

pushl %eip

movl $f, %eip

By the time call executes, eip already holds the address of the instruction that follows it. So the address of the instruction after the call is pushed onto the stack, and eip is set to the address of the first instruction of the target function. This is clearly necessary: when the called function ends, execution must continue in the calling function, so the address of the next instruction has to be saved; otherwise there would be no way to find it when returning.
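The effect of call can be sketched in C with a tiny machine model. The Cpu struct, do_call, and call_demo are hypothetical names invented here, and the addresses 0x100, 0x105, and 0x200 are made-up values, not taken from a real binary.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t eip;         /* address of the next instruction to execute */
    uint32_t esp;         /* index of the top slot in stack[]; 16 means empty */
    uint32_t stack[16];   /* one element per 4-byte stack slot */
} Cpu;

/* Models "call target": `next` is the address of the instruction after the call. */
void do_call(Cpu *cpu, uint32_t next, uint32_t target)
{
    assert(cpu->esp > 0);
    cpu->esp -= 1;                  /* the stack grows downward */
    cpu->stack[cpu->esp] = next;    /* push the return address */
    cpu->eip = target;              /* jump to the first instruction of the target */
}

/* Returns 1 if the modeled call left the machine in the expected state. */
int call_demo(void)
{
    Cpu cpu = { .eip = 0x100, .esp = 16 };   /* made-up starting state */

    do_call(&cpu, 0x105, 0x200);
    return cpu.esp == 15 && cpu.stack[15] == 0x105 && cpu.eip == 0x200;
}
```

The model uses slot indices rather than byte addresses to keep the arithmetic simple; one slot corresponds to the 4 bytes that a 32-bit return address occupies.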

Arriving at the f function, the first step is to save the stack base address of the main function, and then it needs to call the g function, so the parameters need to be pushed onto the stack first:

subl $4, %esp

movl 8(%ebp), %eax

movl %eax, (%esp)

Here, let’s think carefully about how the f function gets the parameter passed from the main function. We see:

movl 8(%ebp), %eax

Why is the parameter read from 8(%ebp)? 8(%ebp) refers to the memory 8 bytes above the address held in ebp, that is, toward higher addresses, where the older stack contents live. Why is it 8 bytes?

Recall that in the main function, after completing the parameter push onto the stack, two things were done:

1. Due to the effect of the call f instruction, the address of the next instruction after call f was pushed onto the stack, which occupies 4 bytes.

2. Upon entering the f function, the stack base address of the main function was immediately pushed onto the stack, and ebp was adjusted to point to the top of the stack (esp), which also occupies 4 bytes.

Thus, by using 8(%ebp), we can find the value of the first integer parameter from the previous function.
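The 8-byte offset can be checked with a small C model of the stack. This is only a sketch: the stack array, the push helper, arg_seen_by_f, and the return address 0x1234 are all invented for illustration, and esp and ebp are modeled as byte offsets into the array.

```c
#include <assert.h>
#include <stdint.h>

#define SLOTS 16

static uint32_t stack[SLOTS];   /* one element per 4-byte stack slot */
static uint32_t esp, ebp;       /* byte offsets into the modeled stack */

static void push(uint32_t value)
{
    assert(esp >= 4);
    esp -= 4;                   /* the stack grows toward lower addresses */
    stack[esp / 4] = value;
}

/* Replays main's call to f up to the end of f's first two instructions
   and returns what f would read from 8(%ebp). */
uint32_t arg_seen_by_f(void)
{
    esp = SLOTS * 4;            /* empty stack */
    ebp = esp;                  /* stand-in for main's stack base address */

    push(10);                   /* main: subl $4, %esp ; movl $10, (%esp) */
    push(0x1234);               /* call f: return address (made-up value) */
    push(ebp);                  /* f: pushl %ebp */
    ebp = esp;                  /* f: movl %esp, %ebp */

    /* 8(%ebp): skip the saved ebp (4 bytes) and the return address (4 bytes). */
    return stack[(ebp + 8) / 4];
}
```

In this model arg_seen_by_f() returns 10, the value main stored before the call.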

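The stack as seen from inside f, right after its first two instructions (higher addresses first):

8(%ebp): the parameter 10, stored by main

4(%ebp): the return address, pushed by call f

0(%ebp): the stack base address of main, pushed by f (ebp and esp both point here)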

After observing the process of entering and calling functions, let’s look at how functions exit. It is not difficult to see that both main and f use the following instructions to exit:

leave

ret

The leave instruction is equivalent to:

movl %ebp, %esp

popl %ebp

● The first instruction resets esp to ebp, which can be understood as clearing the stack used by the current function.

● The second instruction assigns the top value of the stack to ebp and pops it off. What is the top value of the stack? From the analysis above, it is not difficult to see that the current top value of the stack is actually the stack base address of the previous function, so the second instruction means restoring ebp to the stack base address of the previous function.

Next, the ret instruction is equivalent to restoring the instruction pointer:

popl %eip

Why doesn’t the g function use leave? g declares no local variables and calls no other function, so esp never moves after the first two instructions and is still equal to ebp when g finishes. The movl %ebp, %esp half of leave would therefore do nothing, and the compiler emits only popl %ebp.
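The exit sequence can also be modeled in C. In this sketch, Cpu, push, pop, do_leave, do_ret, and leave_ret_demo are hypothetical names, the starting register values and the return address 0x105 are made up, and esp and ebp are slot indices into the modeled stack.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t eip, esp, ebp;   /* esp and ebp are slot indices into stack[] */
    uint32_t stack[16];       /* one element per 4-byte stack slot */
} Cpu;

static void push(Cpu *c, uint32_t value)
{
    assert(c->esp > 0);
    c->stack[--c->esp] = value;
}

static uint32_t pop(Cpu *c)
{
    assert(c->esp < 16);
    return c->stack[c->esp++];
}

/* leave: movl %ebp, %esp followed by popl %ebp */
void do_leave(Cpu *c) { c->esp = c->ebp; c->ebp = pop(c); }

/* ret: pop the saved return address into eip */
void do_ret(Cpu *c) { c->eip = pop(c); }

/* Builds a frame the way f does, tears it down again, and returns 1 if the
   caller's esp and ebp are restored and eip holds the return address. */
int leave_ret_demo(void)
{
    Cpu c = { .eip = 0x200, .esp = 12, .ebp = 14 };   /* made-up caller state */
    uint32_t caller_esp = c.esp, caller_ebp = c.ebp;

    push(&c, 0x105);     /* call: return address (made-up value) */
    push(&c, c.ebp);     /* pushl %ebp */
    c.ebp = c.esp;       /* movl %esp, %ebp */
    c.esp -= 1;          /* subl $4, %esp */

    do_leave(&c);
    do_ret(&c);
    return c.eip == 0x105 && c.esp == caller_esp && c.ebp == caller_ebp;
}
```

If the subl $4, %esp step is removed from the model, esp already equals ebp before do_leave, which is the situation in g where a plain popl %ebp is enough.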

Summary

Finally, through this example, let’s summarize the process of function calls:

Entering a function:

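1. The current stack base address is pushed onto the stack (the current stack base address is actually the stack base address of the previous function).

2. ebp is set to the current top of the stack (esp), which starts the stack frame of the new function.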

Calling other functions:

1. Parameters are pushed onto the stack from right to left.

2. The address of the next instruction is pushed onto the stack.

Exiting a function:

1. The top of the stack (esp) is reset to the ebp of the current function.

2. The base address is restored to the base address of the previous function.

3. eip is restored to the address of the next instruction to be executed in the previous function.
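Putting the pieces together, the whole example can be replayed in C with one function per assembly label. This is a simplified sketch under stated assumptions: sim_g, sim_f, sim_main, call, and run_program are invented names, registers are modeled as globals holding slot indices, and the return address is a placeholder 0 because the actual jump back is handled by C itself.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t stack[32];   /* one element per 4-byte stack slot */
static uint32_t esp, ebp;    /* slot indices into stack[] */
static uint32_t eax;         /* holds the return value */

static void push(uint32_t v) { assert(esp > 0); stack[--esp] = v; }
static uint32_t pop(void)    { assert(esp < 32); return stack[esp++]; }

/* call + ret: reserve and release the return-address slot around the callee. */
static void call(void (*target)(void))
{
    push(0);             /* placeholder return address */
    target();
    (void)pop();         /* ret pops the return address */
}

static void sim_g(void)
{
    push(ebp);                 /* pushl %ebp */
    ebp = esp;                 /* movl %esp, %ebp */
    eax = stack[ebp + 2];      /* movl 8(%ebp), %eax (8 bytes = 2 slots) */
    eax += 5;                  /* addl $5, %eax */
    ebp = pop();               /* popl %ebp */
}

static void sim_f(void)
{
    push(ebp);                 /* pushl %ebp */
    ebp = esp;                 /* movl %esp, %ebp */
    esp -= 1;                  /* subl $4, %esp */
    eax = stack[ebp + 2];      /* movl 8(%ebp), %eax */
    stack[esp] = eax;          /* movl %eax, (%esp) */
    call(sim_g);               /* call g */
    esp = ebp; ebp = pop();    /* leave */
}

static void sim_main(void)
{
    push(ebp);                 /* pushl %ebp */
    ebp = esp;                 /* movl %esp, %ebp */
    esp -= 1;                  /* subl $4, %esp */
    stack[esp] = 10;           /* movl $10, (%esp) */
    call(sim_f);               /* call f */
    eax += 1;                  /* addl $1, %eax */
    esp = ebp; ebp = pop();    /* leave */
}

/* Runs the modeled program and returns the value left in eax. */
uint32_t run_program(void)
{
    esp = ebp = 32;            /* empty stack */
    call(sim_main);
    assert(esp == 32 && ebp == 32);   /* every frame has been torn down */
    return eax;
}
```

run_program() returns 16, matching f(10) + 1 in the C source, and the final assertion checks that esp and ebp are back where they started.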

Source: P_Chou Tech Space, Author: Zhou Ping

Link: http://www.pchou.info/c-cpp/2015/03/03/c-and-asm.html
