The Four-Step C Compilation Process, Explained with Two Files

1. Summary

Turning C source code into an executable takes four steps.

Preprocessing: The preprocessor handles every directive that starts with #. Its three main jobs are:

File Inclusion: Inserts the header files specified by the #include directive into the source code.
Macro Expansion: Replaces the macros defined by #define with their corresponding values.
Conditional Compilation: Determines whether to include certain parts of the code based on conditions like #if and #ifdef.

Compilation: The compiler translates the preprocessed code into assembly code, saved as a .s file.

Assembly: The assembler converts the assembly code (.s) into machine code, stored in an object file (.o). An object file is binary: the 0s and 1s that a machine can execute.

Linking: The linker combines the object files and any library files into a single executable file.

It is normal not to understand this on the first read; I didn't either. The example below makes it concrete.

2. Example

Assume we have two files, main.c and add.c.

// main.c

#define NUM 10

int add(int a, int b);  // declaration; the definition lives in add.c

int main(void)
{
    add(5, NUM);
    return 0;
}

// add.c

int add(int a, int b)
{
    return a + b;
}

This example suits the explanation well: everything non-essential has been removed, so it shows the whole process with the least possible code.

2.1. Preprocessing

main.c contains a #define, so the preprocessor replaces every NUM with 10 and drops the #define line itself. The main function becomes:

int main() {
    add(5, 10);  // NUM is expanded to 10
    return 0;
}

add.c contains no directives, so it remains unchanged.

2.2. Compilation

Compilation is typically done with gcc. Anyone who programs on Linux will know this command. If you work with microcontrollers and it looks unfamiliar, just treat it as "the compile command" and skip the details. The -S option tells gcc to stop after producing assembly code.

gcc -S main.c  # Generates main.s assembly file
gcc -S add.c   # Generates add.s assembly file

The result is plain assembly code. The listings here are simplified x86-64 assembly in AT&T syntax; real gcc output contains more directives and instructions.

# main.s

    .file   "main.c"
    .text
    .globl  main
    .type   main, @function
main:
    push    %rbp
    mov     $5, %edi        # 5 is passed to a (first argument register)
    mov     $10, %esi       # NUM is replaced with 10, passed to b (second argument register)
    call    add             # Call add function
    pop     %rbp
    ret

# add.s

    .file   "add.c"
    .text
    .globl  add
    .type   add, @function
add:
    push    %rbp
    mov     %edi, %eax      # Store a in eax
    add     %esi, %eax      # Add b to eax
    pop     %rbp
    ret

2.3. Assembly

Execute the following:

gcc -c main.s -o main.o  # Assemble main.s into main.o
gcc -c add.s -o add.o    # Assemble add.s into add.o

This assembles the .s files from the previous step into .o files, whose contents are machine code. Do not scrutinize the listings that follow; the bits are made up, just get the idea.

# main.o

00000000 00000000 00000001 00000010 00000000 00000000 00000000 00000000  # load 5 as the first argument
00000000 00000000 00000001 00000010 00000000 00000000 00000000 00001010  # load 10 as the second argument
00000000 00000000 00000001 00000010 00000000 00000000 00000000 10000000  # call add (placeholder address, fixed at link time)
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000  # ret
...

# add.o

00000000 00000000 00000001 00000010 00000000 00000000 00000000 01000000  # mov %edi, %eax (store a in eax)
00000000 00000000 00000001 00000010 00000000 00000000 00000000 01100000  # add %esi, %eax (add b to eax)
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000  # ret
...

2.4. Linking

This is the final step: linking the object files together.

Why do we need to link main.o and add.o? If there is another abcd.o file, should it be linked?

main.o calls the add function, whose machine code lives in add.o, so the two must be linked. Nothing in our program uses abcd.o, so it is left out.

gcc main.o add.o -o main  # Link to generate executable file main

Below is the linked executable file: just the two of them, with no interference from abcd.o.

In reality the linker does not simply concatenate files. It performs symbol resolution (matching the call to add in main.o with the definition in add.o) and relocation (filling in the final addresses). For simplicity, we can picture it as appending; we are not professionals in this area 😋

00000000 00000000 00000001 00000010 00000000 00000000 00000000 00000000  # load 5 as the first argument
00000000 00000000 00000001 00000010 00000000 00000000 00000000 00001010  # load 10 as the second argument
00000000 00000000 00000001 00000010 00000000 00000000 00000000 00100000  # call 0x2000 (the linker filled in the address of add)
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000  # ret

00000000 00000000 00000001 00000010 00000000 00000000 00000000 01000000  # mov %edi, %eax (store a in eax)
00000000 00000000 00000001 00000010 00000000 00000000 00000000 01100000  # add %esi, %eax (add b to eax)
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000  # ret
...

Note that real .o files are structured binary files containing code sections, data sections, symbol tables, and other structures, not just a bare sequence of machine instructions. You will pick this up as you work; beginners do not need to get bogged down in it.

To run the executable file, simply execute the following shell command:

./main