Detailed Explanation of C Language Program Compilation Process on Linux

Source from the Internet, please delete if infringing

Programming languages are commonly divided into three categories: machine language, assembly language, and high-level languages. A high-level language must be translated into machine language before it can run, and there are two main translation approaches: compilation and interpretation. High-level languages are therefore often grouped into compiled languages, such as C and C++, and interpreted languages, such as Python, Ruby, MATLAB, and JavaScript. The boundary is not strict: Java, for example, is compiled to bytecode that a virtual machine then interprets or compiles just in time.

This article explains how a program written in C/C++ is converted into binary code that the processor can execute. The process has four steps:

Preprocessing
Compilation
Assembly
Linking

Introduction to GCC Toolchain

GCC, short for GNU Compiler Collection, is the most commonly used compiler suite on Linux systems. The GCC toolchain consists of GCC itself, Binutils, the C runtime library, and related components.

GCC

GCC originally stood for GNU C Compiler and now stands for GNU Compiler Collection. It is the compiler proper: it drives the conversion of programs written in C/C++ into binary code that the processor can execute.

Binutils

A set of tools for working with binary programs, including addr2line, ar, objcopy, objdump, as, ld, readelf, and size. The ldd tool is described alongside them here, although it ships with glibc rather than Binutils. These tools are essential for development and debugging and are briefly introduced below:

addr2line: Used to convert program addresses to their corresponding source files and code lines, and can also obtain the corresponding functions. This tool will help the debugger locate the corresponding source code position during the debugging process.
as: Mainly used for assembly; detailed introduction to assembly can be found later.
ld: Mainly used for linking; detailed introduction to linking can be found later.
ar: Mainly used to create static libraries. To facilitate beginners’ understanding, the concepts of dynamic libraries and static libraries are introduced here:

Multiple .o object files can be packaged into a library file. There are two types of libraries: static libraries and dynamic (shared) libraries.
In Windows, static libraries have a .lib suffix, and shared libraries have a .dll suffix. In Linux, static libraries have a .a suffix, and shared libraries have a .so suffix.
The difference between the two lies in when the library code is bound to the program. The code of a static library is copied into the executable at link time, which makes the executable larger. The code of a shared library is loaded into memory when the executable runs; at link time only references to it are recorded, which keeps the executable smaller.
If multiple programs that use the same library run simultaneously on a system, dynamic libraries save memory because the library code is shared between them.

ldd: Can be used to view the shared libraries that an executable program depends on.
objcopy: Converts one object file into another format, for example, converting .bin to .elf or .elf to .bin, etc.
objdump: Mainly used for disassembly; see the disassembly section later in this article.
readelf: Displays information about ELF files; an example appears later.
size: Lists the size of each part of an executable file, such as the code segment and data segment, along with the total; a usage example appears later.

C Runtime Library

The C language standard mainly consists of two parts: one describes the syntax of C, and the other describes the C standard library. The C standard library defines a set of standard header files, each containing related function declarations, variables, type declarations, and macro definitions. For example, the common printf function is a C standard library function, and its prototype is declared in the stdio.h header file.

The C language standard only specifies the prototypes and behavior of the standard library functions; it does not provide an implementation. C compilers therefore rely on a C runtime library (C Run Time Library, CRT) that implements them; on Linux this is usually glibc. Similar to C, C++ also defines its own standard and is supported by related libraries, called C++ runtime libraries.

Preparation Work

Since the GCC toolchain is primarily used in the Linux environment, this article will also use the Linux system as the working environment. To demonstrate the entire compilation process, we will first prepare a simple Hello program written in C as an example, with the source code shown below:

#include <stdio.h> 

// This program is very simple, just prints a Hello World string.
int main(void)
{
  printf("Hello World!" "\n");
  return 0;
}

Compilation Process

1. Preprocessing

The preprocessing process mainly includes the following steps:

Remove all #define directives, expand all macro definitions, and process all conditional preprocessing directives, such as #if, #ifdef, #elif, #else, #endif, etc.
Process the #include preprocessing directive, inserting the included files into the position of that directive.
Remove all comments “//” and “/* */”.
Add line numbers and file name markers so that the compiler can report correct line numbers in debugging information and in error and warning messages.
Retain all #pragma compiler directives, as they will be needed in the subsequent compilation process.

The command to preprocess using gcc is as follows:

$ gcc -E hello.c -o hello.i   # Preprocess the source file hello.c to generate hello.i
                              # The -E option makes GCC stop after preprocessing

The hello.i file can be opened as a regular text file for viewing, with the following code snippet:

// hello.i code snippet

extern void funlockfile (FILE *__stream) __attribute__ ((__nothrow__ , __leaf__));
# 942 "/usr/include/stdio.h" 3 4

# 2 "hello.c" 2


# 3 "hello.c"
int
main(void)
{
  printf("Hello World!" "\n");
  return 0;
}

2. Compilation

The compilation process performs lexical analysis, syntax analysis, semantic analysis, and optimization on the preprocessed file to generate the corresponding assembly code.

The command to compile using gcc is as follows:

$ gcc -S hello.i -o hello.s   # Compile the preprocessed hello.i file to generate the assembly program hello.s
                              # The -S option makes GCC stop after compiling, generating the assembly program

An excerpt of the assembly program hello.s generated by the above command:

// hello.s code snippet

main:
.LFB0:
    .cfi_startproc
    pushq   %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq    %rsp, %rbp
    .cfi_def_cfa_register 6
    movl    $.LC0, %edi
    call    puts
    movl    $0, %eax
    popq    %rbp
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc

3. Assembly

The assembly process translates the assembly code into machine instructions that the processor can execute and saves them in an object file with a .o suffix. Since nearly every assembly statement corresponds to one processor instruction, assembly is relatively simple: the assembler as from Binutils translates the statements one by one according to the correspondence table between assembly instructions and machine instructions.

When a program consists of multiple source code files, each file must complete the assembly work first to generate .o object files before proceeding to the next step of linking. Note: The object file is already part of the final program, but it cannot be executed before linking.

The command to assemble using gcc is as follows:

$ gcc -c hello.s -o hello.o   # Assemble the compiled hello.s file to generate the object file hello.o
                              # The -c option makes GCC stop after assembly, generating the object file
# or call as directly
$ as hello.s -o hello.o       # Use Binutils' as to assemble hello.s into an object file

Note: The hello.o object file is in ELF (Executable and Linkable Format) format and is a relocatable file.

4. Linking

Linking can be divided into static linking and dynamic linking, with the following key points:

Static linking means copying the needed code from static libraries into the executable file during the linking phase, resulting in a larger executable file. The linker copies function code from wherever it lives (different object files or static libraries) into the final executable program. To create an executable file, the linker has two main tasks: symbol resolution (associating each symbol reference in the object files with its definition) and relocation (assigning memory addresses to symbol definitions and then patching every reference to those symbols).
Dynamic linking means that during the linking phase only some descriptive information is added; the corresponding dynamic libraries are loaded into memory from the system when the program executes.

In Linux systems, gcc usually searches for libraries at link time in the following order: first the paths specified by the -L option of the gcc command; then the paths specified by the LIBRARY_PATH environment variable; finally the default paths /lib, /usr/lib, /usr/local/lib.
In Linux systems, the dynamic loader usually searches for dynamic libraries at run time in the following order: first the search paths recorded in the binary when it was built; then the paths specified by the LD_LIBRARY_PATH environment variable; then the paths listed in the configuration file /etc/ld.so.conf; finally the default paths /lib, /usr/lib.
In Linux systems, the ldd command can be used to view the shared libraries that an executable program depends on.

Since the search paths for dynamic and static libraries may overlap, a directory can contain a static library and a dynamic library with the same name, such as libtest.a and libtest.so. In that case gcc chooses the dynamic library by default and links libtest.so. To make gcc link libtest.a instead, specify the gcc option -static, which forces the use of static libraries for linking. Taking Hello World as an example:

If you use the command “gcc hello.c -o hello”, gcc links against dynamic libraries. The size of the generated ELF executable file (checked using the size command from Binutils) and the dynamic libraries it depends on (checked using the ldd command) are as follows:

$ gcc hello.c -o hello
$ size hello   # Use size to check the size
   text    data     bss     dec     hex filename
   1183     552       8    1743     6cf     hello
$ ldd hello    # This executable depends on dynamic libraries, mainly the Linux glibc dynamic library
        linux-vdso.so.1 =>  (0x00007fffefd7c000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fadcdd82000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fadce14c000)

If you use the command “gcc -static hello.c -o hello”, gcc links against static libraries. The size of the generated ELF executable file (checked using the size command from Binutils) and the result of looking for dynamic library dependencies (checked using the ldd command) are as follows:

$ gcc -static hello.c -o hello
$ size hello   # Use size to check the size
     text    data     bss     dec     hex filename
 823726    7284    6360  837370   cc6fa     hello   # The text (code) size becomes far larger
$ ldd hello
       not a dynamic executable   # Indicates that no dynamic libraries are linked

The final file generated by the linker is an executable file in ELF format. Its contents are organized into different sections (often loosely called segments), commonly .text, .data, .rodata, .bss, etc.

Analyzing ELF Files

1. ELF File Segments

In an ELF file, the sections (loosely called segments in this article) are located between the ELF Header and the Section Header Table. A typical ELF file contains the following sections:

.text: The instruction code segment of the compiled program.
.rodata: ro stands for read-only, meaning read-only data (e.g., constants).
.data: Initialized global variables and static local variables of the C program.
.bss: Uninitialized global variables and static local variables of the C program.
.debug: Debugging symbol table, which helps the debugger.

One can use readelf -S to view the information of each section as follows:

$ readelf -S hello
There are 31 section headers, starting at offset 0x19d8:

Section Headers:
  [Nr] Name              Type             Address           Offset
       Size              EntSize          Flags  Link  Info  Align
  [ 0]                   NULL             0000000000000000  00000000
       0000000000000000  0000000000000000           0     0     0
……
  [11] .init             PROGBITS         00000000004003c8  000003c8
       000000000000001a  0000000000000000  AX       0     0     4
……
  [14] .text             PROGBITS         0000000000400430  00000430
       0000000000000182  0000000000000000  AX       0     0     16
  [15] .fini             PROGBITS         00000000004005b4  000005b4
……

2. Disassembling ELF

Since ELF files cannot be opened as regular text files, if you want to directly view the instructions and data contained in an ELF file, you need to use disassembly methods.

Use objdump -D to disassemble as follows:

$ objdump -D hello
……
0000000000400526 <main>:  // PC address of the main label
// PC address: Instruction encoding      Assembly format of the instruction
  400526:    55                      push   %rbp
  400527:    48 89 e5                mov    %rsp,%rbp
  40052a:    bf c4 05 40 00          mov    $0x4005c4,%edi
  40052f:    e8 cc fe ff ff          callq  400400 <puts@plt>
  400534:    b8 00 00 00 00          mov    $0x0,%eax
  400539:    5d                      pop    %rbp
  40053a:    c3                      retq
  40053b:    0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
……

Use objdump -S to disassemble and mix the C source code:

$ gcc -o hello -g hello.c   # The -g option is required so that objdump can show the source
$ objdump -S hello
……
0000000000400526 <main>:
#include <stdio.h>

int
main(void)
{
  400526:    55                      push   %rbp
  400527:    48 89 e5                mov    %rsp,%rbp
  printf("Hello World!" "\n");
  40052a:    bf c4 05 40 00          mov    $0x4005c4,%edi
  40052f:    e8 cc fe ff ff          callq  400400 <puts@plt>
  return 0;
  400534:    b8 00 00 00 00          mov    $0x0,%eax
}
  400539:    5d                      pop    %rbp
  40053a:    c3                      retq
  40053b:    0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
……

