
Programming languages are commonly divided into three categories: machine language, assembly language, and high-level languages. A program written in a high-level language must be translated into machine language before it can run, and there are two broad approaches to translation: compilation and interpretation. High-level languages are therefore often grouped into compiled languages, such as C and C++, and interpreted languages, such as Python, Ruby, MATLAB, and JavaScript. (Java sits in between: it is compiled to bytecode, which a virtual machine then interprets or JIT-compiles.)
This article explains how a program written in C/C++ is converted into binary code that the processor can execute. The process has four steps:
-
Preprocessing
-
Compilation
-
Assembly
-
Linking
Introduction to GCC Toolchain
GCC stands for GNU Compiler Collection, the compiler suite most commonly used on Linux systems. The GCC toolchain comprises GCC itself, Binutils, the C runtime library, and related tools.
GCC
GCC (originally the GNU C Compiler) is the compiler itself. It carries out the process of converting programs written in C/C++ into binary code that the processor can execute.
Binutils
A set of tools for working with binary programs, including addr2line, ar, objcopy, objdump, as, ld, readelf, and size. (ldd, also described below, is normally used alongside them but ships with the C library rather than with Binutils.) These tools are essential for development and debugging. Briefly:
-
addr2line: Converts a program address into the corresponding source file and line number, and can also report the enclosing function. It helps locate the source position for an address encountered while debugging.
-
as: The assembler; assembly is covered in detail later.
-
ld: The linker; linking is covered in detail later.
-
ar: Mainly used to create static libraries. For readers new to the topic, static and dynamic libraries are introduced here:
-
When several .o object files are packaged into a library file, the result can be one of two kinds of library: a static library or a dynamic (shared) library.
-
In Windows, static libraries have a .lib suffix, and shared libraries have a .dll suffix. In Linux, static libraries have a .a suffix, and shared libraries have a .so suffix.
-
Static and dynamic libraries differ in when their code is brought into the program. Code from a static library is copied into the executable at link time, which makes the executable larger. Code from a shared library is loaded into memory when the executable runs; at link time only references to it are recorded, so the executable is smaller.
-
If several programs that use the same libraries run on a system at the same time, dynamic libraries save memory, because the programs can share one copy of the library code.
-
ldd: Can be used to view the shared libraries that an executable program depends on.
-
objcopy: Copies an object file, optionally converting it to another format, for example .bin to .elf or .elf to .bin.
-
objdump: Mainly used for disassembly, which is covered in detail later.
-
readelf: Displays information about ELF files; see the ELF analysis section later.
-
size: Lists the size of each part of an executable file, such as the code segment and data segment, along with the total size; usage examples appear later.
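As a minimal sketch of the static versus dynamic library distinction described above, the following hypothetical one-function library can be packaged either way. The names greet.c, libgreet, and greet_length are made up for illustration, and main.c is assumed to be a separate file that calls greet_length. The commands in the comment use standard gcc and ar options.

```c
/* greet.c: a hypothetical one-function library (illustrative names only).
 *
 * Build a static library:
 *   gcc -c greet.c -o greet.o
 *   ar rcs libgreet.a greet.o
 *
 * Build a shared library:
 *   gcc -shared -fPIC greet.c -o libgreet.so
 *
 * Link a program against it (main.c is assumed to call greet_length):
 *   gcc main.c -L. -lgreet -o main
 * If both libgreet.a and libgreet.so are present, gcc picks the shared one
 * unless -static is given. A program linked against the shared library in
 * the current directory may need LD_LIBRARY_PATH=. at run time.
 */
#include <stddef.h>
#include <string.h>

/* Returns the length of the greeting "Hello, <name>". */
size_t greet_length(const char *name)
{
    return strlen("Hello, ") + strlen(name);
}
```

Running ldd on the resulting main should list libgreet.so only in the shared case; in the static case the code of greet_length is copied into main itself.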
C Runtime Library
The C language standard consists mainly of two parts: one describes the syntax of C, and the other describes the C standard library. The C standard library defines a set of standard header files, each containing related function, variable, and type declarations and macro definitions. For example, the familiar printf function is a C standard library function, and its prototype is declared in the <stdio.h> header.
The C standard specifies only the prototypes and behavior of the standard library functions; it does not supply an implementation. C compilers therefore rely on a C runtime library (CRT) to provide one. C++ likewise defines its own standard and has corresponding support libraries, called C++ runtime libraries.
Preparation Work
The examples in this article use the following source file, hello.c:
#include <stdio.h>
// This program is very simple; it just prints a Hello World string.
int
main(void)
{
    printf("Hello World!" "\n");
    return 0;
}
Compilation Process
1. Preprocessing
The preprocessing process mainly includes the following steps:
-
Remove all #define directives, expand all macro definitions, and process all conditional preprocessing directives, such as #if, #ifdef, #elif, #else, #endif, etc.
-
Process the #include preprocessing directive, inserting the included files into the position of that directive.
-
Remove all comments “//” and “/* */”.
-
Add line markers (line numbers and file names, such as # 3 "hello.c") so that the compiler can report the original line numbers in debugging information and in error and warning messages.
-
Retain all #pragma compiler directives, as they will be needed in the subsequent compilation process.
The command to preprocess using gcc is as follows:
$ gcc -E hello.c -o hello.i   # Preprocess the source file hello.c to generate hello.i
# The -E option makes GCC stop after preprocessing
hello.i is an ordinary text file and can be opened in any editor. An excerpt follows:
// hello.i code snippet
extern void funlockfile (FILE *__stream) __attribute__ ((__nothrow__ , __leaf__));
# 942 "/usr/include/stdio.h" 3 4
# 2 "hello.c" 2
# 3 "hello.c"
int
main(void)
{
printf("Hello World!" "\n");
return 0;
}
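The preprocessing steps listed above are easier to see on an input that actually uses macros and conditional directives. The following is a minimal sketch, not part of the original hello.c: the file name macros.c and every identifier in it (SQUARE, BUFFER_SIZE, USE_DOUBLE, number_t, square_plus_size) are hypothetical names chosen for illustration.

```c
/* macros.c: hypothetical input for inspecting preprocessor output.
 * Try:  gcc -E macros.c              (USE_DOUBLE undefined, number_t is int)
 *       gcc -E -DUSE_DOUBLE macros.c (number_t becomes double)
 */
#define SQUARE(x) ((x) * (x))
#define BUFFER_SIZE 16

/* Conditional directives are resolved here; only one typedef survives. */
#ifdef USE_DOUBLE
typedef double number_t;
#else
typedef int number_t;
#endif

/* After preprocessing, the return statement reads:
 *     return ((n) * (n)) + 16;
 * and none of the comments in this file remain. */
number_t square_plus_size(number_t n)
{
    return SQUARE(n) + BUFFER_SIZE;
}
```

In the output of gcc -E, the #define lines are gone, the macro uses are expanded in place, and only the typedef from the selected branch is kept.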
2. Compilation
The compilation stage performs lexical analysis, syntax analysis, semantic analysis, and optimization on the preprocessed file, and generates the corresponding assembly code.
$ gcc -S hello.i -o hello.s   # Compile the preprocessed hello.i file to generate the assembly program hello.s
# The -S option makes GCC stop after compiling, generating the assembly program
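An excerpt from the hello.s generated by the command above is shown below. Note that GCC has replaced the call to printf with a call to puts, which it can do because the string contains no format specifiers and ends with a newline.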
// hello.s code snippet
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movl $.LC0, %edi
call puts
movl $0, %eax
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
3. Assembly
The assembly stage translates the assembly code into machine instructions that the processor can execute and saves them in an object file with a .o suffix. Because nearly every assembly statement corresponds to one machine instruction, assembly is simple compared with compilation: the assembler, as from Binutils, translates the statements one by one using the mapping between assembly mnemonics and machine instructions.
When a program consists of multiple source code files, each file must complete the assembly work first to generate .o object files before proceeding to the next step of linking. Note: The object file is already part of the final program, but it cannot be executed before linking.
The command to assemble using gcc is as follows:
$ gcc -c hello.s -o hello.o   # Assemble hello.s to generate the object file hello.o
# The -c option makes GCC stop after assembly, generating the object file
# Alternatively, call the assembler directly:
$ as hello.s -o hello.o       # Use Binutils' as to assemble hello.s into an object file
Note: The hello.o object file is in ELF (Executable and Linkable Format) format and is a relocatable file.
4. Linking
Linking can be divided into static linking and dynamic linking, with the following key points:
-
Static linking means copying the needed code from static libraries into the executable file at link time, which results in a larger executable. The linker copies the function code from wherever it lives (in different object files or static libraries) into the final executable program. To create an executable, the linker has two main tasks: symbol resolution (matching each reference to a symbol in the object files with its definition) and relocation (assigning memory addresses to the symbol definitions and then patching every reference to those symbols).
-
Dynamic linking means that at link time only references to the required dynamic libraries are recorded in the executable; the libraries themselves are loaded into memory from the system when the program runs.
-
On Linux, gcc usually searches for libraries at link time in the following order: first the paths given by the -L option on the gcc command line; then the paths in the LIBRARY_PATH environment variable; finally the default paths /lib, /usr/lib, and /usr/local/lib.
-
On Linux, dynamic libraries are usually searched for at run time in the following order: first the search paths embedded in the executable when it was linked (for example, an rpath); then the paths in the LD_LIBRARY_PATH environment variable; then the paths listed in the configuration file /etc/ld.so.conf; finally the default paths /lib and /usr/lib.
-
In Linux systems, the ldd command can be used to view the shared libraries that an executable program depends on.
Because static and dynamic libraries can live in the same search paths, a directory may contain a static library and a dynamic library with the same name, such as libtest.a and libtest.so. In that case gcc links the dynamic library, libtest.so, by default. To link libtest.a instead, pass the gcc option -static, which forces static linking. Taking Hello World as an example:
-
If you use the command “gcc hello.c -o hello”, the program is linked against dynamic libraries. The size of the generated ELF executable (checked with the size command from Binutils) and the dynamic libraries it links (checked with the ldd command) are as follows:
$ gcc hello.c -o hello
$ size hello   # Use size to check the size
text data bss dec hex filename
1183 552 8 1743 6cf hello
$ ldd hello    # The executable links several dynamic libraries, mainly the Linux glibc dynamic library
linux-vdso.so.1 => (0x00007fffefd7c000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fadcdd82000)
/lib64/ld-linux-x86-64.so.2 (0x00007fadce14c000)
-
If you use the command “gcc -static hello.c -o hello”, the program is linked against static libraries. The size of the generated ELF executable (checked with the size command from Binutils) and the dynamic libraries it links (checked with the ldd command) are as follows:
$ gcc -static hello.c -o hello
$ size hello   # Use size to check the size
text data bss dec hex filename
823726 7284 6360 837370 cc6fa hello
# The text (code) size has become far larger
$ ldd hello
not a dynamic executable
# Indicates that no dynamic libraries are linked
The final file generated by the linker is an executable in ELF format. The contents of an ELF executable are organized into sections, commonly .text, .data, .rodata, .bss, and others.
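To make the symbol resolution step of linking more concrete, here is a minimal sketch with two hypothetical source files shown together in one listing. The file names counter.c and user.c and all identifiers are made up for illustration. The listing also compiles as a single file, but the point is what happens when the two parts are split.

```c
/* ---- counter.c (hypothetical): defines the symbols ---- */
int counter = 0;

void increment(void)
{
    counter++;
}

/* ---- user.c (hypothetical): only references them ---- */
extern int counter;     /* declaration only; no storage is allocated here */
void increment(void);   /* prototype only; the code lives in counter.c */

/* Calls increment twice and returns the shared counter. */
int bump_twice(void)
{
    increment();
    increment();
    return counter;
}
```

If the two parts are saved as separate files and each is built with gcc -c, running nm user.o should list counter and increment as undefined (U). The linker resolves them against the definitions in counter.o and then patches the addresses used inside bump_twice, which is the relocation step.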
Analyzing ELF Files
1. ELF File Sections
The ELF file format is shown in the following figure, with the sections located between the ELF header and the section header table. A typical ELF file contains the following sections:
-
.text: The instruction code segment of the compiled program.
-
.rodata: ro stands for read-only, meaning read-only data (e.g., constants).
-
.data: Initialized global variables and static local variables of the C program.
-
.bss: Uninitialized global variables and static local variables of the C program.
-
.debug: Debugging symbol table, which helps the debugger.
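The mapping from C constructs to the sections listed above can be sketched as follows. All identifiers are hypothetical, and the placements in the comments are what GCC typically does on Linux; the exact result depends on the compiler, its options, and the target, so treat the comments as assumptions to verify rather than guarantees.

```c
int initialized_global = 42;        /* initialized global: typically .data */
int uninitialized_global;           /* uninitialized global: typically .bss */
const char greeting[] = "Hello";    /* read-only constant: typically .rodata */

/* The machine code of this function goes into .text. */
int next_id(void)
{
    static int base = 100;          /* initialized static local: typically .data */
    static int calls;               /* uninitialized static local: typically .bss */
    return base + calls++;
}
```

After compiling this with gcc -c, objdump -t on the object file shows the section each symbol was placed in, which is a quick way to check the comments against a particular toolchain.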
Use readelf -S to view the details of each section:
$ readelf -S hello
There are 31 section headers, starting at offset 0x19d8:
Section Headers:
[Nr] Name Type Address Offset
Size EntSize Flags Link Info Align
[ 0] NULL 0000000000000000 00000000
0000000000000000 0000000000000000 0 0 0
……
[11] .init PROGBITS 00000000004003c8 000003c8
000000000000001a 0000000000000000 AX 0 0 4
……
[14] .text PROGBITS 0000000000400430 00000430
0000000000000182 0000000000000000 AX 0 0 16
[15] .fini PROGBITS 00000000004005b4 000005b4
……
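Tools such as readelf begin by reading the ELF header at the start of the file. As a minimal sketch of that first step, the function below checks the four-byte ELF magic number (0x7f followed by 'E', 'L', 'F') and the class byte that follows it (1 for 32-bit, 2 for 64-bit). The function name elf_class_bits is hypothetical, and the code inspects raw bytes only; it is not how readelf itself is implemented.

```c
#include <stddef.h>
#include <string.h>

/* Returns 32 or 64 for an ELF image of that class, or 0 if the buffer
 * does not start with a recognizable ELF header. */
int elf_class_bits(const unsigned char *buf, size_t len)
{
    static const unsigned char magic[4] = {0x7f, 'E', 'L', 'F'};

    if (len < 5 || memcmp(buf, magic, sizeof magic) != 0)
        return 0;               /* too short, or not an ELF file */
    if (buf[4] == 1)
        return 32;              /* 32-bit class */
    if (buf[4] == 2)
        return 64;              /* 64-bit class */
    return 0;                   /* unknown class value */
}
```

Passing it the first few bytes of the hello executable built earlier on an x86-64 system should yield 64.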
2. Disassembling ELF
Because an ELF file is binary and cannot be read as ordinary text, you need to disassemble it to view the instructions and data it contains:
$ objdump -D hello
……
0000000000400526 <main>: // PC address of the main label
// PC address: Instruction encoding Assembly format of the instruction
400526: 55 push %rbp
400527: 48 89 e5 mov %rsp,%rbp
40052a: bf c4 05 40 00 mov $0x4005c4,%edi
40052f: e8 cc fe ff ff callq 400400 <puts@plt>
400534: b8 00 00 00 00 mov $0x0,%eax
400539: 5d pop %rbp
40053a: c3 retq
40053b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
……
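The instruction encodings in the listing above can be checked by hand. As a worked sketch, the function below decodes the target of the call at address 40052f, whose bytes are e8 cc fe ff ff: opcode e8 is a near call followed by a 32-bit little-endian displacement, which is added to the address of the next instruction. The function name call_target is hypothetical.

```c
#include <stdint.h>

/* Computes the target of an x86-64 near call (opcode e8) from the address
 * of the call instruction and its four displacement bytes. */
uint64_t call_target(uint64_t insn_addr, const uint8_t rel32_bytes[4])
{
    /* Assemble the displacement from little-endian bytes. */
    uint32_t raw = (uint32_t)rel32_bytes[0]
                 | ((uint32_t)rel32_bytes[1] << 8)
                 | ((uint32_t)rel32_bytes[2] << 16)
                 | ((uint32_t)rel32_bytes[3] << 24);
    int32_t disp = (int32_t)raw;            /* reinterpret as signed */

    /* The call instruction is 5 bytes long (1 opcode + 4 displacement),
     * and the displacement is relative to the next instruction. */
    return insn_addr + 5 + (uint64_t)(int64_t)disp;
}
```

For the call in the listing, the displacement bytes cc fe ff ff are 0xfffffecc, or -0x134 as a signed value. The next instruction is at 0x400534, and 0x400534 - 0x134 = 0x400400, which matches the puts@plt address that objdump prints.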
Use objdump -S to disassemble with the C source code interleaved:
$ gcc -o hello -g hello.c   # The -g option is required so that debugging information is available
$ objdump -S hello
……
0000000000400526 <main>:
#include <stdio.h>
int
main(void)
{
400526: 55 push %rbp
400527: 48 89 e5 mov %rsp,%rbp
printf("Hello World!" "\n");
40052a: bf c4 05 40 00 mov $0x4005c4,%edi
40052f: e8 cc fe ff ff callq 400400 <puts@plt>
return 0;
400534: b8 00 00 00 00 mov $0x0,%eax
}
400539: 5d pop %rbp
40053a: c3 retq
40053b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
……
