In-Depth Analysis of the GCC Compilation Process: From Preprocessing to Linking

This article is generated by Tencent Yuanbao, with content outline and verification provided by @Xiao Hui~

The compilation process of GCC (GNU Compiler Collection) is a multi-stage pipeline that converts human-readable source code into a machine-executable program. It consists of preprocessing, compilation proper (which can be further divided into five sub-stages), assembly, and linking. Understanding these stages and how GCC implements them is crucial for optimizing code performance and debugging.

1. Preprocessing

Preprocessing is the first step in the compilation process, primarily performed by the preprocessor (<span>cpp</span>). It handles preprocessing directives in the source code (which start with <span>#</span>) and prepares for the subsequent compilation stages.

Preprocessing commands in GCC:

gcc -E source.c -o source.i
  • The <span>-E</span> option instructs GCC to stop after preprocessing. Without <span>-o</span>, the result is written to standard output.
  • The preprocessed output is conventionally saved as a <span>.i</span> file (<span>.ii</span> for C++).

Main tasks of the preprocessing stage:

  1. Macro Expansion: All macros defined with <span>#define</span> will be expanded and replaced.
  2. Conditional Compilation: Based on conditional compilation directives such as <span>#if</span>, <span>#ifdef</span>, <span>#ifndef</span>, <span>#elif</span>, <span>#else</span>, <span>#endif</span>, specific code blocks will be included or excluded.
  3. Header File Inclusion: The <span>#include</span> directive will recursively insert the contents of the specified header file at the location of the directive.
  4. Predefined Macro Expansion: Replaces predefined macros such as <span>__LINE__</span>, <span>__FILE__</span>, <span>__DATE__</span>, and <span>__TIME__</span> with their corresponding values. (<span>__func__</span> is not a macro; it is a predefined identifier that the compiler proper handles later.)
  5. Comment Removal: Every comment (<span>//</span> and <span>/* ... */</span>) is replaced by a single space.
  6. Line Marking: Emits linemarkers of the form <span># linenum "filename" flags</span> so that the compiler and debugger can trace each line back to its original location.
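The tasks above can be seen on a small C file. The following is a minimal sketch, not taken from GCC itself: the macro names, the function names, and the <span>USE_FAST_PATH</span> flag are all hypothetical and chosen only for illustration.

```c
#include <assert.h>

/* Object-like and function-like macros: replaced textually before compilation. */
#define BUFFER_SIZE 64
#define SQUARE(x) ((x) * (x))

/* Conditional compilation: only one branch survives preprocessing.
 * USE_FAST_PATH is a hypothetical flag, e.g. set with -DUSE_FAST_PATH. */
#ifdef USE_FAST_PATH
#define MODE_NAME "fast"
#else
#define MODE_NAME "portable"
#endif

/* After preprocessing, the body reads: return ((n) * (n)); */
int square_of(int n) { return SQUARE(n); }

/* After preprocessing, the body reads: return 64; */
int buffer_size(void) { return BUFFER_SIZE; }

/* __LINE__ is replaced by the number of the line on which it appears. */
int current_line(void) { return __LINE__; }

/* Returns "portable" unless USE_FAST_PATH was defined on the command line. */
const char *mode_name(void) { return MODE_NAME; }

/* assert is itself a macro from <assert.h>: it expands to a runtime check,
 * or to nothing when NDEBUG is defined. */
void check_square(void) { assert(square_of(3) == 9); }
```

Saving this under a name of your choice and running <span>gcc -E</span> on it should show the macro uses replaced, the comments gone, and the contents of <span>assert.h</span> pasted in at the top.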

Precompiled Headers (PCH): To improve compilation speed for large projects, GCC supports precompiled headers. The core idea is to precompile stable and widely referenced header files (such as standard library headers) into an intermediate format (GCC generates <span>.gch</span> files), which can be directly used in subsequent compilations to avoid re-parsing the same header files.

  • Creating PCH: For example, <span>g++ -x c++-header stdafx.h -o stdafx.h.gch</span>
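  • Using PCH: When GCC processes <span>#include "stdafx.h"</span>, it looks for <span>stdafx.h.gch</span> in each directory it searches for the header, before looking for the header itself, and uses the precompiled file if it is valid for the current compilation (for example, built with compatible options).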

2. Compilation

After preprocessing, GCC hands the <span>.i</span> file to the compiler itself (for example, <span>cc1</span> for C, <span>cc1plus</span> for C++) for compilation. This stage is responsible for converting high-level language code into assembly code, which can be further divided into five sub-stages.

Compilation commands in GCC:

gcc -S source.i -o source.s
  • <span>-S</span> option instructs GCC to compile and generate assembly code, then stop.
  • The output is an assembly language file, usually with a <span>.s</span> extension.

2.1 Lexical Analysis

The lexical analyzer (scanner) reads the character stream and breaks it down into a series of meaningful lexical units (tokens), such as keywords (<span>int</span>, <span>if</span>), identifiers (variable names, function names), constants, operators, etc. Whitespace is discarded; comments have already been removed during preprocessing.

  • Implementation in GCC: GCC uses a hand-written lexical analyzer, which lives in the same library (libcpp) as the preprocessor.
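To make the idea of splitting a character stream into tokens concrete, here is a minimal hand-written scanner in C. It is a sketch under simplifying assumptions, not GCC's lexer: the <span>Token</span> type and the function names are hypothetical, keywords are treated as ordinary identifiers, and every operator is assumed to be a single character.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

typedef enum { TOK_IDENT, TOK_NUMBER, TOK_PUNCT, TOK_END } TokenKind;

typedef struct {
    TokenKind kind;
    const char *start;   /* points into the input buffer */
    size_t length;       /* number of characters in the token */
} Token;

/* Reads one token starting at *cursor and advances the cursor past it. */
Token next_token(const char **cursor) {
    assert(cursor != NULL && *cursor != NULL);
    const char *p = *cursor;
    while (isspace((unsigned char)*p)) p++;       /* skip whitespace */

    Token tok = { TOK_END, p, 0 };
    if (*p == '\0') { *cursor = p; return tok; }

    if (isalpha((unsigned char)*p) || *p == '_') {
        tok.kind = TOK_IDENT;                     /* identifier or keyword */
        while (isalnum((unsigned char)*p) || *p == '_') p++;
    } else if (isdigit((unsigned char)*p)) {
        tok.kind = TOK_NUMBER;                    /* decimal integer constant */
        while (isdigit((unsigned char)*p)) p++;
    } else {
        tok.kind = TOK_PUNCT;                     /* single-character operator */
        p++;
    }
    tok.length = (size_t)(p - tok.start);
    *cursor = p;
    return tok;
}

/* Counts the tokens in a string, excluding the end marker. */
size_t count_tokens(const char *text) {
    size_t n = 0;
    while (next_token(&text).kind != TOK_END) n++;
    return n;
}
```

For the input <span>int x = 42;</span> this scanner yields five tokens: <span>int</span>, <span>x</span>, <span>=</span>, <span>42</span>, and <span>;</span>.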

2.2 Syntax Analysis

The syntax analyzer (parser) combines the sequence of Tokens produced by lexical analysis into various syntactic structures (such as expressions, statements, function definitions, etc.) according to the language’s syntax rules, and constructs an Abstract Syntax Tree (AST) that reflects the hierarchical syntactic structure of the source code.

  • Implementation in GCC: GCC uses hand-written recursive-descent parsers for C and C++ (earlier releases used Bison-generated parsers).
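One common way to turn grammar rules into an AST is recursive descent, where each rule becomes a function. The sketch below handles only a toy arithmetic grammar; the <span>Node</span> layout and all function names are hypothetical, and syntax errors are handled with <span>assert</span> where a real parser would report a diagnostic.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Toy grammar (an assumption for this sketch):
 *   expr   := term   (('+' | '-') term)*
 *   term   := factor (('*' | '/') factor)*
 *   factor := NUMBER | '(' expr ')'
 */
typedef struct Node {
    char op;                 /* '+', '-', '*', '/' or 'n' for a number leaf */
    int value;               /* meaningful only when op == 'n' */
    struct Node *lhs, *rhs;  /* children of an operator node */
} Node;

static const char *pos;      /* current read position in the input */

static Node *new_node(char op, int value, Node *lhs, Node *rhs) {
    Node *n = malloc(sizeof *n);
    assert(n != NULL);
    n->op = op; n->value = value; n->lhs = lhs; n->rhs = rhs;
    return n;
}

static void skip_spaces(void) { while (isspace((unsigned char)*pos)) pos++; }

static Node *parse_expr(void);

static Node *parse_factor(void) {
    skip_spaces();
    if (*pos == '(') {
        pos++;
        Node *inner = parse_expr();
        skip_spaces();
        assert(*pos == ')');   /* a real parser would report a syntax error */
        pos++;
        return inner;
    }
    assert(isdigit((unsigned char)*pos));
    int value = 0;
    while (isdigit((unsigned char)*pos)) value = value * 10 + (*pos++ - '0');
    return new_node('n', value, NULL, NULL);
}

static Node *parse_term(void) {
    Node *left = parse_factor();
    for (skip_spaces(); *pos == '*' || *pos == '/'; skip_spaces()) {
        char op = *pos++;
        left = new_node(op, 0, left, parse_factor());
    }
    return left;
}

static Node *parse_expr(void) {
    Node *left = parse_term();
    for (skip_spaces(); *pos == '+' || *pos == '-'; skip_spaces()) {
        char op = *pos++;
        left = new_node(op, 0, left, parse_term());
    }
    return left;
}

/* Entry point: builds the AST for a whole expression. */
Node *parse(const char *text) { pos = text; return parse_expr(); }

/* Walks the tree bottom-up; division by zero is not guarded in this sketch. */
int evaluate(const Node *n) {
    if (n->op == 'n') return n->value;
    int a = evaluate(n->lhs), b = evaluate(n->rhs);
    switch (n->op) {
        case '+': return a + b;
        case '-': return a - b;
        case '*': return a * b;
        default:  return a / b;
    }
}

void free_tree(Node *n) {
    if (n == NULL) return;
    free_tree(n->lhs);
    free_tree(n->rhs);
    free(n);
}
```

Because <span>term</span> sits below <span>expr</span> in the grammar, the tree for <span>1 + 2 * 3</span> has <span>+</span> at the root and the multiplication as its right child, which is how operator precedence ends up encoded in the AST.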

2.3 Semantic Analysis

The semantic analyzer checks the AST for context-related properties to ensure the logical validity of the program.

  • Main tasks:
    • Type Checking: Ensures that the operand types of operators are compatible, function call parameters match their declarations, etc.
    • Identifier Resolution: Associates variable and function references with their declarations.
    • Control Flow Checking: Ensures that statements like <span>break</span>, <span>continue</span> are located in appropriate loop or switch contexts.
  • This stage will attach semantic information such as types to the AST.
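Type checking and AST annotation can be sketched as a bottom-up walk that computes a type for every node. The example below is a deliberately simplified model, not GCC's implementation: the <span>Type</span> and <span>Expr</span> definitions are hypothetical, and only a few C-like rules are covered.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { TY_INT, TY_DOUBLE, TY_POINTER, TY_ERROR } Type;

typedef struct Expr {
    char op;                  /* '+', '*', or 'v' for a leaf with a declared type */
    Type type;                /* declared type for leaves; filled in for operators */
    struct Expr *lhs, *rhs;
} Expr;

/* Result type of a binary operator under simplified C-like rules. */
static Type binary_result(char op, Type a, Type b) {
    if (a == TY_ERROR || b == TY_ERROR) return TY_ERROR;   /* propagate errors */
    bool a_arith = (a == TY_INT || a == TY_DOUBLE);
    bool b_arith = (b == TY_INT || b == TY_DOUBLE);
    if (a_arith && b_arith)                 /* mixed arithmetic widens to double */
        return (a == TY_DOUBLE || b == TY_DOUBLE) ? TY_DOUBLE : TY_INT;
    if (op == '+' && ((a == TY_POINTER && b == TY_INT) ||
                      (a == TY_INT && b == TY_POINTER)))
        return TY_POINTER;                  /* pointer plus integer offset */
    return TY_ERROR;                        /* e.g. pointer * int */
}

/* Annotates every operator node with its type and returns the root's type. */
Type check(Expr *e) {
    assert(e != NULL);
    if (e->op == 'v') return e->type;
    e->type = binary_result(e->op, check(e->lhs), check(e->rhs));
    return e->type;
}
```

In this model, <span>int + double</span> is annotated as <span>double</span>, <span>pointer + int</span> stays a pointer, and <span>pointer * int</span> is flagged as an error, mirroring the kind of diagnostics a C compiler issues at this stage.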

2.4 Intermediate Code Generation & Optimization

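The compiler converts the AST into an Intermediate Representation (IR) that is largely independent of the hardware architecture. In GCC, the front end produces a tree representation called GENERIC, which is then lowered to GIMPLE, a form of three-address code on which most target-independent optimizations run. The IR provides a good platform for code optimization.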

  • Optimization: The compiler performs various optimizations on the IR, such as constant propagation, dead code elimination, loop optimization, inline expansion, etc., aimed at improving the runtime efficiency of the code or reducing its size.
  • GCC command: The <span>-fdump-tree-all</span> option writes the tree/GIMPLE IR after each pass to dump files, which is useful for debugging and for studying what each optimization does.
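The effect of these optimizations can be illustrated by applying them by hand. The three functions below are a sketch: the first is what a programmer might write, the second restates it in three-address style (at most one operator per statement), and the third is what constant propagation, constant folding, and dead code elimination could reduce it to. The function and temporary names are hypothetical, and real GIMPLE dumps use their own syntax and naming.

```c
#include <assert.h>

/* As written by the programmer. */
int area_original(int width) {
    int scale = 4;
    int height = scale * 2;
    if (scale > 10)          /* never true once scale is known to be 4 */
        return -1;
    return width * height;
}

/* The same logic in three-address style: one operator per statement,
 * explicit temporaries, and control flow expressed with goto. */
int area_three_address(int width) {
    int scale, height, t1;
    scale = 4;
    height = scale * 2;
    if (scale > 10) goto fail;
    t1 = width * height;
    return t1;
fail:
    t1 = -1;
    return t1;
}

/* After the optimizations, applied by hand:
 *   constant propagation:  height = 4 * 2, condition becomes 4 > 10
 *   constant folding:      height = 8, condition becomes false
 *   dead code elimination: the branch and the unused variables disappear */
int area_optimized(int width) {
    return width * 8;
}

/* All three versions must agree, since optimizations preserve behavior. */
void check_equivalence(int width) {
    assert(area_original(width) == area_optimized(width));
    assert(area_three_address(width) == area_optimized(width));
}
```

Compiling the first function with optimization enabled and inspecting the dump files is a good way to compare this hand-derived result with what GCC actually produces.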

2.5 Target Code Generation

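The code generator converts the optimized IR into assembly code for the target machine. In GCC, GIMPLE is first expanded into RTL (Register Transfer Language), a lower-level IR on which the steps below are performed. The code generator is responsible for instruction selection, register allocation, and instruction scheduling.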

  • Instruction Selection: Maps IR operations to the instruction set of the target CPU.
  • Register Allocation: Decides which values are stored in the limited hardware registers and which need to be spilled to memory.
  • Instruction Scheduling: Reorders instructions to fully utilize the CPU’s pipeline and reduce stalls.

3. Assembly

The assembly stage translates the assembly code generated by the compiler (<span>.s</span> file) into relocatable object code. This step is performed by the assembler (<span>as</span>).

Assembly commands in GCC:

gcc -c source.s -o source.o
# Or start directly from the source code, completing preprocessing, compilation, and assembly, then stopping
gcc -c source.c -o source.o
  • The <span>-c</span> option instructs GCC to compile or assemble the source file but not link it.
  • The output is an object file (<span>.o</span>; some other toolchains use <span>.obj</span>), which contains machine instructions and data whose final addresses have not yet been assigned.

Object File Format: On Unix-like systems, object files typically use ELF (Executable and Linkable Format). A relocatable ELF file contains:

  • Code Section (.text): Contains the compiled machine instructions.
  • Data Sections: .data holds initialized global and static variables, .bss reserves space for uninitialized or zero-initialized ones, and .rodata holds read-only constants such as string literals.
  • Symbol Table: Records the symbols (function and variable names) defined in or referenced by the file, together with their attributes.
  • Relocation Information: Records the address information that needs to be modified during linking.
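The mapping from C declarations to these parts of an object file can be sketched with a few definitions. The placement noted in each comment is what is typical on ELF targets and may vary with the platform and compiler options; all names are hypothetical.

```c
#include <assert.h>
#include <string.h>

int initialized_counter = 42;     /* initialized and writable: typically .data */
int zeroed_buffer[256];           /* zero-initialized: typically .bss */
const char banner[] = "hello";    /* read-only data: typically .rodata */
static int call_count;            /* .bss as well, but a local (file-scope) symbol */

/* Function bodies are machine code: typically .text.
 * The call to strlen is a reference to a symbol defined elsewhere (libc),
 * so it typically appears as an undefined entry in this file's symbol table
 * until the linker resolves it. */
size_t text_length(const char *s) {
    assert(s != NULL);
    call_count++;
    return strlen(s);
}

/* Increments and returns the global counter defined above. */
int bump(void) { return ++initialized_counter; }

/* Reports how many times text_length has been called. */
int calls_so_far(void) { return call_count; }
```

After compiling such a file with <span>gcc -c</span>, tools such as <span>nm</span>, <span>objdump -h</span>, or <span>readelf -S</span> can be used to inspect where each symbol actually ended up.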

4. Linking

Linking is the final step of the compilation process, performed by the linker (<span>ld</span>). It combines one or more object files and the required library files (static or dynamic libraries), resolves their symbol references, and ultimately generates an executable file or shared library.

Linking commands in GCC:

gcc source1.o source2.o -o executable -lm
  • GCC automatically invokes the linker and links all <span>.o</span> files and specified libraries (e.g., <span>-lm</span> indicates the math library).

Main tasks of the linking stage:

  1. Symbol Resolution: The linker scans all object files to find the corresponding definitions for each symbol reference. Symbols can be functions or global variables. Any unresolved symbol references will result in an error at this stage.
  2. Relocation:
    • Merging Sections: Merges sections of the same type (e.g., <span>.text</span>, <span>.data</span>) from different object files and assigns each section and symbol its final address.
    • Applying Relocation Entries: Patches the symbol references in code and data according to the relocation information so that they point to the correct memory locations.
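Symbol resolution and relocation can be modeled in a few lines. The following toy is not how <span>ld</span> or ELF represent things: the <span>Symbol</span> and <span>Relocation</span> structures are hypothetical, every reference is assumed to be a 4-byte absolute address, and addresses are stored in the host's byte order.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct { const char *name; uint32_t address; } Symbol;       /* a resolved definition */
typedef struct { uint32_t offset; const char *symbol; } Relocation;  /* a place to patch */

/* Symbol resolution: looks up a name among the known definitions.
 * Returns 1 and stores the address on success, 0 if the symbol is undefined. */
static int resolve(const Symbol *table, size_t count,
                   const char *name, uint32_t *out) {
    for (size_t i = 0; i < count; i++) {
        if (strcmp(table[i].name, name) == 0) {
            *out = table[i].address;
            return 1;
        }
    }
    return 0;
}

/* Relocation: patches each recorded offset with the final address of the
 * symbol it refers to. Returns the number of unresolved references;
 * a real linker would report each of them as an error. */
int apply_relocations(uint8_t *image, size_t image_size,
                      const Relocation *relocs, size_t reloc_count,
                      const Symbol *symbols, size_t symbol_count) {
    int unresolved = 0;
    for (size_t i = 0; i < reloc_count; i++) {
        uint32_t address;
        assert(relocs[i].offset + sizeof address <= image_size);
        if (!resolve(symbols, symbol_count, relocs[i].symbol, &address)) {
            unresolved++;
            continue;
        }
        memcpy(image + relocs[i].offset, &address, sizeof address);
    }
    return unresolved;
}

/* Reads back a patched 4-byte value from the image. */
uint32_t read_u32(const uint8_t *image, size_t offset) {
    uint32_t value;
    memcpy(&value, image + offset, sizeof value);
    return value;
}
```

A reference to a symbol that no object file or library defines leaves its placeholder untouched and is counted as unresolved, which corresponds to the "undefined reference" errors reported at link time.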

Types of Linking:

  • Static Linking:
    • Copies the required library code into the final executable file at link time.
    • Advantages: The executable file is self-contained and does not require external libraries at runtime.
    • Disadvantages: The file is larger, every program carries its own copy of the library code on disk and in memory, and picking up a library update requires relinking the program.
    • GCC Usage: By default, GCC links against the shared version of a library when one is available; use the <span>-static</span> option to force static linking.
  • Dynamic Linking (Shared Linking):
    • The executable file only records the names and symbol information of the required shared libraries, without including the actual library code.
    • At Runtime: The operating system’s dynamic linker (e.g., <span>ld-linux.so</span>) loads the required shared libraries (<span>.so</span> files) into memory and completes the final address relocation.
    • Advantages: Significantly reduces the size of the executable file and allows multiple programs to share the same library code in memory; updating the library does not require relinking the programs that use it, as long as the new version stays binary compatible.
    • Disadvantages: The executable file depends on the external environment, and incompatible library versions may cause issues.
    • GCC Usage: Use the <span>-l<library></span> option to link against a library (the shared version is preferred when both shared and static versions exist), and use <span>-L</span> to add library search directories at link time.

5. Summary: Overview of the GCC Compilation Process

A complete GCC compilation process can be intuitively represented by the following commands:

# 1. Preprocessing: Generate hello.i
gcc -E hello.c -o hello.i
# 2. Compilation: Generate hello.s
gcc -S hello.i -o hello.s
# 3. Assembly: Generate hello.o
gcc -c hello.s -o hello.o
# 4. Linking: Generate the final executable file hello
gcc hello.o -o hello

# A more common approach is to do it in one step:
gcc hello.c -o hello

The GCC compilation process is a precise pipeline, with each stage performing its role while closely collaborating. Preprocessing prepares “clean” code for compilation; the core five stages refine high-level language into low-level assembly; the assembler converts symbolic assembly instructions into machine code; and the linker ultimately weaves the scattered modules into a complete executable whole. Understanding this process is crucial for cross-platform development, performance optimization, and debugging complex issues.
