Understanding Compiler Principles for Everyone

Mathematics Algorithm Club

Date: December 25, 2021

Word Count: 4,255

Source: Job Online
Understanding the internal principles of compilers can help you use them more effectively. This article walks through how programming languages and compilers work, following the stages of compilation in order. It contains numerous links, sample code, and diagrams to aid your understanding of compilers.

Author’s Note:

This is a reprint of my second article on Medium; the previous version received over 21,000 views. I am glad it has helped people learn, and I have completely rewritten it based on the comments on that version.
I chose Rust as the primary language for this article. It is an expressive, efficient, modern language that seems purpose-built for making compiler design simple. I enjoy using it. https://www.rust-lang.org/
This article is meant primarily to hold your attention rather than provide 20 pages of mind-numbing reading. For the deeper topics that interest you, there are many links in the article that lead to related material. Most links point to Wikipedia.
Thank you for your attention, and I hope you enjoy this article that I spent over 20 hours writing. Feel free to leave any questions or suggestions in the comments section at the bottom of the article.

A Brief Introduction

What is a Compiler?

What we call a programming language is, in essence, just a piece of software: the compiler. It reads a text file, processes it extensively, and ultimately produces a binary file. The "language" part is simply the style of text the compiler accepts. Since computers can only read 1s and 0s, and writing Rust programs is much simpler than writing binary directly, compilers are used to convert human-readable text into machine code that computers can execute.
A compiler can be any program that converts a text file into another file. For example, here is a compiler written in Rust that converts 0s to 1s and 1s to 0s:
// An example compiler that turns 0s into 1s, and 1s into 0s.
fn main() {
    let input = "1 0 1 A 1 0 1 3";
    // iterate over every character `c` in input
    let output: String = input.chars().map(|c|
        if c == '1' { '0' }
        else if c == '0' { '1' }
        else { c } // if not 0 or 1, leave it alone
    ).collect();
    println!("{}", output); // 0 1 0 A 0 1 0 3
}

What Does a Compiler Do?

In short, a compiler takes source code and produces a binary file. Because transforming complex, human-readable code directly into 0/1 binary is complicated, the compiler goes through several steps before producing a runnable program:
  1. Read the source code you provide, character by character.
  2. Classify those characters into tokens: words, numbers, symbols, and operators.
  3. Match patterns in the classified tokens to identify operations, determine what those operations do, and produce an operator tree (an expression tree).
  4. Finally, traverse every operator in the expression tree and produce the corresponding binary data.
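The four steps above can be sketched as a pipeline of small Rust functions. This is only an illustration under simplified assumptions (the `Token` and `Expr` types and the pretend "assembly" output are made up for this sketch, not a real compiler API):

```rust
// A hypothetical sketch of the four stages as a pipeline of functions.
#[derive(Debug)]
enum Token { Number(i32), Plus }

#[derive(Debug)]
enum Expr { Num(i32), Add(Box<Expr>, Box<Expr>) }

// Steps 1-2: read pieces of the source and classify them into tokens.
fn tokenize(source: &str) -> Vec<Token> {
    source.split_whitespace().map(|word| match word {
        "+" => Token::Plus,
        n => Token::Number(n.parse().expect("expected a number")),
    }).collect()
}

// Step 3: match token patterns and build an expression tree.
fn parse(tokens: &[Token]) -> Expr {
    match tokens {
        [Token::Number(a), Token::Plus, Token::Number(b)] =>
            Expr::Add(Box::new(Expr::Num(*a)), Box::new(Expr::Num(*b))),
        [Token::Number(a)] => Expr::Num(*a),
        _ => panic!("unexpected token sequence"),
    }
}

// Step 4: walk the tree and emit (pretend) output for each node.
fn generate(expr: &Expr) -> String {
    match expr {
        Expr::Num(n) => format!("push {}", n),
        Expr::Add(l, r) => format!("{}\n{}\nadd", generate(l), generate(r)),
    }
}

fn main() {
    let asm = generate(&parse(&tokenize("12 + 3")));
    println!("{}", asm);
}
```

Notice that each stage consumes only the previous stage's output; a real compiler keeps the same separation, just with far richer token and tree types.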
Although I said the compiler converts directly from the expression tree to binary, it actually produces assembly code first, which is then assembled into binary. Assembly is like a high-level, human-readable form of binary. More reading material about assembly language can be found here.


What is an Interpreter?

An interpreter is very similar to a compiler: it also reads code written in a programming language and processes it. However, the interpreter skips code generation and evaluates the AST directly. The biggest advantage of an interpreter is the time it saves while debugging: a compiler may take anywhere from a second to several minutes to compile a program, whereas an interpreter starts executing immediately, with no compile step. The biggest disadvantage is that the interpreter must be installed on the user's computer before the program can run.
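To make "evaluates the AST directly" concrete, here is a minimal tree-walking interpreter sketch; the `Expr` type is a hypothetical stand-in for a real language's AST:

```rust
// A minimal tree-walking interpreter: instead of generating code,
// it computes each node's result on the spot.
enum Expr { Num(i32), Add(Box<Expr>, Box<Expr>) }

fn eval(expr: &Expr) -> i32 {
    match expr {
        Expr::Num(n) => *n,
        // Recursively evaluate both operands, then add them right now.
        Expr::Add(l, r) => eval(l) + eval(r),
    }
}

fn main() {
    // The AST for "12 + 3", built by hand here.
    let ast = Expr::Add(Box::new(Expr::Num(12)), Box::new(Expr::Num(3)));
    println!("{}", eval(&ast)); // 15
}
```

Where a compiler's backend would emit instructions for `Add`, the interpreter simply performs the addition immediately; that is the whole difference in a nutshell.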


Although this article primarily focuses on compilers, it is crucial to clarify the differences between compilers and interpreters as well as compiler-related content.

1. Lexical Analysis

The first step is to break the input down token by token. This step is called lexical analysis, or tokenization. The key is that we combine characters into the words, identifiers, symbols, and so on that we need. Lexical analysis usually involves no logic: tokenizing 2+2 does not calculate anything; the expression simply becomes three tokens: a number (2), a plus sign, and another number (2).
Suppose you are parsing a string like 12+3: the tokenizer reads the characters 1, 2, +, and 3. We have the separate characters, but now we must combine them; this is one of the main tasks of the tokenizer. For example, we have the two individual characters 1 and 2, but we need to put them together and parse them as the integer 12. The + must likewise be recognized as a plus sign, not as its character value, which happens to be 43.
The Rust tokenizer linked below combines the digits into 32-bit integers and emits the plus sign as the token value Plus.
rust playground
You can click the “Run” button at the top left of the Rust playground to compile and execute the code in your browser.
In a programming language’s compiler, the lexical analyzer may need many different types of tokens. For example: symbols, numbers, identifiers, strings, operators, etc. What tokens to extract from the source file entirely depends on the programming language itself.
int main() {
    int a;
    int b;
    a = b = 4;
    return a - b;
}
Scanner production:
[Keyword(Int), Id("main"), Symbol(LParen), Symbol(RParen), Symbol(LBrace), Keyword(Int), Id("a"), Symbol(Semicolon), Keyword(Int), Id("b"), Symbol(Semicolon), Id("a"), Operator(Assignment), Id("b"),
Operator(Assignment), Integer(4), Symbol(Semicolon), Keyword(Return), Id("a"), Operator(Minus), Id("b"), Symbol(Semicolon), Symbol(RBrace)]
The tokens output by lexical analysis of the C sample code above
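A tokenizer of this kind can be sketched in Rust; note how consecutive digit characters are folded into a single integer token. The token names here are illustrative, not the ones from the linked playground:

```rust
// A tokenizer sketch that groups digit characters into integer tokens.
#[derive(Debug, PartialEq)]
enum Token { Integer(i32), Plus, Minus }

fn tokenize(input: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut chars = input.chars().peekable();
    while let Some(&c) = chars.peek() {
        match c {
            '0'..='9' => {
                // Combine consecutive digits, e.g. '1' and '2' become 12.
                let mut n = 0;
                while let Some(&d) = chars.peek() {
                    if let Some(v) = d.to_digit(10) {
                        n = n * 10 + v as i32;
                        chars.next();
                    } else {
                        break;
                    }
                }
                tokens.push(Token::Integer(n));
            }
            '+' => { chars.next(); tokens.push(Token::Plus); }
            '-' => { chars.next(); tokens.push(Token::Minus); }
            _ => { chars.next(); } // skip whitespace and anything unknown
        }
    }
    tokens
}

fn main() {
    println!("{:?}", tokenize("12+3")); // [Integer(12), Plus, Integer(3)]
}
```

A real lexer would track positions for error messages and reject characters it does not recognize instead of silently skipping them.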

2. Parsing

The parser is truly the core of syntax analysis. It consumes the tokens produced by the lexical analyzer, determines whether they conform to specific patterns, and associates those patterns with expressions such as function calls, variable references, and mathematical operations. The parser defines the syntax of the programming language, token by token.
The statements int a = 3 and a: int = 3 differ only in how the parser processes them. The parser determines the outward structure of the syntax: it ensures that parentheses and braces are balanced, that each statement ends with a semicolon, and that each function has a name. When tokens do not fit the expected patterns, the parser reports that their order is incorrect.
You can write several different types of parsers. One of the most common is the top-down recursive descent parser. Recursive descent parsers are the simplest to write and to understand. All the parser examples in this article are recursive descent.
The syntax that the parser accepts can be represented with a grammar. A grammar notation like EBNF can describe how a parser handles a simple mathematical expression such as 12+3:
expr = additive_expr;
additive_expr = term, ('+' | '-'), term;
term = number;
The EBNF syntax for simple addition and subtraction expressions
Keep in mind that a grammar file is not a parser, but it is a representation of one. You can build a parser around the grammar above. A grammar file can be read by people and is much simpler than reading and understanding the parser's code directly.
A parser built from this grammar would be an expr parser, since expr is the top-level rule that everything else hangs off. The only valid input is a number, then a plus or minus, then another number. expr requires an additive_expr, which covers addition and subtraction expressions. additive_expr first requires a term (a number), then a plus or minus, and finally another term.


The sample AST generated from parsing 12+3
The tree structure produced by the parser during parsing is called an Abstract Syntax Tree, or AST. The AST contains everything that needs to be operated on. The parser does not compute these operations; it merely collects the tokens in the correct order.
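As a sketch of how a grammar maps onto code, each rule becomes one function in a recursive descent parser. This is an illustration under the simple grammar above, not the code behind the playground link; the `Token` and `Expr` names are made up here:

```rust
// Recursive descent sketch for: additive_expr = term, ('+' | '-'), term;
#[derive(Debug)]
enum Expr { Num(i32), Add(Box<Expr>, Box<Expr>), Sub(Box<Expr>, Box<Expr>) }

#[derive(Clone, Copy)]
enum Token { Integer(i32), Plus, Minus }

struct Parser { tokens: Vec<Token>, pos: usize }

impl Parser {
    // term = number;
    fn term(&mut self) -> Expr {
        match self.tokens[self.pos] {
            Token::Integer(n) => { self.pos += 1; Expr::Num(n) }
            _ => panic!("expected a number"),
        }
    }

    // additive_expr = term, ('+' | '-'), term;
    fn additive_expr(&mut self) -> Expr {
        let left = self.term();          // first term
        let op = self.tokens[self.pos];  // the operator between them
        self.pos += 1;
        let right = self.term();         // second term
        match op {
            Token::Plus => Expr::Add(Box::new(left), Box::new(right)),
            Token::Minus => Expr::Sub(Box::new(left), Box::new(right)),
            _ => panic!("expected '+' or '-'"),
        }
    }
}

fn main() {
    // Tokens for "12+3", written out by hand.
    let mut p = Parser {
        tokens: vec![Token::Integer(12), Token::Plus, Token::Integer(3)],
        pos: 0,
    };
    println!("{:?}", p.additive_expr()); // Add(Num(12), Num(3))
}
```

The correspondence is direct: one function per grammar rule, and each function calls the functions for the rules it references. That is why grammar changes reflect so cleanly in the parser code.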
I have extended our lexical analyzer code so that it matches our grammar and produces an AST like the one in the chart. I marked the beginning and end of the new parser code with the comments // BEGIN PARSER // and // END PARSER //.
rust playground
We can delve deeper. Suppose we want to support inputs with only numbers and no operators, or add division and multiplication, or even add precedence. With just a simple modification to the grammar file, all of this is entirely possible, and any adjustments will directly reflect in our parser code.
expr = additive_expr;
additive_expr = multiplicative_expr, {('+' | '-'), multiplicative_expr};
multiplicative_expr = term, {('*' | '/'), term};
term = number;
The new syntax
https://play.rust-lang.org/?gist=1587a5dd6109f70cafe68818a8c1a883&version=nightly&mode=debug&edition=2018


An example scanner (also known as a lexical analyzer) and parser for C. Starting from the character sequence "if(net>0.0)total+=net(1.0+tax/100.0);", the scanner composes a sequence of tokens and classifies each one as an identifier, reserved word, number, or operator. The parser converts that sequence into a syntax tree, which is then processed in stages by the remaining parts of the compiler. The scanner and parser handle the regular and context-free parts of C's grammar, respectively. Credit: Jochen Burghardt. Source.

3. Code Generation

The code generator receives an AST and generates the corresponding code or assembly. It must traverse everything in the AST recursively, top-down, just as the parser works, and produce the corresponding output; except here the output is code, not a syntax tree.
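Continuing with the toy AST from earlier, a code generator can be sketched as one emit step per node. The stack-machine-style x86-64 instructions below are illustrative, not real compiler output:

```rust
// Sketch: walk the AST and emit one snippet of assembly per node.
enum Expr { Num(i32), Add(Box<Expr>, Box<Expr>) }

fn codegen(expr: &Expr, out: &mut String) {
    match expr {
        // A constant becomes an immediate value pushed on the stack.
        Expr::Num(n) => out.push_str(&format!("    push {}\n", n)),
        Expr::Add(l, r) => {
            // Generate code for both operands first (recursive descent),
            // then combine their results off the stack.
            codegen(l, out);
            codegen(r, out);
            out.push_str("    pop rbx\n    pop rax\n    add rax, rbx\n    push rax\n");
        }
    }
}

fn main() {
    // The AST for "12 + 3".
    let ast = Expr::Add(Box::new(Expr::Num(12)), Box::new(Expr::Num(3)));
    let mut asm = String::new();
    codegen(&ast, &mut asm);
    print!("{}", asm);
}
```

Because each node only knows how to emit itself and recurse into its children, arbitrarily nested expressions fall out of the same few lines of code.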
https://godbolt.org/z/K8416_
If you copy and open the link above, you can see the assembly code generated by the sample code on the left. The third and fourth lines of the assembly code show how the compiler generates corresponding code for constants when it encounters them in the AST.
Godbolt Compiler Explorer is a great tool that lets you write code in a high-level language and see the assembly it produces. You might get a bit dizzy wondering what kind of code is generated, but don't forget to add optimization options for your compiler to see how smart it really is. (For Rust, it's -O.)
If you are interested in how compilers save a local variable to memory in assembly language, this article (the "Code Generation" section) explains the stack in detail. When variables are not local, compilers usually allocate space for them on the heap rather than the stack. You can read more about variable storage in this StackOverflow answer.
Since assembly is a completely different and complex topic, I won’t discuss it too much here. I just want to emphasize the importance of the code generator and its role. Furthermore, code generators can produce not only assembly code. The Haxe compiler has a backend that can produce over six different programming languages: including C++, Java, and Python.
The backend refers to the compiler's code generator or evaluator; the frontend, accordingly, is the lexical analyzer and parser. There is also a middle end, which is mostly concerned with optimization and the IR, explained later. The backend is typically independent of the frontend: it cares only about the AST it receives. This means the same backend can be reused across several different frontends or languages. The famous GNU Compiler Collection is an example of this.
I can’t find a better example of a code generator than my C compiler backend; you can check it out here.
After the assembly code is generated, it is written to a new assembly file (.s or .asm). That file is then passed to the assembler, which is the compiler for assembly language, and which generates the corresponding binary code. That binary code, in turn, is written to a new object file (.o).
Object files contain machine code, but they cannot be executed. To turn them into an executable file, the object files must be linked together. The linker takes this loose machine code and turns it into an executable file, a shared library, or a static library. More about linkers can be found here.
Linkers differ depending on the operating system. Any third-party linker should be able to compile the object code produced by your backend. Therefore, there is no need to create your own linker when writing a compiler.


A compiler may have an intermediate representation, or IR for short. The IR represents the original instructions losslessly for the purposes of optimization or translation into another language. The IR is no longer the original code; it is a lossless simplification that makes it possible to find potential optimizations. Loop unrolling and vectorization are both performed on the IR. More about IR-related optimizations can be found in this PDF.
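To illustrate what such an optimization does, here is loop unrolling written out by hand in Rust; an optimizing compiler performs the same transformation automatically on the IR, never on your source code:

```rust
// Original loop: one addition and one loop-condition check per element.
fn sum_simple(data: &[i32]) -> i32 {
    let mut sum = 0;
    for i in 0..data.len() {
        sum += data[i];
    }
    sum
}

// Unrolled by a factor of 4: the same result with a quarter of the
// condition checks, and the four independent additions per iteration
// are easier for the compiler to vectorize.
fn sum_unrolled(data: &[i32]) -> i32 {
    let mut sum = 0;
    let mut i = 0;
    while i + 4 <= data.len() {
        sum += data[i] + data[i + 1] + data[i + 2] + data[i + 3];
        i += 4;
    }
    while i < data.len() { // handle any leftover elements
        sum += data[i];
        i += 1;
    }
    sum
}

fn main() {
    let data = [1, 2, 3, 4, 5, 6, 7, 8];
    println!("{} {}", sum_simple(&data), sum_unrolled(&data)); // 36 36
}
```

The two functions always return the same value; the IR form lets the compiler prove that equivalence and pick whichever shape runs faster on the target machine.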

Conclusion

When you understand compilers, you can use your programming language more effectively. Perhaps one day you will be interested in creating your own programming language? I hope this helps you.

— THE END —

