How Compilers Convert High-Level Languages to Assembly Language

For programmers who work daily with high-level programming languages such as Java, C++, C#, and Python, understanding how compilers convert high-level languages into assembly language helps us better comprehend computer programming.

Compilers convert high-level languages into assembly language primarily through three steps: lexical analysis, syntax analysis, and syntax tree parsing.

This article describes this process in a concise and easy-to-understand manner, ensuring that even those without a deep background in computer science can easily grasp the content.

Step 1: Lexical Analysis

Initially, programs written in high-level languages are just a string of individual characters to the compiler. To enable the compiler to recognize this string of characters, it must read the source program character by character and split it into meaningful words. These split words are represented in the compiler as <identifier, semantic value> pairs.

To sequentially identify words from the source program string, the compiler needs to have scanning capabilities, which can typically be implemented using a set of finite state machines. To illustrate what a finite state machine is, here is an example.

The following diagram shows a finite state machine that recognizes numbers, which consist of an integer part and an optional decimal part. Therefore, according to this finite state machine, both 250 and 3.14159 can be recognized as valid numbers.

Figure 1: Finite State Machine

The green nodes are marked with a circular symbol, indicating that they are “accepting” states. This means that as long as our state reaches a green node, the data we have read so far is a valid number. For example, starting from the ‘start’ state in the diagram, if the number we read is 42.15, the sequence of states we go through is (1,2,2,3,3,3). Since this series of states ends in an “accepting” state (which is state 3 in the diagram), we have successfully read a valid number. Moreover, the number 42.15 is represented in the form of <identifier, value> as <NUMBER,42.15>. Here, NUMBER indicates that the content we have read is a number, while the text “42.15” represents the corresponding semantic element value.

We can define different finite state machines for different types of words. For example, variable names can consist of letters and underscores, operators can include +=, ->, and keywords can be words like “if” and “while”, while types can be words like “int” and “char”. By constructing a small finite state machine for each type of word, we can ultimately create a large state machine that can accept different types of words. We can store our large state machine in a tabular format. Thus, we have constructed a large state machine that serves as an automatic scanner capable of recognizing various types of words.

At this point, the first step is complete, which we typically refer to as the “lexical analysis” phase.

After the high-level source program undergoes lexical analysis, the result we obtain is <identifier, value> pairs for subsequent processing.

For clarity, here is a simple example.

For instance, our program statement:

if ( x == 2 ) { x = a + b; }

After lexical analysis, we obtain the following <identifier, value> pairs:

(KEYWORD,”if”), (IDENTIFIER,”x”), (OPERATOR,”==”), (NUMBER,”2″), (DELIMITER, “{“), (IDENTIFIER,”x”), (OPERATOR,”=”), (IDENTIFIER,”a”), (OPERATOR,”+”), (IDENTIFIER,”b”), (DELIMITER,”;”), (DELIMITER,”}”).

Step 2: Syntax Analysis

After completing the “lexical analysis”, we move on to the exciting “syntax analysis” phase. Through syntax analysis, we obtain a syntax tree.

For example, for the program statement from the first step, if ( x == 2 ) { x = a + b; }, the resulting syntax tree is shown in the following diagram.

Figure 2: Syntax Tree

There are many methods to generate a syntax tree; here we will introduce the simplest method: predictive parsing. The specific approach is to start scanning from one end of the data stream, creating a temporary syntax tree with placeholders for all previously unencountered elements, and then sequentially reading subsequent data to fill in this syntax tree. (This may sound abstract, so please see the example below.)

For the syntax tree in Figure 2, the specific generation process is as follows:

Figure 3: Syntax Tree Generation Example

The subsequent statement parts can be generated similarly, which we will skip here.

Step 3: Translating the Syntax Tree

With the syntax tree in hand, the next task is to construct a symbol table to determine the storage locations of various elements.

The specific approach is to traverse the syntax tree, extracting the different variables from the syntax tree and placing them in available storage locations. The compiler decides how to allocate storage locations.

The specific process can be described using the diagram below:

Figure 4: Allocating Storage Addresses for Variables

Variables that appear multiple times in the syntax tree point to the same address. Traversing the syntax tree from top to bottom and left to right, when encountering an if node, we execute the operations related to if, and when encountering an assignment node, we execute the operations related to assignment. The detailed steps are as follows:

First, we look for the smallest expression, as shown in the green and blue circles in the diagram below, which represent a minimal expression.

Figure 5: Finding the Minimal Expression

Next, we merge the minimal expression with its surrounding expressions.

Figure 6: Merging Expressions

Finally, we orderly merge all expressions to obtain the final assembly language description, as shown in Figure 7.

Figure 7: Final Generated Assembly Language

Thus, we have translated high-level language into assembly language.

(My skills are limited; if there are any errors in the translation, please feel free to provide feedback. Thank you!)

This article is adapted from: Programmer’s Home

How Compilers Convert High-Level Languages to Assembly Language

WeChat ID: IdeaofSE

Leave a Comment