Building a Compiler with Rust and LLVM

Cyclang is a simple, statically typed programming language that supports basic constructs such as functions, control flow, and arithmetic operations. The feature set is deliberately small, and the complete language specification is available in the official documentation. This project focuses on the process of building a compiler from scratch with Rust and LLVM rather than on the complexity of the language itself.

Here is an example of Cyclang code:

fn fib(i32 n) -> i32 {  
    if (n < 2) {  
        return n;  
    }  
    return fib(n - 1) + fib(n - 2);  
}  
print(fib(20));

Why Choose Rust?

Rust is a strong choice for building compilers for many reasons. A GitHub article, “Why Rust is the most admired language among developers,” covers the broader case in detail.

One of Rust’s main advantages is pattern matching. A match expression must be exhaustive: the compiler rejects the code if any case is left unhandled, so a forgotten variant is caught at compile time instead of surfacing as a runtime bug. Here is a simple example:

enum Token {  
    Number(i64),  
    Plus,  
    Minus,  
    LeftParen,  
    RightParen,  
}

fn evaluate_token(token: Token) -> String {  
    match token {  
        Token::Number(n) => format!("Found number: {}", n),  
        Token::Plus | Token::Minus => "Found operator".to_string(),  
        Token::LeftParen => "Found opening parenthesis".to_string(),  
        Token::RightParen => "Found closing parenthesis".to_string(),  
        // The compiler ensures all cases are handled  
    }  
}

Rust’s type system and ownership model make it very suitable for building robust parsers and compilers. The combination of zero-cost abstractions, memory safety guarantees, and rich algebraic data types allows for clear and efficient expression of complex language constructs.

For example, it is easy to build on the token type and write a small expression evaluator:

enum Expr {  
    Binary(Box<Expr>, Token, Box<Expr>),  
    Literal(i64),  
    Group(Box<Expr>)  
}

fn evaluate(expr: &Expr) -> Result<i64, String> {  
    match expr {  
        Expr::Literal(n) => Ok(*n),  
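        Expr::Binary(left, Token::Plus, right) =>
            Ok(evaluate(left)? + evaluate(right)?),
        Expr::Binary(left, Token::Minus, right) =>
            Ok(evaluate(left)? - evaluate(right)?),
        Expr::Group(expr) => evaluate(expr),
        // Any other token in operator position is not a valid expression
        _ => Err("Invalid expression".into()),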
    }  
}

Architecture: From Source Code to Machine Code

Let’s take a closer look at how Cyclang converts code into executable instructions:

[Figure: Compiler Architecture]

Parsing with Pest

The first step of the compiler is to parse the source code into a structured format. For Cyclang, I chose Pest, a PEG-based parser generator for Rust. I had previously used LALRPOP, but I found that Pest provides a more intuitive way to define a grammar.

The parsing process mainly consists of two parts:

  • A grammar file (cyclo.pest) that defines the syntax rules of the language.
  • A parser implementation that converts the pairs matched by those rules into an abstract syntax tree (AST).
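To make the second step more concrete, here is a rough sketch of what an AST for the fib example from the introduction might look like. The type, variant, and helper names (Expr, Stmt, BinOp, fib_ast, and so on) are assumptions made for illustration and are not taken from Cyclang's source. Parameter and return types are left out for brevity, and the Pest step that would produce these values is omitted.

```rust
// Hypothetical AST types; Cyclang's real definitions may differ.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Number(i32),
    Variable(String),
    Binary(Box<Expr>, BinOp, Box<Expr>),
    Call(String, Vec<Expr>),
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum BinOp {
    Add,
    Sub,
    LessThan,
}

#[derive(Debug, Clone, PartialEq)]
enum Stmt {
    If { cond: Expr, then_body: Vec<Stmt> },
    Return(Expr),
    Function { name: String, params: Vec<String>, body: Vec<Stmt> },
}

// Small constructors that keep the tree below readable.
fn var(name: &str) -> Expr {
    Expr::Variable(name.to_string())
}

fn bin(left: Expr, op: BinOp, right: Expr) -> Expr {
    Expr::Binary(Box::new(left), op, Box::new(right))
}

fn call(name: &str, args: Vec<Expr>) -> Expr {
    Expr::Call(name.to_string(), args)
}

/// The tree a parser could produce for the `fib` function.
fn fib_ast() -> Stmt {
    Stmt::Function {
        name: "fib".to_string(),
        params: vec!["n".to_string()],
        body: vec![
            // if (n < 2) { return n; }
            Stmt::If {
                cond: bin(var("n"), BinOp::LessThan, Expr::Number(2)),
                then_body: vec![Stmt::Return(var("n"))],
            },
            // return fib(n - 1) + fib(n - 2);
            Stmt::Return(bin(
                call("fib", vec![bin(var("n"), BinOp::Sub, Expr::Number(1))]),
                BinOp::Add,
                call("fib", vec![bin(var("n"), BinOp::Sub, Expr::Number(2))]),
            )),
        ],
    }
}
```

Because every node is an enum variant, later compiler stages can walk this tree with the same exhaustive match expressions shown earlier.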

During development, I found that the grammar file became harder to maintain as the language grew more complex. In hindsight, writing a parser by hand might have offered more flexibility, but Pest’s declarative approach is well suited to rapid prototyping.

Integrating LLVM

The most interesting (and challenging) part of building Cyclang is the integration with LLVM. I chose to use llvm-sys, which provides raw Rust bindings for LLVM’s C API.

Why Choose LLVM?

LLVM (the name originally stood for Low Level Virtual Machine, though the project no longer treats it as an acronym) is a compiler infrastructure that provides the following features:

  • Powerful intermediate representation (IR) that abstracts away machine-specific details.
  • Advanced optimization capabilities.
  • Support for multiple target architectures.
  • Just-in-time (JIT) compilation capabilities.

Using the LLVM API

Using LLVM’s C API directly through llvm-sys has its pros and cons:

Pros:

  • Builds a deep understanding of LLVM’s architecture.
  • Fine control over IR generation.
  • Direct access to LLVM’s full feature set.

Cons:

  • Steep learning curve.
  • Code can be verbose even for simple operations.
  • Manual memory management required.

To manage complexity, I organized the LLVM-related code into a dedicated code generation module.
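To give a sense of what such a module ultimately has to produce, the sketch below builds the textual LLVM IR for a tiny function that adds two i32 values. It deliberately uses plain string formatting rather than llvm-sys, so it runs without an LLVM installation. This is a swapped-in technique for illustration, not how Cyclang generates code, and emit_add_function is a hypothetical name.

```rust
/// Returns textual LLVM IR for a function equivalent to
/// `fn <name>(i32 a, i32 b) -> i32 { return a + b; }`.
/// Hypothetical helper for illustration only; it does not call LLVM.
fn emit_add_function(name: &str) -> String {
    let mut ir = String::new();
    // Function signature: two i32 parameters, i32 return type.
    ir.push_str(&format!("define i32 @{}(i32 %a, i32 %b) {{\n", name));
    // Every function body starts with a labelled basic block.
    ir.push_str("entry:\n");
    // A single add instruction whose result is named %sum.
    ir.push_str("  %sum = add i32 %a, %b\n");
    // Each basic block must end in a terminator such as ret.
    ir.push_str("  ret i32 %sum\n");
    ir.push_str("}\n");
    ir
}
```

Through the C API, each of these lines corresponds to one or more calls (for instance LLVMAddFunction, LLVMAppendBasicBlockInContext, LLVMBuildAdd, and LLVMBuildRet), which is where much of the verbosity mentioned above comes from.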

Implementation Highlights

Translation of Control Flow

One interesting implementation detail is how high-level control structures are translated into LLVM IR. For example, Cyclang’s for loop is lowered (“desugared”) into a while loop at the IR level:

// Cyclang's for loop  
for (i32 i = 0; i < 10; i = i + 1) {  
    print(i);  
}

// Lowered as if it had been written:
{  
    i32 i = 0;  
    while (i < 10) {  
        print(i);  
        i = i + 1;  
    }  
}

The complete implementation is in builder.rs.
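The rewrite itself can be sketched as a small pass over a simplified AST. This is an illustration under assumptions, not the code in builder.rs: the Stmt type and the desugar function are hypothetical, expressions are kept as plain source strings, and the transformation happens on the tree rather than during IR generation.

```rust
// Hypothetical, simplified AST; expressions are kept as source strings.
#[derive(Debug, Clone, PartialEq)]
enum Stmt {
    Expr(String),
    Block(Vec<Stmt>),
    While { cond: String, body: Vec<Stmt> },
    For { init: Box<Stmt>, cond: String, step: Box<Stmt>, body: Vec<Stmt> },
}

/// Rewrites every `for` loop into `{ init; while (cond) { body; step; } }`.
fn desugar(stmt: Stmt) -> Stmt {
    match stmt {
        Stmt::For { init, cond, step, body } => {
            // Desugar nested loops first, then append the step to the body.
            let mut loop_body: Vec<Stmt> = body.into_iter().map(desugar).collect();
            loop_body.push(desugar(*step));
            // The enclosing block keeps the loop variable scoped to the loop.
            Stmt::Block(vec![desugar(*init), Stmt::While { cond, body: loop_body }])
        }
        Stmt::While { cond, body } => Stmt::While {
            cond,
            body: body.into_iter().map(desugar).collect(),
        },
        Stmt::Block(stmts) => Stmt::Block(stmts.into_iter().map(desugar).collect()),
        other @ Stmt::Expr(_) => other,
    }
}

/// The `for` loop from the example above, built by hand.
fn example_for_loop() -> Stmt {
    Stmt::For {
        init: Box::new(Stmt::Expr("i32 i = 0".to_string())),
        cond: "i < 10".to_string(),
        step: Box::new(Stmt::Expr("i = i + 1".to_string())),
        body: vec![Stmt::Expr("print(i)".to_string())],
    }
}
```

One general caveat with this kind of rewrite: in a language with a continue statement, jumping back to the loop condition would skip the appended step, so a real implementation would need a separate block for the step.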

Handling Complex Types

Implementing complex types (such as strings) is one of the most challenging parts. LLVM operates at a very low level, and even basic string operations require careful memory management and pointer manipulation. You can refer to this document for related LLVM IR implementations.

To simplify these operations, I adopted a hybrid approach: the complex routines are written in C, compiled to LLVM bitcode, and then loaded and linked with the rest of the generated IR. This significantly simplifies the implementation of the standard library.

Gains and Reflections

Parser Implementation

While Pest is a powerful parsing library, I plan to implement a handwritten parser in my next project. This will deepen my understanding of parsing techniques (especially recursive descent parsing and precedence parsing) and provide finer control over the parsing process.
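As a rough idea of what such a handwritten parser involves, the sketch below combines recursive descent (for numbers and parentheses) with precedence climbing (for binary operators). It is not part of Cyclang: the Parser type and its methods are hypothetical, and the Star token is added here only so that there are two precedence levels to demonstrate.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Token {
    Number(i64),
    Plus,
    Minus,
    Star, // assumption: added for this sketch to show a second precedence level
    LeftParen,
    RightParen,
}

#[derive(Debug, PartialEq)]
enum Expr {
    Literal(i64),
    Binary(Box<Expr>, Token, Box<Expr>),
}

struct Parser {
    tokens: Vec<Token>,
    pos: usize,
}

impl Parser {
    fn peek(&self) -> Option<Token> {
        self.tokens.get(self.pos).copied()
    }

    fn advance(&mut self) -> Option<Token> {
        let tok = self.peek();
        if tok.is_some() {
            self.pos += 1;
        }
        tok
    }

    /// Binding power of a binary operator; `None` for any other token.
    fn precedence(tok: Token) -> Option<u8> {
        match tok {
            Token::Plus | Token::Minus => Some(1),
            Token::Star => Some(2),
            _ => None,
        }
    }

    /// Recursive descent: a number or a parenthesised expression.
    fn parse_primary(&mut self) -> Result<Expr, String> {
        match self.advance() {
            Some(Token::Number(n)) => Ok(Expr::Literal(n)),
            Some(Token::LeftParen) => {
                let inner = self.parse_expr(0)?;
                match self.advance() {
                    Some(Token::RightParen) => Ok(inner),
                    other => Err(format!("expected ')', found {:?}", other)),
                }
            }
            other => Err(format!("unexpected token {:?}", other)),
        }
    }

    /// Precedence climbing: consume operators that bind at least as
    /// tightly as `min_prec`.
    fn parse_expr(&mut self, min_prec: u8) -> Result<Expr, String> {
        let mut left = self.parse_primary()?;
        while let Some(op) = self.peek() {
            let prec = match Self::precedence(op) {
                Some(p) if p >= min_prec => p,
                _ => break,
            };
            self.advance();
            // Using `prec + 1` on the right makes operators left-associative.
            let right = self.parse_expr(prec + 1)?;
            left = Expr::Binary(Box::new(left), op, Box::new(right));
        }
        Ok(left)
    }
}

/// Parses a whole token stream and rejects trailing tokens.
fn parse(tokens: Vec<Token>) -> Result<Expr, String> {
    let mut parser = Parser { tokens, pos: 0 };
    let expr = parser.parse_expr(0)?;
    match parser.peek() {
        None => Ok(expr),
        Some(tok) => Err(format!("unexpected trailing token {:?}", tok)),
    }
}

/// Evaluates a parsed tree; used here to check the grouping.
fn eval(expr: &Expr) -> i64 {
    match expr {
        Expr::Literal(n) => *n,
        Expr::Binary(l, Token::Plus, r) => eval(l) + eval(r),
        Expr::Binary(l, Token::Minus, r) => eval(l) - eval(r),
        Expr::Binary(l, Token::Star, r) => eval(l) * eval(r),
        Expr::Binary(..) => unreachable!("parser only builds nodes for operators"),
    }
}
```

The control this gives is visible in parse_expr: associativity and precedence are ordinary code paths rather than grammar annotations, so error messages and recovery can be tailored at each step.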

Test Automation

Although Cyclang has comprehensive test coverage, I realized that generating tests from files (a source program paired with its expected output) is more efficient than writing a separate Rust test case by hand for each program. This approach makes the test suite easier to maintain and extend.
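A minimal sketch of the file-based idea, under stated assumptions: each test is a pair of files in one directory, a source file and a file holding its expected output. The .cyc and .expected extensions, the TestCase type, and both function names are hypothetical and do not describe Cyclang's actual test layout. The run parameter stands in for whatever compiles and executes a program and returns its output.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// One test: a source program and the output it should print.
struct TestCase {
    name: String,
    source: String,
    expected: String,
}

/// Collects every `<name>.cyc` that has a matching `<name>.expected`.
fn discover_cases(dir: &Path) -> io::Result<Vec<TestCase>> {
    let mut cases = Vec::new();
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.extension().and_then(|e| e.to_str()) != Some("cyc") {
            continue;
        }
        let expected_path = path.with_extension("expected");
        if !expected_path.exists() {
            continue; // no expectation recorded, so nothing to check
        }
        cases.push(TestCase {
            name: path
                .file_stem()
                .and_then(|s| s.to_str())
                .unwrap_or("")
                .to_string(),
            source: fs::read_to_string(&path)?,
            expected: fs::read_to_string(&expected_path)?,
        });
    }
    // Directory order is unspecified, so sort for stable reports.
    cases.sort_by(|a, b| a.name.cmp(&b.name));
    Ok(cases)
}

/// Runs each case through `run` and returns the names of the failures.
/// Trailing whitespace is ignored when comparing output.
fn failing_cases(cases: &[TestCase], run: impl Fn(&str) -> String) -> Vec<String> {
    cases
        .iter()
        .filter(|c| run(&c.source).trim_end() != c.expected.trim_end())
        .map(|c| c.name.clone())
        .collect()
}
```

With this layout, adding a test means dropping two files into the directory; no Rust code changes are needed.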

LLVM Integration

Direct integration with the LLVM API increased the complexity of the code generation module. In future projects, I plan to abstract this part into a standalone library or use existing solutions (like Inkwell) to improve code organization and maintainability.

Conclusion

Cyclang started as a toy project for learning Rust and LLVM and gradually evolved into an exploration of compiler design. Although the project is small, building it from scratch taught me valuable lessons about language design, low-level programming, and the complexities of modern compilers. I hope this write-up offers useful insights to readers interested in similar explorations.
