C# Compiler Tutorial (1): Exploring Lexical Analysis
Hello everyone! Today I want to explore a super interesting topic with you all — developing a simple compiler in C#! As the first tutorial, we will start with the basics of lexical analysis. Don’t be intimidated by the word “compiler”; follow me step by step, and you’ll find that compilers aren’t so mysterious after all~
Setting Up The Development Environment
First, we need to prepare the development environment. We will use:
- Visual Studio 2022 Community Edition
- .NET 6.0 or higher
- NUnit (for unit testing)
What Is Lexical Analysis?
Lexical analysis is like breaking sentences into individual words when we read an article. For example, when we see the line of code "int age = 18;", the compiler breaks it down into:
- Keyword: int
- Identifier: age
- Operator: =
- Number: 18
- Delimiter: ;
These extracted “words” are referred to as “tokens” in compiler theory.
Designing The Token Class
Let’s first create a class that represents a token:
public enum TokenType
{
// Keywords
Int,
// Operators
Plus,
Minus,
Assign,
// Identifiers and literals
Identifier,
Number,
// Delimiters
Semicolon,
EOF // End of file marker
}
public class Token
{
public TokenType Type { get; }
public string Value { get; }
public int Line { get; }
public int Column { get; }
public Token(TokenType type, string value, int line, int column)
{
Type = type;
Value = value;
Line = line;
Column = column;
}
public override string ToString()
{
return $"Token({Type}, '{Value}') at Line {Line}, Column {Column}";
}
}
Tip: We added line and column information to the Token class, which will be very helpful for error reporting later!
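For example, once tokens carry their position, error messages can point straight at the problem. Below is a minimal sketch of such a helper; the LexicalError class and its Format method are just an illustration of the idea, not something we will need later in this tutorial:
// Illustration only: builds an error message that points at the offending
// column using the Line/Column information stored in a Token.
public static class LexicalError
{
    public static string Format(string sourceLine, Token token, string message)
    {
        // Draw a caret under the column where the problematic token starts
        var caret = new string(' ', token.Column - 1) + "^";
        return $"Line {token.Line}, Column {token.Column}: {message}\n{sourceLine}\n{caret}";
    }
}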
Implementing The Lexer
Next, let’s implement a simple lexer:
public class Lexer
{
private readonly string _input;
private int _position;
private int _line = 1;
private int _column = 1;
public Lexer(string input)
{
_input = input;
_position = 0;
}
private char CurrentChar => _position < _input.Length ? _input[_position] : '\0';
private void Advance()
{
_position++;
_column++;
}
public Token GetNextToken()
{
while (CurrentChar != '\0')
{
// Skip whitespace characters
if (char.IsWhiteSpace(CurrentChar))
{
SkipWhitespace();
continue;
}
// Recognize numbers
if (char.IsDigit(CurrentChar))
{
return ReadNumber();
}
// Recognize identifiers and keywords
if (char.IsLetter(CurrentChar))
{
return ReadIdentifier();
}
// Recognize operators and delimiters
switch (CurrentChar)
{
case '+':
Advance();
return new Token(TokenType.Plus, "+", _line, _column - 1);
case '-':
Advance();
return new Token(TokenType.Minus, "-", _line, _column - 1);
case '=':
Advance();
return new Token(TokenType.Assign, "=", _line, _column - 1);
case ';':
Advance();
return new Token(TokenType.Semicolon, ";", _line, _column - 1);
default:
throw new Exception($"Unrecognized character '{CurrentChar}' at Line {_line}, Column {_column}");
}
}
return new Token(TokenType.EOF, "", _line, _column);
}
private Token ReadNumber()
{
var startColumn = _column;
var result = "";
while (char.IsDigit(CurrentChar))
{
result += CurrentChar;
Advance();
}
return new Token(TokenType.Number, result, _line, startColumn);
}
private Token ReadIdentifier()
{
var startColumn = _column;
var result = "";
while (char.IsLetterOrDigit(CurrentChar))
{
result += CurrentChar;
Advance();
}
// Check if it's a keyword
if (result == "int")
{
return new Token(TokenType.Int, result, _line, startColumn);
}
return new Token(TokenType.Identifier, result, _line, startColumn);
}
private void SkipWhitespace()
{
while (CurrentChar != '\0' && char.IsWhiteSpace(CurrentChar))
{
if (CurrentChar == '\n')
{
_line++;
_column = 0;
}
Advance();
}
}
}
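Before we write formal tests, here is a quick way to see the lexer in action from a console app. This is just a throwaway driver of our own (any Main method will do), not part of the lexer itself:
using System;

public static class Program
{
    public static void Main()
    {
        var lexer = new Lexer("int age = 18;");
        Token token;
        do
        {
            // Print every token, including the final EOF marker
            token = lexer.GetNextToken();
            Console.WriteLine(token); // uses Token.ToString()
        }
        while (token.Type != TokenType.EOF);
    }
}
Running it should print one Token(...) line per token, ending with the EOF marker.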
Testing Our Lexer
To ensure our lexer works correctly, we write a simple test:
using NUnit.Framework;
[TestFixture]
public class LexerTests
{
[Test]
public void TestSimpleExpression()
{
var input = "int age = 18;";
var lexer = new Lexer(input);
var tokens = new List<Token>();
Token token;
while ((token = lexer.GetNextToken()).Type != TokenType.EOF)
{
tokens.Add(token);
}
Assert.That(tokens.Count, Is.EqualTo(5));
Assert.That(tokens[0].Type, Is.EqualTo(TokenType.Int));
Assert.That(tokens[1].Type, Is.EqualTo(TokenType.Identifier));
Assert.That(tokens[2].Type, Is.EqualTo(TokenType.Assign));
Assert.That(tokens[3].Type, Is.EqualTo(TokenType.Number));
Assert.That(tokens[4].Type, Is.EqualTo(TokenType.Semicolon));
}
}
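If you want to go one step further, you could also assert on token values and positions. Here is one possible extra test to drop into the same LexerTests class (just a sketch):
[Test]
public void TestTokenValueAndPosition()
{
    var lexer = new Lexer("age = 18;");
    var first = lexer.GetNextToken();

    // "age" is an identifier starting at line 1, column 1
    Assert.That(first.Type, Is.EqualTo(TokenType.Identifier));
    Assert.That(first.Value, Is.EqualTo("age"));
    Assert.That(first.Line, Is.EqualTo(1));
    Assert.That(first.Column, Is.EqualTo(1));
}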
Notes:
- Currently, our lexer is quite simple and can only handle basic integers, identifiers, and a few simple operators.
- In a real compiler, we also need to handle more cases, such as floating-point numbers, strings, and comments (see the sketch after this list).
- Error handling also needs to be more robust, for example dealing with illegal characters and numeric overflow.
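To give you a taste of what handling "more cases" looks like, here is one possible way ReadNumber could be extended to accept floating-point literals. This is only a sketch: it assumes we add a Float member to TokenType, add using System.Text; for StringBuilder, and use it to replace the ReadNumber method inside Lexer:
// Sketch: ReadNumber extended to accept an optional fractional part.
// Assumes TokenType gains a Float member and the file has "using System.Text;".
private Token ReadNumber()
{
    var startColumn = _column;
    var result = new StringBuilder();
    var isFloat = false;

    while (char.IsDigit(CurrentChar))
    {
        result.Append(CurrentChar);
        Advance();
    }

    // A '.' followed by a digit turns the literal into a float
    if (CurrentChar == '.' && _position + 1 < _input.Length && char.IsDigit(_input[_position + 1]))
    {
        isFloat = true;
        result.Append(CurrentChar);
        Advance();
        while (char.IsDigit(CurrentChar))
        {
            result.Append(CurrentChar);
            Advance();
        }
    }

    var type = isFloat ? TokenType.Float : TokenType.Number;
    return new Token(type, result.ToString(), _line, startColumn);
}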
Performance Optimization Tips
If you need to handle large files, consider the following optimizations:
- Use StringBuilder instead of string concatenation
- Use character buffers to reduce I/O operations
- Use lookup tables (Dictionary) to optimize keyword checking (see the sketch after this list)
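For the last tip, the idea is to replace the keyword check in ReadIdentifier with a table lookup, so we don't end up with a long if/else chain as the language grows. A rough sketch, assuming both members live inside Lexer and that using System.Collections.Generic; and using System.Text; are in place:
// Sketch: a keyword table plus a ReadIdentifier that uses it (and StringBuilder).
private static readonly Dictionary<string, TokenType> Keywords = new()
{
    ["int"] = TokenType.Int,
    // add more keywords here as the language grows
};

private Token ReadIdentifier()
{
    var startColumn = _column;
    var result = new StringBuilder();

    while (char.IsLetterOrDigit(CurrentChar))
    {
        result.Append(CurrentChar);
        Advance();
    }

    var text = result.ToString();
    var type = Keywords.TryGetValue(text, out var keywordType) ? keywordType : TokenType.Identifier;
    return new Token(type, text, _line, startColumn);
}
A dictionary lookup stays fast no matter how many keywords you add, whereas a chain of string comparisons keeps getting longer.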
Friends, that's the end of today's C# learning journey! Remember to practice the code yourself, and feel free to ask questions in the comments. Lexical analysis is only the first stage of a compiler; next time we will move on to syntax analysis, semantic analysis, and other exciting topics. Happy learning, and may your C# development skills keep growing! Code changes the world; see you in the next installment!