C# Compiler Tutorial (1): Exploring Lexical Analysis
Hello everyone! Today I want to explore a super interesting topic with you all — developing a simple compiler in C#! As the first tutorial, we will start with the basics of lexical analysis. Don’t be intimidated by the word “compiler”; follow me step by step, and you’ll find that compilers aren’t so mysterious after all~
Setting Up The Development Environment
First, we need to prepare the development environment. We will use:
- Visual Studio 2022 Community Edition
- .NET 6.0 or higher
- NUnit (for unit testing)
What Is Lexical Analysis?
Lexical analysis is like breaking sentences into individual words when we read an article. For example, when we see the line of code "int age = 18;", the compiler breaks it down into:
- Keyword: int
- Identifier: age
- Operator: =
- Number: 18
- Delimiter: ;
These extracted “words” are referred to as “tokens” in compiler theory.
Designing The Token Class
Let’s first create a class that represents a token:
public enum TokenType
{
// Keywords
Int,
// Operators
Plus,
Minus,
Assign,
// Identifiers and literals
Identifier,
Number,
// Delimiters
Semicolon,
EOF // End of file marker
}
public class Token
{
public TokenType Type { get; }
public string Value { get; }
public int Line { get; }
public int Column { get; }
public Token(TokenType type, string value, int line, int column)
{
Type = type;
Value = value;
Line = line;
Column = column;
}
public override string ToString()
{
return $"Token({Type}, '{Value}') at Line {Line}, Column {Column}";
}
}
Tip: We added line and column information to the Token class, which will be very helpful for error reporting later!
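For example, once tokens carry their position, error messages can point straight at the problem. Below is a minimal sketch of such a helper; the LexicalError class and its Format method are just an illustration of the idea, not something we will need later in this tutorial:
// Illustration only: builds an error message that points at the offending
// column using the Line/Column information stored in a Token.
public static class LexicalError
{
    public static string Format(string sourceLine, Token token, string message)
    {
        // Draw a caret under the column where the problematic token starts
        var caret = new string(' ', token.Column - 1) + "^";
        return $"Line {token.Line}, Column {token.Column}: {message}\n{sourceLine}\n{caret}";
    }
}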
Implementing The Lexer
Next, let’s implement a simple lexer:
public class Lexer
{
private readonly string _input;
private int _position;
private int _line = 1;
private int _column = 1;
public Lexer(string input)
{
_input = input;
_position = 0;
}
private char CurrentChar => _position < _input.Length ? _input[_position] : '\0';
private void Advance()
{
_position++;
_column++;
}
public Token GetNextToken()
{
while (CurrentChar != '\0')
{
// Skip whitespace characters
if (char.IsWhiteSpace(CurrentChar))
{
SkipWhitespace();
continue;
}
// Recognize numbers
if (char.IsDigit(CurrentChar))
{
return ReadNumber();
}
// Recognize identifiers and keywords
if (char.IsLetter(CurrentChar))
{
return ReadIdentifier();
}
// Recognize operators and delimiters
switch (CurrentChar)
{
case '+':
Advance();
return new Token(TokenType.Plus, "+", _line, _column - 1);
case '-':
Advance();
return new Token(TokenType.Minus, "-", _line, _column - 1);
case '=':
Advance();
return new Token(TokenType.Assign, "=", _line, _column - 1);
case ';':
Advance();
return new Token(TokenType.Semicolon, ";", _line, _column - 1);
default:
throw new Exception($"Unrecognized character '{CurrentChar}' at Line {_line}, Column {_column}");
}
}
return new Token(TokenType.EOF, "", _line, _column);
}
private Token ReadNumber()
{
var startColumn = _column;
var result = "";
while (char.IsDigit(CurrentChar))
{
result += CurrentChar;
Advance();
}
return new Token(TokenType.Number, result, _line, startColumn);
}
private Token ReadIdentifier()
{
var startColumn = _column;
var result = "";
while (char.IsLetterOrDigit(CurrentChar))
{
result += CurrentChar;
Advance();
}
// Check if it's a keyword
if (result == "int")
{
return new Token(TokenType.Int, result, _line, startColumn);
}
return new Token(TokenType.Identifier, result, _line, startColumn);
}
private void SkipWhitespace()
{
while (CurrentChar != '\0' && char.IsWhiteSpace(CurrentChar))
{
if (CurrentChar == '\n')
{
_line++;
_column = 0;
}
Advance();
}
}
}
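Before we write formal tests, here is a quick way to see the lexer in action from a console app. This is just a throwaway driver of our own (any Main method will do), not part of the lexer itself:
using System;

public static class Program
{
    public static void Main()
    {
        var lexer = new Lexer("int age = 18;");
        Token token;
        do
        {
            // Print every token, including the final EOF marker
            token = lexer.GetNextToken();
            Console.WriteLine(token); // uses Token.ToString()
        }
        while (token.Type != TokenType.EOF);
    }
}
Running it should print one Token(...) line per token, ending with the EOF marker.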
Testing Our Lexer
To ensure our lexer works correctly, we write a simple test:
using NUnit.Framework;
[TestFixture]
public class LexerTests
{
[Test]
public void TestSimpleExpression()
{
var input = "int age = 18;";
var lexer = new Lexer(input);
var tokens = new List<Token>();
Token token;
while ((token = lexer.GetNextToken()).Type != TokenType.EOF)
{
tokens.Add(token);
}
Assert.That(tokens.Count, Is.EqualTo(5));
Assert.That(tokens[0].Type, Is.EqualTo(TokenType.Int));
Assert.That(tokens[1].Type, Is.EqualTo(TokenType.Identifier));
Assert.That(tokens[2].Type, Is.EqualTo(TokenType.Assign));
Assert.That(tokens[3].Type, Is.EqualTo(TokenType.Number));
Assert.That(tokens[4].Type, Is.EqualTo(TokenType.Semicolon));
}
}
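If you want to go one step further, you could also assert on token values and positions. Here is one possible extra test to drop into the same LexerTests class (just a sketch):
[Test]
public void TestTokenValueAndPosition()
{
    var lexer = new Lexer("age = 18;");
    var first = lexer.GetNextToken();

    // "age" is an identifier starting at line 1, column 1
    Assert.That(first.Type, Is.EqualTo(TokenType.Identifier));
    Assert.That(first.Value, Is.EqualTo("age"));
    Assert.That(first.Line, Is.EqualTo(1));
    Assert.That(first.Column, Is.EqualTo(1));
}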
Notes:
- Currently, our lexer is quite simple and can only handle basic integers, identifiers, and a few simple operators.
- In a real compiler, we also need to handle more cases, such as floating-point numbers, strings, and comments (see the sketch after this list).
- Error handling also needs to be more robust, for example dealing with illegal characters and numeric overflow.
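To give you a taste of what handling "more cases" looks like, here is one possible way ReadNumber could be extended to accept floating-point literals. This is only a sketch: it assumes we add a Float member to TokenType, add using System.Text; for StringBuilder, and use it to replace the ReadNumber method inside Lexer:
// Sketch: ReadNumber extended to accept an optional fractional part.
// Assumes TokenType gains a Float member and the file has "using System.Text;".
private Token ReadNumber()
{
    var startColumn = _column;
    var result = new StringBuilder();
    var isFloat = false;

    while (char.IsDigit(CurrentChar))
    {
        result.Append(CurrentChar);
        Advance();
    }

    // A '.' followed by a digit turns the literal into a float
    if (CurrentChar == '.' && _position + 1 < _input.Length && char.IsDigit(_input[_position + 1]))
    {
        isFloat = true;
        result.Append(CurrentChar);
        Advance();
        while (char.IsDigit(CurrentChar))
        {
            result.Append(CurrentChar);
            Advance();
        }
    }

    var type = isFloat ? TokenType.Float : TokenType.Number;
    return new Token(type, result.ToString(), _line, startColumn);
}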
Performance Optimization Tips
If you need to handle large files, consider the following optimizations:
- Use StringBuilder instead of string concatenation
- Use character buffers to reduce I/O operations
- Use lookup tables (Dictionary) to optimize keyword checking (see the sketch after this list)
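For the last tip, the idea is to replace the keyword check in ReadIdentifier with a table lookup, so we don't end up with a long if/else chain as the language grows. A rough sketch, assuming both members live inside Lexer and that using System.Collections.Generic; and using System.Text; are in place:
// Sketch: a keyword table plus a ReadIdentifier that uses it (and StringBuilder).
private static readonly Dictionary<string, TokenType> Keywords = new()
{
    ["int"] = TokenType.Int,
    // add more keywords here as the language grows
};

private Token ReadIdentifier()
{
    var startColumn = _column;
    var result = new StringBuilder();

    while (char.IsLetterOrDigit(CurrentChar))
    {
        result.Append(CurrentChar);
        Advance();
    }

    var text = result.ToString();
    var type = Keywords.TryGetValue(text, out var keywordType) ? keywordType : TokenType.Identifier;
    return new Token(type, text, _line, startColumn);
}
A dictionary lookup stays fast no matter how many keywords you add, whereas a chain of string comparisons keeps getting longer.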
Friends, that's the end of today's C# learning journey! Remember to practice the code yourself, and feel free to ask questions in the comments. Lexical analysis is only the first stage of a compiler; next time we will move on to syntax analysis, semantic analysis, and other exciting topics. Happy learning, and may your C# development skills keep growing! Code changes the world; see you in the next installment!