Building from Scratch: Implementing Core Logic for Large Language Model Inference in C++

In the current era of large language models (LLMs), building an efficient inference framework from scratch allows us to gain a deeper understanding of the underlying logic of AI-generated content. C++, with its close-to-hardware and low-overhead characteristics, has become the preferred language for implementing lightweight LLM inference. This article will integrate the core design ideas of the framework with code implementation details, breaking down the complete path from principles to implementation in simple terms, so you can understand both “why it is designed this way” and “how it is implemented in code.”

1. Why Choose C++? The Logic Behind Language Selection for Inference Frameworks

The core requirements for LLM inference are “speed” and “efficiency”—rapidly processing massive matrix operations while minimizing memory and hardware resource usage. C++ perfectly meets these needs:

  • No interpreter overhead, allowing direct memory manipulation, with execution efficiency for core logic like matrix operations and loops far exceeding that of high-level languages;
  • Flexible hardware adaptation, capable of directly invoking CPU instruction sets or GPU CUDA interfaces, catering to both general devices and high-performance acceleration scenarios;
  • Small compiled size and few dependencies, allowing it to be packaged into standalone executable files for easy deployment across different devices.

In simple terms, C++ enables the inference framework to “run, run fast, and run anywhere,” making it an ideal vehicle for lightweight LLM inference.

2. Core Design: Four Modules Supporting the Inference Framework

A functional LLM inference framework does not require a complex and redundant structure; it consists of four core modules. Each module performs its specific role, completing the entire process from “text input → numerical conversion → inference computation → text output.”

1. Tokenizer: The “Translator” Between Language and Model

The model only recognizes numbers, not human language. The core task of the tokenizer is to translate text into a “numerical code” (token ID) that the model can understand, and then convert the numbers back to text after inference.

Core Principle: BPE Tokenization (High-Frequency Combination Priority)

Tokenization is essentially a game of “splitting and merging”: first, split the text into the smallest units (individual Chinese characters, English letters + symbols), then count high-frequency combinations (e.g., “artificial intelligence”, “LLM”), merging them into a fixed vocabulary, ultimately forming a vocabulary list and assigning unique numbers. For example, “Hello, my name is” would be converted into a token ID sequence like [31373, 1175, 1403, 318].

Code Implementation

  • Data Structure: Use <span>unordered_map</span> to store the vocabulary (string → token ID) and the reverse vocabulary (token ID → string), with the hash table’s O(1) lookup speed ensuring tokenization efficiency; use <span>vector</span> to store BPE merging rules, merging high-frequency substrings in order of priority.
  • Core Functions: <span>load</span> loads the vocabulary and merging rules, <span>encode</span> completes the conversion from text to token ID, and <span>decode</span> implements the reverse conversion.
  • Key Optimization: Use regular expressions to pre-compile text matching rules, quickly distinguishing letters, numbers, and symbols to avoid tokenization confusion; reuse memory throughout to reduce the overhead of frequent allocation and deallocation.

2. Model Loader: The “Mover” of Parameters

The LLM models we download online (such as GPT-2, Llama series) are essentially binary files storing neural network weights, biases, and other parameters. The loader’s role is to unpack and organize this “raw data,” converting it into matrices and vectors that the program can directly compute.

Core Principle: Parameter Mapping and Precision Adaptation

  • Mapping by Model Structure: Different models (like GPT-2 and Llama) have varying numbers of layers and neurons. The loader needs to read the configuration file and map parameters to the program-defined structures in the order of “embedding layer → Transformer layer → output layer.”
  • Flexible Precision Switching: Supports formats like FP32 (full precision), FP16 (half precision), and BF16 (brain half precision), where FP16 and BF16 can reduce memory usage by half with almost no loss in performance, significantly speeding up inference.

Code Implementation

  • Data Structure: Define <span>ModelParams</span> structure to store global parameters (word embedding matrix, position embedding matrix), and <span>TransformerLayerParams</span> to store the weights of a single Transformer layer (attention layer, feedforward network layer, normalization layer parameters).
  • Core Function: The <span>load</span> function is responsible for reading the binary weight file and JSON configuration file, parsing them to fill the corresponding matrices (<span>Matrix</span>) and vectors (<span>Vector</span>) while handling precision conversion.
  • Design Insight: The matrix is encapsulated in a custom <span>Matrix</span> class, which internally uses <span>vector</span> to store data, ensuring memory continuity while supporting basic operations (matrix multiplication, addition), making it more flexible than native two-dimensional arrays.

3. Transformer Decoder: The “Core Brain” of Inference

This is the core module of the framework, responsible for receiving the token ID sequence and generating complete responses through word-by-word prediction. The core is the Transformer decoder structure, with the main logic being “contextual association + token-by-token generation.”

Core Principle: Attention Mechanism and KV Cache

  • Self-Attention Mechanism: Like a “magnifying glass,” when predicting the next token, it can focus on the most relevant content in the preceding sequence (for example, when predicting the next token after “the capital of France,” it will pay special attention to “France” and “capital”). By calculating the matrix operations of queries (Q), keys (K), and values (V), it obtains attention weights, which are then weighted and summed to get contextual information.
  • Token-by-Token Generation: The model does not output the entire sentence at once; instead, it starts from the last token of the input sequence, predicting only the most likely next token each time, repeating this until the maximum length is reached.
  • KV Cache Optimization: Each time a new token is generated, the Q, K, and V calculation results of the preceding sequence are not recalculated but stored in the cache, with only the current token’s calculation results being added, reducing time complexity from O(n²) to O(n), significantly improving speed.

Code Implementation

  • Core Functions: <span>scaled_dot_product_attention</span> implements self-attention calculation, including QKV matrix operations, causal masking (to avoid future information leakage), softmax probability conversion, and KV cache updates; <span>generate</span> encapsulates the loop prediction logic, adding the newly generated token to the sequence each time until termination conditions are met.
  • Data Flow: Input token IDs are converted into vectors through word embeddings and position embeddings, sent into multiple Transformer layers (attention layer → normalization → feedforward network → normalization), and finally converted into token probability distributions through a linear layer, sampling to obtain the next token ID.
  • Key Details: Use the <span>concatenate</span> function to implement KV cache concatenation, reusing allocated memory; activation functions (like GELU) and normalization operations use inline functions to reduce function call overhead.

4. Hardware Adaptation Layer: The “Accelerator” for Performance

The upper limit of inference speed is determined by hardware. The core of this module is to enable the framework to adapt to different hardware like CPUs and GPUs, maximizing hardware performance utilization.

Core Principle: Hardware Interface Encapsulation

  • CPU Inference: Directly use the C++ standard library and custom matrix operation logic to adapt to ordinary devices, focusing on general compatibility;
  • GPU Inference: Encapsulate matrix operations through CUDA kernel functions, leveraging the parallel computing power of GPUs (thousands of stream processors operating simultaneously), increasing the inference speed of large models by over 10 times.

Code Implementation

  • Conditional Compilation: Use <span>#ifdef CUDA</span> and other macro definitions to switch between CPU/GPU modes, allowing the corresponding operation logic to be selected based on the hardware environment at compile time;
  • Unified Interface: The upper inference logic does not need to concern itself with the underlying hardware, calling unified interfaces like <span>matmul</span> (matrix multiplication), <span>add</span> (matrix addition), etc., with the underlying automatically adapting to CPU or GPU operations.

3. Code Structure: Lightweight and Efficient Project Design

The entire framework’s code structure follows the “single responsibility” principle, with no redundant dependencies, and the core files do not exceed 10, which is also a reflection of its “lightweight” nature:

src/
├── tokenizer.h          // Tokenizer class declaration
├── tokenizer.cpp        // Tokenizer implementation (core of BPE algorithm)
├── model.h              // Model structure definition (including Transformer layers)
├── model.cpp            // Model loading and parameter parsing
├── transformer.h        // Core logic of Transformer decoder
├── transformer.cpp      // Self-attention and feedforward network implementation
├── utils.h              // Utility functions (matrix operations, file operations, etc.)
├── main.cpp             // Main program entry (linking inference process)
└── config.h             // Configuration parameters (model path, inference length, etc.)

This structure strictly adheres to the “single responsibility” principle: each file does only one thing, such as <span>tokenizer</span> only handling text and token conversion, and <span>transformer</span> only responsible for core inference calculations. If you want to modify a certain module (like changing a tokenization algorithm), you can directly edit the corresponding file.

4. Implementation Details of Core Modules

1. Tokenizer (tokenizer.h/cpp): The “Translator” that Converts Text to Numbers

Core Data Structure

// Key structures defined in tokenizer.h
class Tokenizer {
private:
    unordered_map<string, int> vocab;  // Vocabulary: string → token ID (fast lookup with hash table)
    unordered_map<int, string> vocab_inv;  // Reverse vocabulary: token ID → string
    vector<pair<string, int>> merges;  // BPE merging rules (high-frequency substring combinations)
    // Regular expression: pre-compiled matching rules (letters, numbers, symbols, etc.)
    regex pattern;
public:
    bool load(const string&amp; vocab_path, const string&amp; merges_path);  // Load vocabulary and merging rules
    vector<int> encode(const string&amp; text);  // Convert text to token ID
    string decode(const vector<int>&amp; tokens);  // Convert token ID to text
};

Key Implementation: BPE Encoding Process (encode function)

The core of BPE is “first split into characters, then merge according to rules.” The code does it this way:

  1. Text Preprocessing: Use regular expressions to split the text into “atomic units” (for example, “Hello!” splits into [“Hello”, “!”]), avoiding mixing symbols and text.
  2. Split into Character Level: Each unit is further split into individual characters (for example, “Hello” → [“H”, “e”, “l”, “l”, “o”]).
  3. Merge According to Merging Rules: Loop through the <span>merges</span> table, merging the highest frequency combinations found in the current sequence (for example, [“l”, “l”] → [“ll”]), until no more combinations can be merged.
  4. Lookup to Convert to ID: The merged substrings are looked up in the <span>vocab</span> hash table to find the corresponding number, which is the final token ID.

Here, using <span>unordered_map</span> to store the vocabulary is because its lookup complexity is O(1), which is much faster than array traversal—tokenization requires thousands of lookups per second, so speed is crucial.

2. Model Loader (model.h/cpp): The “Mover” that Brings Parameters into the Program

The model files (like GPT-2 weights) are essentially a bunch of binary data, and the <span>model</span> module’s role is to “translate” this data into matrices that the program can directly use.

Core Data Structure

// Model parameter structure defined in model.h
struct ModelParams {
    // Embedding layer parameters (convert token ID to vector)
    Matrix<float> wte;  // Word embedding matrix (vocab_size, embed_dim)
    Matrix<float> wpe;  // Position embedding matrix (max_seq_len, embed_dim)
    
    // Transformer layer parameters (stacked multiple layers)
    vector<TransformerLayerParams> layers;
};

// Parameters for a single Transformer layer
struct TransformerLayerParams {
    // Weights related to self-attention
    Matrix<float> attn_c_attn_w;  // Attention query/key/value weight merge matrix
    Vector<float> attn_c_attn_b;  // Bias
    Matrix<float> attn_c_proj_w;  // Attention output projection weights
    Vector<float> attn_c_proj_b;  // Bias
    
    // Weights related to feedforward network
    Matrix<float> mlp_c_fc_w;     // Feedforward first layer weights
    Vector<float> mlp_c_fc_b;     // Bias
    Matrix<float> mlp_c_proj_w;   // Feedforward output projection weights
    Vector<float> mlp_c_proj_b;   // Bias
    
    Vector<float> ln_1_g;         // Gamma for first layer normalization
    Vector<float> ln_1_b;         // Beta for first layer normalization
    Vector<float> ln_2_g;         // Gamma for second layer normalization
    Vector<float> ln_2_b;         // Beta for second layer normalization
};

Key Implementation: Loading Weights (load function)

  1. Read Configuration File: First, obtain the basic information of the model (like vocabulary size, hidden layer dimensions, number of layers) from the JSON configuration to know how many parameters to load.
  2. Binary File Parsing: The model weight file (usually in .bin format) stores parameters in a fixed order. The program reads them in the order of “embedding layer → first Transformer layer → second Transformer layer → …” and maps them to the corresponding matrices in <span>ModelParams</span>.
  3. Data Type Handling: If the weights are FP16 (half precision), they will be converted to FP32 and stored in <span>Matrix<float></span> (C++ natively supports float, making it easier to handle); FP16 can also be retained to save memory, in which case <span>Matrix<half></span> will be used (requiring compiler support).

Here, a custom <span>Matrix</span> class encapsulates matrix operations, which actually uses a <span>vector</span> to store data, adding row/column properties and basic operations (matrix multiplication, addition), making it more flexible than directly using two-dimensional arrays.

3. Transformer Decoder (transformer.h/cpp): The “Core Engine” of Inference

This part is the most complex in the entire framework, responsible for implementing the logic of “predicting the next token based on context,” with the core being the self-attention mechanism and KV cache.

Key Function: Self-Attention Calculation (scaled_dot_product_attention)

The role of self-attention is to “let each word focus on relevant words in the context.” The code calculates it this way:

// Simplified self-attention implementation
Matrix<float> scaled_dot_product_attention(
    const Matrix<float>&amp; q,  // Query matrix (batch, seq_len, head_dim)
    const Matrix<float>&amp; k,  // Key matrix (batch, seq_len, head_dim)
    const Matrix<float>&amp; v,  // Value matrix (batch, seq_len, head_dim)
    Matrix<float>* kv_cache_k,  // K cache (optional, for speeding up inference)
    Matrix<float>* kv_cache_v   // V cache (optional, for speeding up inference)
) {
    int batch = q.rows();
    int seq_len = q.cols();
    int head_dim = q.depth();  // Assuming Matrix supports three dimensions (batch, seq, dim)
    
    // Calculate attention scores: Q*K^T / sqrt(head_dim)
    Matrix<float> attn_scores = matmul(q, k.transpose(1, 2)) / sqrt(head_dim);
    
    // Mask: Ensure the current word can only see previous words (to avoid future information leakage)
    attn_scores = apply_causal_mask(attn_scores);
    
    // Convert to probabilities (softmax)
    Matrix<float> attn_probs = softmax(attn_scores);
    
    // Weighted sum: Attention probabilities * V
    Matrix<float> output = matmul(attn_probs, v);
    
    // Update KV cache (used during inference, storing current K and V to avoid recalculating)
    if (kv_cache_k != nullptr &amp;&amp; kv_cache_v != nullptr) {
        *kv_cache_k = concatenate(*kv_cache_k, k, 1);  // Concatenate along the sequence dimension
        *kv_cache_v = concatenate(*kv_cache_v, v, 1);
    }
    
    return output;
}

Key Optimization: Implementation of KV Cache

During inference, each time a new token is generated, the preceding sequence remains unchanged. If K and V for all words are recalculated each time, a lot of redundant work is done. The KV cache stores previously calculated K and V, only calculating K and V for the new token, and then concatenating with the cache:

  • When inputting “the capital of France,” calculate K and V for the three tokens and store them in the cache.
  • When generating “the capital of France,” only calculate K and V for “the capital of France,” concatenate them with the cache to form K and V for four tokens, and then calculate attention.
  • This reduces the computational load from O(n²) to O(n) (where n is the current sequence length), significantly improving speed.

4. Linking Inference Process (main.cpp): Tying All Modules Together

The main program logic is straightforward: it sequentially calls each module:

int main() {
    // 1. Initialize configuration
    Config config = load_config("config.json");
    
    // 2. Load tokenizer
    Tokenizer tokenizer;
    tokenizer.load(config.vocab_path, config.merges_path);
    
    // 3. Load model parameters
    Model model;
    model.load(config.model_path);
    
    // 4. Encode input text
    string input_text = "The capital of France is";
    vector<int> input_tokens = tokenizer.encode(input_text);
    
    // 5. Generate inference
    Transformer transformer(model.params);
    vector<int> output_tokens = transformer.generate(
        input_tokens,
        config.max_length,  // Maximum generation length
        config.temperature  // Randomness control (0 = deterministic, 1 = random)
    );
    
    // 6. Decode output results
    string output_text = tokenizer.decode(output_tokens);
    cout << output_text << endl;  // Output: The capital of France is Paris...
    
    return 0;
}

The core of the generate function is “loop prediction”: each time the current token sequence is input into the model, the probability distribution for the next token is obtained, and a token is sampled and added to the sequence until the maximum length is reached or an end symbol is encountered.

Building and Running

mkdir build
cmake -B ./build -DCMAKE_BUILD_TYPE=Release
cmake --build ./build --config Release

This will generate an executable file and copy resources to the demo/bin directory, allowing you to run the demo:

cd demo/bin
./TinyGPT_demo
[INFO] Load model ...
[INFO] Load model done.
[INFO] Generated Outputs:
[INFO] ------------------------------------------------------------
[INFO] Prompt:    'Hello, my name is'
[INFO] Output:    ' Max! I am Phelan and I'm the world's greatest magician! I am the world's greatest magician! You are the world's greatest magician! You'
[INFO] ------------------------------------------------------------
[INFO] Prompt:    'The president of the United States is'
[INFO] Output:    ' on a temporary trip to Asia, and the Pentagon has made several announcements about what's next for the war on terror.

The next day, General Martin Dempsey'
[INFO] ------------------------------------------------------------
[INFO] Prompt:    'The capital of France is'
[INFO] Output:    ' located in the eastern part of the country, so it is very easy to find houses in this part of the country. The most important houses are in Paris, and'
[INFO] ------------------------------------------------------------
[INFO] Prompt:    'The future of AI is'
[INFO] Output:    ' forever. Our time is now.


Sequel to the game, The Mighty Ducks is available on Android and iOS, and a new Android app is also coming'
[INFO] ------------------------------------------------------------
[INFO] Time cost: 1907 ms, speed: 83.90 token/s

If you need the complete source code, please add the WeChat number (c17865354792)

5. The “Efficiency Code” in Technical Details

This framework can run on ordinary devices due to these detail optimizations:

  1. Memory Reuse: During matrix operations, try to reuse already allocated memory (for example, the KV cache directly concatenates on the original matrix), avoiding performance loss from frequent new/delete operations.
  2. Avoid Redundant Calculations: Layer normalization (LayerNorm), activation functions (GELU), and other operations are implemented using inline functions to reduce function call overhead; matrix multiplication uses loop unrolling optimization to make CPU pipelines more efficient.
  3. Lightweight Dependencies: No heavy frameworks like PyTorch/TensorFlow are used; matrix operations are implemented independently (or using the lightweight Eigen library), vocabulary parsing uses C++ standard library’s regex and unordered_map, resulting in a very small compiled program size (a few MB).
  4. Conditional Compilation for Hardware Acceleration: By using <span>#ifdef CUDA</span> macro definitions, GPU acceleration interfaces are reserved—replacing matrix operations with CUDA kernel functions can run on NVIDIA graphics cards, increasing speed by over 10 times.

Conclusion: Core Ideas Behind Code Implementation

The implementation of this C++ lightweight LLM inference framework is essentially about “doing the most critical things with the simplest code”:

  • Only retain the modules necessary for inference (tokenization, loading, Transformer, generation), cutting out redundant logic related to training;
  • Prioritize “fast” data structures (hash tables for vocabulary, vectors for matrices) and “efficient” operations (KV cache, memory reuse);
  • Keep interfaces between modules as simple as possible (inputs and outputs are vectors or matrices), making it easy to later modify into a more complex framework.

If you want to make some modifications, you can start by trying these two directions: one is to add quantization to self-attention (like INT8) to further reduce memory usage; the other is to improve the GPU acceleration part to enable large models to run. The code structure is clear, making it easy to modify.

Welcome to follow the WeChat official account 【Programmer Coding

Leave a Comment