How to Code LoRA from Scratch: A Comprehensive Guide

Excerpt from lightning.ai

Author: Sebastian Raschka

Compiled by Machine Heart

Editor: Chen Ping

The author states: Among various effective LLM fine-tuning methods, LoRA remains his top choice.

LoRA (Low-Rank Adaptation) is a popular technique for fine-tuning LLMs (large language models), first proposed by researchers at Microsoft in the paper "LoRA: Low-Rank Adaptation of Large Language Models". Instead of updating all of a network's parameters, LoRA trains a small set of low-rank matrices, significantly reducing the computational cost of fine-tuning.
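To see where the savings come from: for a single weight matrix, LoRA learns two much smaller low-rank factors A and B instead of a full update. A quick back-of-the-envelope calculation (the dimensions here are chosen for illustration, not taken from the paper):

```python
# Parameter count for updating one 768x768 weight matrix directly,
# versus a rank-8 LoRA update (illustrative numbers).
d_in, d_out, rank = 768, 768, 8

full_update_params = d_in * d_out          # updating W itself
lora_params = d_in * rank + rank * d_out   # A (d_in x r) plus B (r x d_out)

print(full_update_params)  # 589824
print(lora_params)         # 12288
print(full_update_params / lora_params)  # 48.0
```

Even at this modest size, the low-rank update has roughly 48 times fewer trainable parameters than the full matrix.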

Because LoRA's fine-tuning quality is comparable to full-model fine-tuning, many people describe the method as a fine-tuning "magic trick". Since its release, many have wanted to write the code from scratch to understand the research better, but a suitable walkthrough was lacking. Now the tutorial has arrived.

This tutorial is authored by renowned machine learning and AI researcher Sebastian Raschka, who says that among the various effective LLM fine-tuning methods, LoRA is still his top choice. To that end, he wrote the blog post "Code LoRA From Scratch," building LoRA from the ground up, which he considers a great way to learn the technique.


In simple terms, this article introduces Low-Rank Adaptation (LoRA) by writing code from scratch. In the experiment, Sebastian fine-tuned the DistilBERT model for a classification task.

Comparing LoRA with traditional fine-tuning, the accuracy achieved using LoRA on the test set reached 92.39%, outperforming fine-tuning only the model's last few layers (86.22% test accuracy).

How did Sebastian achieve this? Let’s continue reading.

Writing LoRA from Scratch

In code, a LoRA layer is a small module that holds two low-rank matrices, A and B, together with a scaling factor.
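A minimal PyTorch sketch of such a layer, consistent with the article's description (treat it as an illustration rather than the article's verbatim code):

```python
import torch
import torch.nn as nn

class LoRALayer(nn.Module):
    """Low-rank update: x -> alpha * (x @ A @ B)."""
    def __init__(self, in_dim, out_dim, rank, alpha):
        super().__init__()
        std_dev = 1 / torch.sqrt(torch.tensor(rank).float())
        # A: small random values; B: zeros, so the update starts at zero
        self.A = nn.Parameter(torch.randn(in_dim, rank) * std_dev)
        self.B = nn.Parameter(torch.zeros(rank, out_dim))
        self.alpha = alpha

    def forward(self, x):
        return self.alpha * (x @ self.A @ self.B)
```

Because B is initialized to zero, the layer contributes nothing at first, so training starts exactly from the pretrained model's behavior.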

Here, in_dim is the input dimension of the layer to be modified by LoRA, and out_dim is the corresponding output dimension. The code also adds a scaling hyperparameter, alpha: a higher alpha means a larger adjustment to the model's behavior, while a lower value means a smaller one. Matrix A is initialized with small values drawn from a random distribution, and matrix B is initialized to zero.

It is worth mentioning that LoRA is usually applied to a neural network's linear (feed-forward) layers. Consider, for example, a simple PyTorch module with two linear layers, such as the feed-forward module of a Transformer block.
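For concreteness, a module of that shape might look like the following (a hypothetical example; the layer names lin1 and lin2 are chosen for illustration):

```python
import torch
import torch.nn as nn

class MultilayerPerceptron(nn.Module):
    """Two linear layers, e.g. a Transformer block's feed-forward module."""
    def __init__(self, num_features, num_hidden, num_classes):
        super().__init__()
        self.lin1 = nn.Linear(num_features, num_hidden)
        self.lin2 = nn.Linear(num_hidden, num_classes)

    def forward(self, x):
        x = self.lin1(x)
        x = torch.relu(x)
        x = self.lin2(x)
        return x
```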

When using LoRA, the low-rank updates are added to the outputs of these linear layers.
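Adding the updates could look like the sketch below, which pairs each linear layer with its own low-rank module (names are illustrative and the class is restated so the example is self-contained; this is not the article's verbatim code):

```python
import torch
import torch.nn as nn

class LoRALayer(nn.Module):
    def __init__(self, in_dim, out_dim, rank, alpha):
        super().__init__()
        self.A = nn.Parameter(torch.randn(in_dim, rank) * (rank ** -0.5))
        self.B = nn.Parameter(torch.zeros(rank, out_dim))
        self.alpha = alpha

    def forward(self, x):
        return self.alpha * (x @ self.A @ self.B)

class MultilayerPerceptronWithLoRA(nn.Module):
    def __init__(self, num_features, num_hidden, num_classes, rank=4, alpha=1.0):
        super().__init__()
        self.lin1 = nn.Linear(num_features, num_hidden)
        self.lin2 = nn.Linear(num_hidden, num_classes)
        self.lora1 = LoRALayer(num_features, num_hidden, rank, alpha)
        self.lora2 = LoRALayer(num_hidden, num_classes, rank, alpha)

    def forward(self, x):
        # the LoRA update is added to each linear layer's output
        x = self.lin1(x) + self.lora1(x)
        x = torch.relu(x)
        x = self.lin2(x) + self.lora2(x)
        return x
```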

If you want to implement LoRA by modifying an existing PyTorch model, a simple method is to replace each linear layer with a LinearWithLoRA layer that wraps it.
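A sketch of such a wrapper, assuming the LoRALayer design described in the text (not the article's verbatim code):

```python
import torch
import torch.nn as nn

class LoRALayer(nn.Module):
    def __init__(self, in_dim, out_dim, rank, alpha):
        super().__init__()
        self.A = nn.Parameter(torch.randn(in_dim, rank) * (rank ** -0.5))
        self.B = nn.Parameter(torch.zeros(rank, out_dim))
        self.alpha = alpha

    def forward(self, x):
        return self.alpha * (x @ self.A @ self.B)

class LinearWithLoRA(nn.Module):
    """Wraps an existing nn.Linear and adds a LoRA update to its output."""
    def __init__(self, linear, rank, alpha):
        super().__init__()
        self.linear = linear
        self.lora = LoRALayer(linear.in_features, linear.out_features, rank, alpha)

    def forward(self, x):
        return self.linear(x) + self.lora(x)

# Swapping a layer in place:
layer = nn.Linear(128, 64)
layer = LinearWithLoRA(layer, rank=8, alpha=16)
```

Since B starts at zero, the wrapped layer initially computes exactly what the original linear layer did.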

(These concepts are summarized in a diagram in the original post.)

To apply LoRA, this article replaces the existing linear layers in the neural network with a LinearWithLoRA layer that combines the original linear layer and the LoRALayer.

How to Get Started with LoRA Fine-Tuning

LoRA can be applied to models such as GPT, as well as to image-generation models. For simplicity, this article uses a small BERT model, DistilBERT, for text classification.


Since this article only trains the new LoRA weights, all pretrained model parameters are frozen by setting their requires_grad attribute to False.
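The freezing step itself is a short loop. It is shown here on a hypothetical stand-in model rather than DistilBERT, so the snippet is self-contained:

```python
import torch.nn as nn

# Stand-in for the pretrained model (the article uses DistilBERT)
model = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 2))

# Freeze every pretrained parameter; only LoRA weights added later will train
for param in model.parameters():
    param.requires_grad = False

print(sum(p.requires_grad for p in model.parameters()))  # 0
```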

Next, print(model) reveals the model's structure. The output shows that the model consists of 6 transformer layers, each containing linear layers (the attention projections and the feed-forward layers).

Additionally, the model has two linear output layers that form the classification head.
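One way to locate all the linear layers that are candidates for LoRA is to walk the module tree. The snippet below is a sketch on a hypothetical stand-in model, not the article's code; running the same loop on DistilBERT would list its attention and feed-forward projections along with the two output layers:

```python
import torch.nn as nn

# Stand-in model; the article inspects DistilBERT the same way
model = nn.Sequential(
    nn.Linear(768, 768),
    nn.ReLU(),
    nn.Linear(768, 2),
)

linear_names = [name for name, module in model.named_modules()
                if isinstance(module, nn.Linear)]
print(linear_names)  # ['0', '2']
```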

By defining a small assignment function and looping over the model's layers, you can selectively enable LoRA for these linear layers.
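The article loops over DistilBERT's specific submodules; the sketch below implements the same idea generically, recursively swapping every nn.Linear in a hypothetical stand-in model. The helper name replace_linear_with_lora is my own, not from the original:

```python
import torch
import torch.nn as nn

class LoRALayer(nn.Module):
    def __init__(self, in_dim, out_dim, rank, alpha):
        super().__init__()
        self.A = nn.Parameter(torch.randn(in_dim, rank) * (rank ** -0.5))
        self.B = nn.Parameter(torch.zeros(rank, out_dim))
        self.alpha = alpha

    def forward(self, x):
        return self.alpha * (x @ self.A @ self.B)

class LinearWithLoRA(nn.Module):
    def __init__(self, linear, rank, alpha):
        super().__init__()
        self.linear = linear
        self.lora = LoRALayer(linear.in_features, linear.out_features, rank, alpha)

    def forward(self, x):
        return self.linear(x) + self.lora(x)

def replace_linear_with_lora(module, rank, alpha):
    """Recursively swap every nn.Linear child for a LinearWithLoRA wrapper."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, LinearWithLoRA(child, rank, alpha))
        else:
            replace_linear_with_lora(child, rank, alpha)

# Demonstration on a stand-in model
model = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 2))
replace_linear_with_lora(model, rank=8, alpha=16)
```

A selective version would check the layer's name (e.g. only the query and value projections) before swapping.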

Printing the model again confirms that the linear layers have been successfully replaced by LinearWithLoRA layers.

If the model is trained using LoRA's default hyperparameter settings, it yields the following performance on the IMDb movie review classification dataset:

  • Training Accuracy: 92.15%

  • Validation Accuracy: 89.98%

  • Test Accuracy: 89.44%

In the next section, this article compares these LoRA fine-tuning results with traditional fine-tuning results.

Comparison with Traditional Fine-Tuning Methods

In the previous section, LoRA achieved a test accuracy of 89.44% under default settings. How does this compare to traditional fine-tuning methods?

To make the comparison, this article conducted another experiment with the DistilBERT model, this time updating only the last 2 layers during training. This is achieved by freezing all model weights and then unfreezing the two linear output layers.
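The freeze-then-unfreeze step could look like the sketch below. The model here is a hypothetical stand-in; the attribute names body, pre_classifier, and classifier are illustrative (in DistilBERT's classification head, the last two layers play this role):

```python
import torch
import torch.nn as nn

class Classifier(nn.Module):
    """Stand-in: a frozen body plus two trainable linear output layers."""
    def __init__(self):
        super().__init__()
        self.body = nn.Linear(768, 768)
        self.pre_classifier = nn.Linear(768, 768)
        self.classifier = nn.Linear(768, 2)

    def forward(self, x):
        x = torch.relu(self.body(x))
        x = torch.relu(self.pre_classifier(x))
        return self.classifier(x)

model = Classifier()

# Freeze everything, then unfreeze only the two output layers
for param in model.parameters():
    param.requires_grad = False
for param in model.pre_classifier.parameters():
    param.requires_grad = True
for param in model.classifier.parameters():
    param.requires_grad = True
```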

The classification performance obtained by training only the last two layers is as follows:

  • Training Accuracy: 86.68%

  • Validation Accuracy: 87.26%

  • Test Accuracy: 86.22%

The results show that LoRA outperforms the traditional approach of fine-tuning only the last two layers, while using roughly 4 times fewer trainable parameters. Fine-tuning all layers would require updating about 450 times more parameters than the LoRA setup, yet improves test accuracy by only about 2%.

Optimizing LoRA Configuration

The results mentioned earlier were all obtained with LoRA under its default hyperparameter settings.

Users who want to experiment can try different hyperparameter configurations, for example varying the rank, alpha, and which layers receive LoRA; the original post gives the exact command.

The best-performing hyperparameter configuration found in the article (its exact values are listed in the original post) produced the following results:

  • Validation Accuracy: 92.96%

  • Test Accuracy: 92.39%

It is worth noting that even though the LoRA setup has only a small number of trainable parameters (roughly 500k, versus the model's 66M), its accuracy is still slightly higher than that obtained through full fine-tuning.

Original link: https://lightning.ai/lightning-ai/studios/code-lora-from-scratch?continueFlag=f5fc72b1f6eeeaf74b648b2aa8aaf8b6

