Exploring the Essence of LoRA: A Low-Rank Projection of Full Gradients


Article Title:
FLORA: Low-Rank Adapters Are Secretly Gradient Compressors

Article Link:

https://arxiv.org/pdf/2402.03293

This paper not only introduces a new memory-efficient, high-rank fine-tuning algorithm but also offers a deep interpretation of the essence of LoRA: LoRA is a low-rank projection of the full gradients. This perspective is not really surprising; traces of it appear in many earlier papers. For example, LoRA-FA proposes not fine-tuning the A matrix in LoRA and only fine-tuning the B matrix; LoRA+ suggests using a larger learning rate for the B matrix; ReLoRA/PLoRA/CoLA and others propose periodically merging the LoRA weights and reinitializing them. The reason all of these work is captured by the title of this paper: low-rank adapters are secretly gradient compressors.

The method proposed in this paper is not actually superior to LoRA, because the real reason LoRA saves GPU memory is that, under the Adam optimizer, it avoids maintaining the first and second moments of the full weights (both are typically kept in fp32, which consumes a lot of memory). The method proposed here is better understood as an optimization of gradient accumulation: it avoids the extra memory needed to accumulate full-size gradients across micro-steps. When training with Adam, FLoRA therefore does not really save memory. Nevertheless, the paper's analysis of LoRA is brilliant.
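To make the memory argument concrete, here is a rough back-of-the-envelope sketch (my own illustration, not a measurement from the paper): it compares the Adam optimizer-state memory for full fine-tuning of a hypothetical 7B-parameter model against that of a hypothetical ~20M-parameter LoRA adapter, assuming fp32 first and second moments.

```python
# Rough Adam optimizer-state accounting (illustrative assumptions only).
BYTES_FP32 = 4

def adam_state_bytes(num_trainable_params: int) -> int:
    """Adam keeps one first moment and one second moment per trainable parameter."""
    return 2 * num_trainable_params * BYTES_FP32

full_params = 7_000_000_000   # hypothetical 7B-parameter model, all weights trainable
lora_params = 20_000_000      # hypothetical LoRA adapter (only A and B are trainable)

print(f"full fine-tuning Adam states: {adam_state_bytes(full_params) / 2**30:6.1f} GiB")
print(f"LoRA Adam states:             {adam_state_bytes(lora_params) / 2**30:6.2f} GiB")
```

Under this accounting the optimizer states dominate, which is the basis of the author's point above.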


The B Matrix in LoRA Dominates Weight Updates

1.1 Gradients in LoRA

First, we define the following symbols: the weight matrix $W \in \mathbb{R}^{m \times n}$, the matrix $A \in \mathbb{R}^{r \times n}$, and the matrix $B \in \mathbb{R}^{m \times r}$, where the LoRA rank satisfies $r \ll \min(m, n)$.
Then we have the formula for LoRA:

$$W' = W + BA$$

Taking the derivative of the loss $\mathcal{L}$ with respect to the two adapter matrices:

$$\frac{\partial \mathcal{L}}{\partial B} = \frac{\partial \mathcal{L}}{\partial W} A^\top, \qquad \frac{\partial \mathcal{L}}{\partial A} = B^\top \frac{\partial \mathcal{L}}{\partial W}$$
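As a quick sanity check of these two identities, here is a small PyTorch sketch (my own illustration, not code from the paper): autograd's gradients for $A$ and $B$ are compared against the corresponding projections of the full-weight gradient $\partial\mathcal{L}/\partial W$ on a toy squared-error loss.

```python
import torch

torch.manual_seed(0)
m, n, r, batch = 8, 6, 2, 4

# W requires grad only so that autograd also hands us dL/dW for comparison;
# B is random here (LoRA initializes it to zero) so the dL/dA check is non-trivial.
W = torch.randn(m, n, requires_grad=True)
A = torch.randn(r, n, requires_grad=True)
B = torch.randn(m, r, requires_grad=True)

x = torch.randn(n, batch)
y = torch.randn(m, batch)

loss = ((W + B @ A) @ x - y).pow(2).sum()
loss.backward()

G = W.grad  # full-weight gradient dL/dW, shape (m, n)

print(torch.allclose(B.grad, G @ A.T, atol=1e-5))  # dL/dB = (dL/dW) A^T
print(torch.allclose(A.grad, B.T @ G, atol=1e-5))  # dL/dA = B^T (dL/dW)
```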

1.2 Amount of Weight Updates
After obtaining the gradients, the parameter update at step $t+1$ is (writing $G_t := \partial\mathcal{L}/\partial W$ for the full-weight gradient at step $t$):

$$A_{t+1} = A_t - \eta \frac{\partial \mathcal{L}}{\partial A_t} = A_t - \eta\, B_t^\top G_t$$

where $\eta$ is the learning rate; the update for matrix $B$ is analogous, $B_{t+1} = B_t - \eta\, G_t A_t^\top$. Since matrix $B$ is initialized to all zeros and matrix $A$ is initialized from a normal distribution (in the peft library it is Kaiming initialization, which differs from the original LoRA), for any step $T$ during the training process we have:

$$A_T = A_0 + \eta\, f_T(A_0), \qquad B_T = \eta\, g_T(A_0)$$

where $f_T(\cdot)$ and $g_T(\cdot)$ are defined precisely in the appendix. In other words, at any step $T$ of training, the values of $A$ and $B$ are determined by projections through the matrix $A_0$ that was initialized from the normal distribution. With these expressions for $A_T$ and $B_T$, we can derive that for any step $T$ during the training process:

$$\Delta W_T = W_T - W_0 = B_T A_T - B_0 A_0$$

Since $B_0$ is a zero matrix, we have:

$$W_T = W_0 + B_T A_T$$

Substituting the values of $A_T$ and $B_T$:

$$W_T = W_0 + \eta\, g_T(A_0)\, A_0 + \eta^2\, g_T(A_0)\, f_T(A_0)$$

The third term is of order $\eta^2$ and can be ignored, indicating that when training with LoRA, freezing matrix $A$ can achieve a similar effect (this is the idea behind LoRA-FA, and also behind LoRA+: since matrix $B$ dominates the parameter updates, it warrants a larger learning rate). When we freeze matrix $A$ at its initialization $A_0$, we have:

$$\Delta W_T = B_T A_0 = -\eta \sum_{t=0}^{T-1} G_t\, A_0^\top A_0$$

In other words, LoRA-FA essentially updates the weight $W$ by passing the full gradient of $W$ through a one-time random down-projection ($A_0^\top$) and back up ($A_0$). Since LoRA behaves very similarly to LoRA-FA, we can consider this viewpoint to be largely valid for LoRA as well.
▲ Comparison of LoRA-FA and LoRA effects, showing very similar performance
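The claim is easy to verify numerically. Below is a small NumPy sketch (my own toy experiment, not from the paper): it trains a LoRA-FA adapter with plain SGD on a random least-squares problem and checks that the resulting weight change $B_T A_0$ coincides with the projected sum of full-weight gradients $-\eta\,(\sum_t G_t)\,A_0^\top A_0$ collected along the same trajectory.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r, T, lr = 8, 6, 2, 100, 1e-3

# Toy least-squares objective: L = || (W0 + B A0) X - Y ||^2
W0 = rng.standard_normal((m, n))
X = rng.standard_normal((n, 8))
Y = rng.standard_normal((m, 8))

A0 = rng.standard_normal((r, n))   # frozen down-projection (LoRA-FA keeps A fixed)
B = np.zeros((m, r))               # trainable, initialized to zero as in LoRA

G_sum = np.zeros((m, n))           # running sum of full-weight gradients G_t
for _ in range(T):
    W_eff = W0 + B @ A0
    G = 2 * (W_eff @ X - Y) @ X.T  # full-weight gradient dL/dW at this step
    G_sum += G
    B -= lr * (G @ A0.T)           # LoRA-FA update: dL/dB = G A0^T

delta_W = B @ A0                       # total weight change produced by LoRA-FA
compressed = -lr * G_sum @ A0.T @ A0   # down-then-up projected gradient sum

print(np.allclose(delta_W, compressed))  # True up to floating-point error
```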


FLoRA Algorithm

Based on the above insights, the author team proposed a new training algorithm:

▲ This algorithm is essentially a variant of gradient accumulation. Given T gradient accumulation steps, the gradient of parameter W is computed at each of the T steps and immediately compressed by a randomly initialized down-projection matrix; only these compressed gradients are accumulated, and the accumulated result is then used to apply a low-rank update to the parameters.
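A minimal sketch of this idea in NumPy (an illustration of compressed gradient accumulation as described above, not the paper's exact algorithm; the toy objective, the shapes, and the choice to resample the projection once per accumulation cycle are all assumptions of this sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r, lr = 16, 12, 4, 1e-2

W = rng.standard_normal((m, n))          # parameter being trained
W_target = rng.standard_normal((m, n))   # toy objective: pull W toward W_target

def micro_batch_grad(W):
    """Stand-in for a per-micro-batch gradient dL/dW (noisy pull toward W_target)."""
    return (W - W_target) + 0.1 * rng.standard_normal(W.shape)

def compressed_accumulation_cycle(W, T):
    """One accumulation cycle: keep only an (m x r) compressed accumulator
    instead of a full (m x n) gradient buffer, then decompress once at the end."""
    A = rng.standard_normal((r, n)) / np.sqrt(r)  # fresh random down-projection
    C = np.zeros((m, r))                          # compressed gradient accumulator
    for _ in range(T):
        C += micro_batch_grad(W) @ A.T            # down-project and accumulate
    return W - lr * (C @ A)                       # single low-rank update to W

for _ in range(200):
    W = compressed_accumulation_cycle(W, T=4)

print(np.linalg.norm(W - W_target))  # far smaller than at initialization
```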

For optimizers using momentum mechanisms like Adam:
▲ The above algorithm can achieve low-rank accumulation of momentum
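Again as an illustration only (not the paper's exact procedure), here is a NumPy sketch of keeping the momentum itself in the compressed $(m \times r)$ space; the periodic projection resampling and the simple decompress-then-recompress transfer of the old momentum are assumptions of this sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, r, lr, beta = 16, 12, 4, 1e-2, 0.9
resample_every = 20                     # periodically draw a fresh projection

W = rng.standard_normal((m, n))
W_target = rng.standard_normal((m, n))  # toy objective: pull W toward W_target

A = rng.standard_normal((r, n)) / np.sqrt(r)
M = np.zeros((m, r))                    # momentum stored in compressed (m x r) form

for step in range(2000):
    if step > 0 and step % resample_every == 0:
        A_new = rng.standard_normal((r, n)) / np.sqrt(r)
        M = M @ A @ A_new.T             # re-express the old momentum under the new projection
        A = A_new
    G = (W - W_target) + 0.1 * rng.standard_normal((m, n))  # toy full-weight gradient
    M = beta * M + G @ A.T              # accumulate momentum in the compressed space
    W -= lr * (M @ A)                   # decompress only when applying the update

print(np.linalg.norm(W - W_target))     # far smaller than at initialization
```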


Appendix

Here, the authors use mathematical induction to prove:

$$A_T = A_0 + \eta\, f_T(A_0), \qquad B_T = \eta\, g_T(A_0)$$

And during the proof, the recursive forms of $f_T(A_0)$ and $g_T(A_0)$ are obtained (with $f_0 = g_0 = 0$ and $G_t$ the full-weight gradient at step $t$):

$$f_{t+1}(A_0) = f_t(A_0) - \eta\, g_t(A_0)^\top G_t$$

$$g_{t+1}(A_0) = g_t(A_0) - G_t\,\big(A_0 + \eta\, f_t(A_0)\big)^\top$$
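To first order in $\eta$, these recursions give $g_T(A_0) \approx -\sum_{t=0}^{T-1} G_t\, A_0^\top$, so the dominant term of the weight update, $\eta\, g_T(A_0)\, A_0$, reduces to $-\eta \sum_t G_t\, A_0^\top A_0$: the same projected gradient sum derived for LoRA-FA in Section 1.2.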
