I have recently compiled a simple, easy-to-understand guide to the GPU memory requirements of LoRA and QLoRA fine-tuning, which can help you estimate how much memory you will need. Below, we explain step by step, assuming minimal background knowledge.

1. What are LoRA and QLoRA?
- LoRA (Low-Rank Adaptation): A method that lets a large model adjust only a small number of parameters during fine-tuning. In simple terms, you do not update all of the model's weights, only a small add-on (usually a pair of low-rank matrices). This significantly reduces the amount of data that must be stored and computed, saving GPU memory.
- QLoRA: QLoRA builds on LoRA by adding quantization, representing the base model's weights at lower bit widths (such as 4-bit or 8-bit). After quantization the model occupies less GPU memory, at the cost of a possible slight loss in accuracy. Overall, QLoRA is well suited to resource-constrained devices.
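To make the LoRA idea concrete, here is a minimal, self-contained PyTorch sketch (not the full method from the paper; real implementations add dropout, weight merging, etc., and the class name and defaults here are my own for illustration):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update:
    y = base(x) + x @ A^T @ B^T * (alpha / r)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the original weights
        # Low-rank factors: A is (r, in_features), B is (out_features, r).
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no effect at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus low-rank correction; only A and B receive gradients.
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scale
```

The trainable parameters here number r × (in_features + out_features), matching the r × (M + N) count used in section 3 below.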
2. Basic Components of GPU Memory Requirements
The GPU memory used during fine-tuning mainly goes to the following parts:
- Model Parameters: All the model's weights. For LoRA, we only need to store the additional low-rank matrices; the original model parameters remain unchanged (frozen).
- Activations: The intermediate results produced by each layer during computation. This memory grows with batch size and sequence length.
- Optimizer State: Optimizers like Adam save gradient and momentum information. With LoRA, only a small subset of parameters is trained, so this memory shrinks dramatically (a helper to check the trainable fraction follows this list).
- Other Overheads: Caches, temporary variables, etc., which usually occupy relatively little memory.
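To see how few parameters actually receive gradients (and therefore optimizer state) in a LoRA setup, a generic helper like the one below works with any PyTorch module, such as the `LoRALinear` sketch from section 1 (the function name is mine):

```python
import torch.nn as nn

def count_trainable_params(model: nn.Module) -> None:
    """Print how many parameters are trainable vs. frozen."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable: {trainable:,} / total: {total:,} "
          f"({100.0 * trainable / total:.2f}%)")

# Example with the LoRALinear sketch from section 1:
# count_trainable_params(LoRALinear(nn.Linear(4096, 4096), r=8))
# -> trainable: 65,536 / total: 16,846,848 (0.39%)
```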
3. How to Estimate GPU Memory Requirements Simply
Here is a simplified estimation method to give you a rough sense of how much GPU memory is needed (a code sketch implementing it follows this list):
- Calculate the Memory Occupied by Model Parameters:
  - The original model (the frozen part) can usually be quantized (as in QLoRA), significantly reducing memory usage.
  - The additional parameters introduced by LoRA: if the original weight matrix has size M × N and rank r is used for the low-rank decomposition, the additional parameters number approximately r × (M + N).
- Activations and Temporary Computation Cache:
  - The memory required for activations is roughly proportional to batch size, sequence length, and the number of network layers. Approximate values can be obtained through experiments or from model-specific guidelines.
- Optimizer State:
  - If only the LoRA part is trained, optimizer state is allocated only for those parameters, usually 2-3 times the memory of the additional parameters (for example, Adam saves first and second moments).
- Impact of Quantization (QLoRA):
  - If the model uses a 4-bit or 8-bit representation, each parameter occupies fewer bits, reducing memory accordingly. For example, an 8-bit parameter occupies 1 byte; compared with 16-bit (2 bytes) or 32-bit (4 bytes) full precision, that is a 50% or 75% reduction, respectively.
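The recipe above can be captured in a few lines of Python. This is a back-of-the-envelope sketch under the same assumptions as this article (decimal GB, LoRA adapters kept in FP32, Adam-style optimizer state at 2× the adapter size, a flat activation reserve); the function name and defaults are mine:

```python
def estimate_finetune_memory_gb(
    n_params: float,                    # base-model parameter count, e.g. 7e9
    base_bytes_per_param: float,        # 4.0 for FP32, 2.0 for FP16, 0.5 for 4-bit
    trainable_fraction: float = 0.01,   # fraction of params added as LoRA adapters
    lora_bytes_per_param: float = 4.0,  # adapters usually stay in FP32
    optimizer_multiplier: float = 2.0,  # Adam: first + second moments
    activation_reserve_gb: float = 2.0, # flat reserve; grows with batch/seq length
) -> float:
    """Rough GPU-memory estimate (decimal GB) for LoRA/QLoRA fine-tuning."""
    GB = 1e9
    base_gb = n_params * base_bytes_per_param / GB
    lora_gb = n_params * trainable_fraction * lora_bytes_per_param / GB
    optimizer_gb = lora_gb * optimizer_multiplier
    return base_gb + lora_gb + optimizer_gb + activation_reserve_gb
```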
4. A Simple Example
Below, we calculate two scenarios using a 7B model as an example: fine-tuning with LoRA only (no quantization) and fine-tuning with QLoRA.

1. Fine-tuning with LoRA only (no quantization)
In this scenario, the main model remains in full precision (e.g., 32-bit), while the LoRA part updates only a small fraction of parameters (assumed to be about 1%):

(1) Main Model Parameters
- Number of parameters: 7B (7 billion) parameters
- Memory per parameter: 4 bytes (32-bit)
- Storage requirement: 7B × 4 bytes ≈ 28 GB
(2) Additional LoRA Part
- Assuming 1% of parameters are updated: 7 billion × 1% = 70 million parameters
- LoRA parameter storage (FP32): 70 million × 4 bytes ≈ 280MB
- Optimizer state (e.g., Adam saves first and second moments, about 2× the parameter memory): 70 million × 4 bytes × 2 ≈ 560MB
- Total additional LoRA part: 280MB + 560MB ≈ 840MB
(3) Activations and Other Runtime Overheads
Based on the number of network layers, batch size, and sequence length, generally reserve about 2GB (this value fluctuates with actual conditions).

(4) Total GPU Memory Requirement
Add up all parts:
- Main model: 28GB
- LoRA part: 840MB
- Activations and other overheads: about 2GB
Total: 28GB + 0.84GB + 2GB ≈ 30.84GB

Summary: Without quantization, LoRA fine-tuning of a 7B model requires approximately 31GB of GPU memory (an estimate; actual values may vary slightly).

2. Fine-tuning with QLoRA
In QLoRA mode, the main model parameters are stored at low bit width (e.g., 4-bit quantization), while the LoRA part usually remains in full precision (FP32). This dramatically reduces the main model's GPU memory usage.

(1) Quantized Main Model
- 4-bit representation: Each parameter occupies about 0.5 bytes
- Storage requirement: 7B × 0.5 bytes ≈ 3.5 GB
(2) Additional LoRA Part
As before (still assuming 1% of parameters are updated):
- Parameter storage: 280MB
- Optimizer state: 560MB
- Total: approximately 840MB
(3) Activations and Other Runtime Overheads
- About 2GB, as before
(4) Total GPU Memory Requirement
Add up all parts:
- Quantized main model: 3.5GB
- LoRA part: 840MB
- Activations and others: about 2GB
Total: 3.5GB + 0.84GB + 2GB ≈ 6.34GB
Summary: After quantization using QLoRA, the 7B model requires approximately 6-7GB of GPU memory for LoRA fine-tuning, significantly reducing memory requirements.
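If you want to sanity-check these totals, the estimator sketched at the end of section 3 reproduces both scenarios under the same assumptions:

```python
# Scenario 1: LoRA on a full-precision (FP32) 7B base model
print(f"{estimate_finetune_memory_gb(7e9, base_bytes_per_param=4.0):.2f} GB")  # 30.84 GB

# Scenario 2: QLoRA with a 4-bit quantized base model
print(f"{estimate_finetune_memory_gb(7e9, base_bytes_per_param=0.5):.2f} GB")  # 6.34 GB
```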
Conclusion:
- LoRA-only fine-tuning (full-precision model): requires approximately 30-31GB of GPU memory.
- QLoRA fine-tuning (quantized main model + LoRA): requires approximately 6-7GB of GPU memory.
5. When to Choose LoRA/QLoRA Fine-Tuning?
If your deployed model is in full precision (FP32), use LoRA fine-tuning. If the deployed model is in half precision (FP16) or already quantized, use QLoRA fine-tuning. Note that even if the original model is in full precision, during QLoRA fine-tuning it will be quantized to reduce memory overhead, and only the additional adapter parameters are trained. The benefit is significant memory savings, but it also means you are training against the quantized model state rather than a purely full-precision one.
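In practice, both paths are commonly set up with the Hugging Face peft and bitsandbytes libraries. The sketch below shows the QLoRA variant; exact arguments are version-dependent (check the current peft/transformers docs), and the model name and target_modules are placeholders typical of LLaMA-style models. For plain LoRA, omit the quantization step:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_name = "your-7b-model"  # placeholder: substitute your actual checkpoint

# QLoRA: load the frozen base model in 4-bit, then attach trainable adapters.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,  # drop this line for plain full-precision LoRA
)

lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # confirms only a tiny fraction is trainable
```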
6. Practical Tuning Suggestions
- Test with Small Batches: Start with a small batch size, observe memory usage, then gradually increase.
- Consult Model-Specific Guidelines: Each model and framework may have its own recommendations, typically found on Hugging Face and in related GitHub projects.
- Use Monitoring Tools: Tools like nvidia-smi can monitor memory usage in real time, helping you adjust parameters (a small PyTorch-based sketch follows this list).
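Alongside nvidia-smi, PyTorch's built-in CUDA memory counters are handy for per-step measurements. A minimal sketch:

```python
import torch

torch.cuda.reset_peak_memory_stats()

# ... run one forward/backward training step here ...

print(f"currently allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"peak allocated:      {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
```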
Summary
- LoRA updates only a small fraction of parameters, saving far more GPU memory than full-model fine-tuning.
- QLoRA further compresses model storage through quantization, making it better suited to devices with limited GPU memory.
- When estimating GPU memory requirements, consider primarily model parameters, activations, and optimizer state, then adjust for specifics such as batch size and sequence length.
I hope this simple guide gives you an intuitive sense of the GPU memory requirements for LoRA and QLoRA fine-tuning. Thanks for reading!