The following code implements LoRA (Low-Rank Adaptation), a technique that fine-tunes a pre-trained model by training only a small number of added parameters while the original weights stay frozen. Below, I explain the function and role of each part in detail:
Core Class: LoRALayer
This is the base class for all LoRA layers, defining the basic parameters and functions of LoRA:
    class LoRALayer():
        def __init__(self, r: int, lora_alpha: int, lora_dropout: float, merge_weights: bool):
            self.r = r  # Rank of the low-rank matrix
            self.lora_alpha = lora_alpha  # Numerator of the scaling factor
            # Optional dropout layer to prevent overfitting
            if lora_dropout > 0.:
                self.lora_dropout = nn.Dropout(p=lora_dropout)
            else:
                self.lora_dropout = lambda x: x  # Return input directly when no dropout
            self.merged = False  # Flag to indicate whether weights have been merged
            self.merge_weights = merge_weights  # Flag to indicate whether to merge weights
This class mainly serves to store the core parameters of LoRA: the rank r of the low-rank matrix, the alpha value related to the scaling factor, the dropout configuration, and the flags related to weight merging.
LoRA Implementation in Embedding Layer
Inherits from PyTorch’s nn.Embedding and LoRALayer, adding LoRA capabilities to the embedding layer:
    class Embedding(nn.Embedding, LoRALayer):
        def __init__(self, num_embeddings: int, embedding_dim: int, r: int = 0, ...):
            # Initialize the original embedding layer and LoRA parameters
            nn.Embedding.__init__(self, num_embeddings, embedding_dim, **kwargs)
            LoRALayer.__init__(self, r=r, lora_alpha=lora_alpha, lora_dropout=0,
                               merge_weights=merge_weights)
            # Create LoRA parameters only when r > 0
            if r > 0:
                # Low-rank matrix A: (r, num_embeddings)
                self.lora_A = nn.Parameter(self.weight.new_zeros((r, num_embeddings)))
                # Low-rank matrix B: (embedding_dim, r)
                self.lora_B = nn.Parameter(self.weight.new_zeros((embedding_dim, r)))
                self.scaling = self.lora_alpha / self.r  # Scaling factor
                self.weight.requires_grad = False  # Freeze pre-trained weights
            self.reset_parameters()  # Initialize parameters
The core idea is to add two low-rank matrices A and B alongside the original embedding table. Each token index selects an r-dimensional vector from A, B projects that vector up to embedding_dim, and the result, multiplied by the scaling factor lora_alpha / r, is added to the original embedding.
The reset_parameters() method initializes the parameters; the train() method manages the weight-merging state when switching between training and inference modes; and the forward() method implements forward propagation: when LoRA is enabled and the weights are not merged, it returns the original embedding plus the LoRA adjustment.
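The forward logic described above can be sketched as follows. This is a minimal standalone illustration, not the library's code: the sizes, the random values for A and B, and the helper name lora_embedding_forward are all assumptions made for the example. Random factors are used so that the LoRA term is non-zero; in real training one of the two factors starts at zero so that the initial adjustment is zero.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_embeddings, embedding_dim, r, lora_alpha = 10, 6, 2, 4  # illustrative sizes

weight = torch.randn(num_embeddings, embedding_dim)  # frozen pre-trained table
lora_A = torch.randn(r, num_embeddings)              # A: (r, num_embeddings)
lora_B = torch.randn(embedding_dim, r)               # B: (embedding_dim, r)
scaling = lora_alpha / r


def lora_embedding_forward(token_ids: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: original embedding plus the scaled low-rank term."""
    base = F.embedding(token_ids, weight)        # (..., embedding_dim)
    low_rank = F.embedding(token_ids, lora_A.T)  # (..., r): one column of A per token
    return base + (low_rank @ lora_B.T) * scaling


token_ids = torch.tensor([[1, 3, 5]])
out = lora_embedding_forward(token_ids)

# The same result comes from a single lookup in the merged table W + (B @ A)^T * scaling.
merged_table = weight + (lora_B @ lora_A).T * scaling
merged_out = F.embedding(token_ids, merged_table)
print(torch.allclose(out, merged_out, atol=1e-4))
```

The final check shows why merging is possible: the separate lookup-and-project path and a single lookup in the merged table give the same output.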
LoRA Implementation in Linear Layer
Similar to the Embedding class, but for linear layers:
    class Linear(nn.Linear, LoRALayer):
        def __init__(self, in_features: int, out_features: int, r: int = 0, ...):
            # Initialize the original linear layer and LoRA parameters
            nn.Linear.__init__(self, in_features, out_features, **kwargs)
            LoRALayer.__init__(self, r=r, lora_alpha=lora_alpha, lora_dropout=lora_dropout,
                               merge_weights=merge_weights)
            self.fan_in_fan_out = fan_in_fan_out  # Flag for weight storage method
            if r > 0:
                # Low-rank matrix A: (r, in_features)
                self.lora_A = nn.Parameter(self.weight.new_zeros((r, in_features)))
                # Low-rank matrix B: (out_features, r)
                self.lora_B = nn.Parameter(self.weight.new_zeros((out_features, r)))
                self.scaling = self.lora_alpha / self.r
                self.weight.requires_grad = False  # Freeze pre-trained weights
            self.reset_parameters()
The forward propagation logic is similar to that of the Embedding class, but includes weight transposition handling to accommodate different weight storage methods (fan_in_fan_out).
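As a rough sketch of that forward logic, the example below computes the unmerged output (frozen linear layer plus scaled low-rank path) and compares it with the output of a single linear layer whose weight has the LoRA product merged in. The sizes, the random factor values, and the helper name maybe_transpose are assumptions for illustration; dropout is omitted for brevity.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
in_features, out_features, r, lora_alpha = 8, 5, 2, 4  # illustrative sizes

weight = torch.randn(out_features, in_features)  # frozen weight, stored as (out, in)
bias = torch.randn(out_features)
lora_A = torch.randn(r, in_features)             # A: (r, in_features)
lora_B = torch.randn(out_features, r)            # B: (out_features, r)
scaling = lora_alpha / r


def maybe_transpose(w: torch.Tensor, fan_in_fan_out: bool) -> torch.Tensor:
    """Hypothetical helper: undo (in, out) storage so the weight is (out, in)."""
    return w.transpose(0, 1) if fan_in_fan_out else w


x = torch.randn(3, in_features)

# Unmerged path: frozen layer output plus the scaled low-rank adjustment.
unmerged = F.linear(x, maybe_transpose(weight, False), bias) + (x @ lora_A.T @ lora_B.T) * scaling

# Merged path: fold B @ A into the weight once, then run a plain linear layer.
merged_weight = weight + (lora_B @ lora_A) * scaling
merged = F.linear(x, merged_weight, bias)
print(torch.allclose(unmerged, merged, atol=1e-4))

# A weight stored as (in, out) gives the same layer after transposition.
stored_in_out = weight.T.contiguous()
same_layer = torch.equal(maybe_transpose(stored_in_out, True), weight)
```

Because the two paths agree, the low-rank matrices can stay separate while they are being trained and be folded into the weight afterwards.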
MergedLinear Layer
Supports applying LoRA to certain dimensions of the output features:
    class MergedLinear(nn.Linear, LoRALayer):
        def __init__(self, in_features: int, out_features: int, r: int = 0,
                     enable_lora: List[bool] = [False], ...):
            # Initialize parameters...
            self.enable_lora = enable_lora  # Flag to indicate which output groups enable LoRA
            # Implementation details...
This is useful when a single linear layer produces several projections at once, such as a fused query/key/value projection in a Transformer. The enable_lora list holds one flag per output group, so LoRA can be applied to, say, the query and value groups while the key group is left untouched.
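To make the grouping concrete, here is a small pure-Python sketch of how a per-group flag list can be expanded into a per-output-feature mask. The function name lora_output_mask and the sizes are hypothetical, and the sketch assumes the output features are split into equally sized, contiguous groups.

```python
from typing import List


def lora_output_mask(out_features: int, enable_lora: List[bool]) -> List[bool]:
    """Hypothetical helper: True for every output feature that receives a LoRA update."""
    if out_features % len(enable_lora) != 0:
        raise ValueError("out_features must be divisible by len(enable_lora)")
    group_size = out_features // len(enable_lora)
    mask: List[bool] = []
    for enabled in enable_lora:
        mask.extend([enabled] * group_size)  # one contiguous block per group
    return mask


# A fused q/k/v projection with 4 features per group: adapt q and v, skip k.
mask = lora_output_mask(12, [True, False, True])
print(mask)
```

With this layout, only the first and last blocks of four output features would be adjusted; the middle block keeps the frozen weights' output unchanged.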
LoRA Implementation in Convolutional Layers
The ConvLoRA class adds LoRA capabilities to convolutional layers; Conv2d, Conv1d, and Conv3d are derived from it:
    class ConvLoRA(nn.Module, LoRALayer):
        def __init__(self, conv_module, in_channels, out_channels, kernel_size, r=0, ...):
            self.conv = conv_module(in_channels, out_channels, kernel_size, **kwargs)
            # Create LoRA parameters for the convolutional layer...

        def forward(self, x):
            if self.r > 0 and not self.merged:
                # Add LoRA adjustment to convolution weights
                return self.conv._conv_forward(
                    x,
                    self.conv.weight + (self.lora_B @ self.lora_A).view(self.conv.weight.shape) * self.scaling,
                    self.conv.bias
                )
            return self.conv(x)
The LoRA implementation in convolutional layers is similar to that in linear layers, but requires reshaping the result of the low-rank matrix product to fit the shape of the convolution kernel.
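The reshaping step can be illustrated with a short sketch. The factor shapes below are an assumption, chosen so that the product B @ A has exactly as many elements as a 2D convolution kernel with groups=1; the sketch also calls the public F.conv2d rather than the private _conv_forward method used in the excerpt.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
in_channels, out_channels, kernel_size, r, lora_alpha = 3, 4, 3, 2, 4  # illustrative sizes
conv = nn.Conv2d(in_channels, out_channels, kernel_size)  # weight: (4, 3, 3, 3)

# Assumed factor shapes: (out*k, r*k) @ (r*k, in*k) -> (out*k, in*k),
# which has out * in * k * k elements, the same as conv.weight.
lora_A = torch.randn(r * kernel_size, in_channels * kernel_size)
lora_B = torch.randn(out_channels * kernel_size, r * kernel_size)
scaling = lora_alpha / r

# Reshape the 2D low-rank product into the 4D kernel shape, then scale it.
delta = (lora_B @ lora_A).view(conv.weight.shape) * scaling

x = torch.randn(1, in_channels, 8, 8)
with torch.no_grad():
    y = F.conv2d(x, conv.weight + delta, conv.bias)
print(delta.shape, y.shape)
```

The view call only succeeds when the element counts match, which is why the factor shapes have to be derived from the kernel size as well as the channel counts.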
Utility Functions
1. mark_only_lora_as_trainable: Marks only LoRA-related parameters as trainable, freezing other parameters.
    def mark_only_lora_as_trainable(model: nn.Module, bias: str = 'none') -> None:
        for n, p in model.named_parameters():
            if 'lora_' not in n:  # Only keep LoRA-related parameters trainable
                p.requires_grad = False
        # Handle training settings for bias terms...
2. lora_state_dict: Extracts a state dictionary containing only LoRA-related parameters for easy saving and loading.
    def lora_state_dict(model: nn.Module, bias: str = 'none') -> Dict[str, torch.Tensor]:
        my_state_dict = model.state_dict()
        if bias == 'none':
            return {k: my_state_dict[k] for k in my_state_dict if 'lora_' in k}
        # Handle other bias options...
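The sketch below shows how the two ideas fit together on a toy model. It does not call the library; the class TinyLoRALinear and the helpers freeze_all_but_lora and lora_only_state_dict are hypothetical stand-ins that mirror the name-based filtering shown above (the bias == 'none' case only).

```python
import torch
import torch.nn as nn


class TinyLoRALinear(nn.Module):
    """Hypothetical minimal module whose LoRA parameters use the 'lora_' prefix."""

    def __init__(self, in_features: int = 4, out_features: int = 3, r: int = 2):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features))
        self.lora_A = nn.Parameter(torch.randn(r, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))


def freeze_all_but_lora(model: nn.Module) -> None:
    # Freeze every parameter whose name does not contain 'lora_'.
    for name, param in model.named_parameters():
        if 'lora_' not in name:
            param.requires_grad = False


def lora_only_state_dict(model: nn.Module) -> dict:
    # Keep only the entries whose key contains 'lora_'.
    return {k: v for k, v in model.state_dict().items() if 'lora_' in k}


model = nn.Sequential(TinyLoRALinear(), TinyLoRALinear())
freeze_all_but_lora(model)
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
checkpoint = lora_only_state_dict(model)

# The checkpoint lacks the frozen weights, so loading needs strict=False.
missing, unexpected = model.load_state_dict(checkpoint, strict=False)
print(trainable)
print(missing)
```

Saving only the filtered dictionary keeps checkpoints small, since the frozen pre-trained weights can be reloaded from the original model file.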
Summary of LoRA Working Principles
1. Low-Rank Decomposition: Represents weight updates as the product of two low-rank matrices A and B, significantly reducing the number of trainable parameters.
2. Weight Freezing: Freezes the original weights of the pre-trained model, training only the low-rank matrices added by LoRA.
3. Merging/Splitting Weights: Keeps the LoRA matrices separate from the frozen weights during training and merges them into the weights for inference, so the merged layer needs no extra computation in the forward pass.
4. Flexible Application: LoRA can be selectively applied to certain layers or dimensions of the model.
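The parameter savings from point 1 can be checked with simple arithmetic. The layer size and rank below are illustrative assumptions, not values taken from any particular model.

```python
def full_update_params(d_out: int, d_in: int) -> int:
    """Parameters in a dense update to a (d_out, d_in) weight matrix."""
    return d_out * d_in


def lora_update_params(d_out: int, d_in: int, r: int) -> int:
    """Parameters in A (r, d_in) plus B (d_out, r)."""
    return r * d_in + d_out * r


d_out, d_in, r = 4096, 4096, 8  # assumed sizes for illustration
full = full_update_params(d_out, d_in)
lora = lora_update_params(d_out, d_in, r)
print(f"full update: {full:,} parameters")
print(f"LoRA update: {lora:,} parameters ({full // lora}x fewer)")
```

For this assumed 4096 x 4096 layer with r = 8, the low-rank update trains 65,536 parameters instead of 16,777,216.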
This method significantly reduces the number of trainable parameters and the computational cost of fine-tuning while largely preserving model performance, which makes it particularly suitable for personalized fine-tuning of large models.