Solving Composite Problems in One Inference: The MeteoRA Architecture for Scalable Integration of Knowledge Modules in MoE-based Large Language Models

The AIxiv column is a section published by Machine Heart that features academic and technical content. Over the past few years, the AIxiv column has reported on more than 2,000 pieces of content, covering top laboratories at major universities and companies worldwide and effectively promoting academic exchange and dissemination. If you have excellent work to share, we welcome submissions or contact us for coverage. Submission email: [email protected]; [email protected]

This article is from the DeepEngine team at the Software Research Institute of Nanjing University, authored by Assistant Professor Xu Jingwei and Master's students Lai Junyu and Huang Yunpeng. The paper has been accepted to ICLR 2025.

In the field of large language models, the pre-training + fine-tuning paradigm has become an important foundation for deploying a wide range of downstream applications. Within this paradigm, parameter-efficient fine-tuning (PEFT) of large models with low-rank adaptation (LoRA) has produced a large number of reusable, task-specific LoRA adapters. However, serving LoRA adapters this way requires explicit intent selection: a single large language model equipped with multiple LoRA adapters cannot recognize the task and switch adapters on its own.

To address this, the team proposes MeteoRA, a scalable and efficient multi-task embedding architecture. The framework reuses multiple task-specific LoRA adapters on top of the base model through a full-mode mixture-of-experts (MoE) design, and includes a novel forward acceleration strategy for the MoE computation that removes the efficiency bottleneck of conventional implementations. Large language models equipped with MeteoRA achieve outstanding performance on composite problems, efficiently solving ten different, sequentially presented problems in a single inference pass and demonstrating a markedly improved ability to switch adapters on the fly.

  • Paper Title: MeteoRA: Multiple-tasks Embedded LoRA for Large Language Models
  • Paper Link: https://arxiv.org/abs/2405.13053
  • Project Homepage: https://github.com/NJUDeepEngine/meteora

The main innovations of this work include:

  • Scalable LoRA Integration Framework: The MeteoRA framework integrates existing LoRA adapters and gives the large language model the ability to autonomously select and switch among different LoRAs on demand (a usage sketch contrasting this with explicit adapter switching follows this list).
  • Forward Acceleration Strategy for Mixture-of-Experts Models: The work identifies the efficiency bottleneck of conventional MoE-style forward passes and implements a forward-pass acceleration strategy with new GPU kernel operators, achieving an average speedup of about 4x while keeping memory overhead unchanged.
  • Outstanding Model Performance: Evaluations show that large language models equipped with the MeteoRA framework achieve excellent performance on composite tasks, improving the practical effectiveness of combining existing LoRA adapters in a single large language model.
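As a concrete illustration of the first point (this code is not from the paper): with a conventional multi-LoRA setup such as the Hugging Face PEFT library, the caller has to load and explicitly activate the adapter that matches the incoming task, whereas MeteoRA performs this selection internally, per token, through its gating networks. The model ID and adapter paths below are placeholders.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Placeholder identifiers -- substitute real checkpoints in practice.
base_id = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id)

# Conventional multi-LoRA serving: every adapter must be named and the right one
# explicitly activated before each request (explicit intent selection).
model = PeftModel.from_pretrained(base, "path/to/lora_translation", adapter_name="translation")
model.load_adapter("path/to/lora_summarization", adapter_name="summarization")
model.set_adapter("translation")  # manual switch, chosen by the caller

inputs = tokenizer("Translate to French: good morning", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))

# With MeteoRA, all embedded adapters stay resident and the trained gating network
# in every attention/MLP linear layer picks the top-k adapters per token, so no
# set_adapter()-style call is needed even when tasks change mid-sequence.
```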

Related Work

Low-Rank Adaptation (LoRA): Low-rank adaptation [1] provides a strategy for reducing the number of trainable parameters required to fine-tune on downstream tasks. For a transformer-based large language model, LoRA injects a pair of trainable low-rank matrices alongside the weight matrix of each base linear layer, representing the fine-tuning update to the original weights as the product of the two low-rank matrices. LoRA can be applied to the seven weight matrices of the self-attention and multi-layer perceptron (MLP) modules of the transformer, greatly reducing the number of fine-tuned weights. With low-rank adaptation, the LoRA adapter parameters can be trained under the standard autoregressive language modeling objective without changing the pre-trained model parameters.

Multi-task LoRA Fusion: LoRA adapters are typically fine-tuned for one specific downstream task. To equip a large language model for multiple tasks, there are two mainstream approaches. One is to merge the datasets of multiple tasks and train a single LoRA adapter on the union, but related research has shown that learning knowledge from different domains simultaneously is challenging [2]. The other is to load multiple LoRA adapters into one model directly, as in PEFT [3] and S-LoRA [4]. However, these popular frameworks require the injected LoRA adapter to be specified explicitly, so the model cannot select or switch adapters autonomously or in a timely manner. Works such as LoRAHub [5] can select LoRA adapters without manual specification, but still require a small amount of learning for each downstream task.

Mixture of Experts (MoE): The mixture-of-experts model is a machine learning paradigm that improves efficiency and performance by combining the predictions of multiple models, using a gating network to dynamically route each input to the most relevant "experts" [6]. It exploits the specialized knowledge of different experts to improve overall performance on diverse and complex problems, and existing research has demonstrated the effectiveness of MoE on large-scale neural networks [7]. For each input, MoE activates only a subset of the experts, significantly improving computational efficiency without sacrificing model capacity. The approach has proven very effective for scaling transformer-based architectures, such as Mixtral [8].

Method Description

This work proposes MeteoRA, a scalable and efficient multi-task embedding architecture that can directly reuse LoRA adapters, whether taken from the open-source community or fine-tuned for specific downstream tasks, and autonomously select and switch to the appropriate adapters based on the input. As shown in Figure 1, the MeteoRA module can be integrated into all base linear layers of the attention and MLP modules, with each module holding a collection of low-rank matrices. Combined with the MoE forward acceleration strategy, the resulting large language model can efficiently solve a wide range of problems.

Figure 1: Framework of a large language model with MeteoRA modules implementing the MoE architecture

Figure 2 illustrates the internal structure of the MeteoRA module. The MeteoRA module embeds multiple LoRA adapters and realizes the MoE architecture through a gating network. For each input, the gating network selects the top-k LoRA adapters and combines them as the fine-tuning weights for the forward pass; in other words, it executes a routing strategy that picks appropriate LoRA adapters based on the input. Each MeteoRA module contains an independent gating network, and the different gating networks make independent decisions on their own inputs, dynamically selecting different LoRA adapters throughout the forward pass of all decoder blocks.

Figure 2: Architecture of the MeteoRA module, which integrates embedded LoRA adapters via an MoE design
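To make the routing concrete, below is a minimal PyTorch sketch of one such module, reconstructed from the description above rather than taken from the released code: a frozen base linear layer wrapped with N embedded LoRA adapters and a per-layer gating network that picks the top-k adapters for each token and adds their weighted low-rank updates to the base output. The class name, the softmax-over-top-k weighting, and the scaling convention are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MeteoRALinear(nn.Module):
    """Sketch of a MeteoRA-style linear layer: a frozen base projection plus N
    embedded LoRA adapters routed per token by a gating network (illustrative,
    not the official implementation)."""

    def __init__(self, base: nn.Linear, n_adapters: int, rank: int = 8,
                 top_k: int = 2, alpha: float = 16.0):
        super().__init__()
        self.base = base                                   # pretrained weights, kept frozen
        for p in self.base.parameters():
            p.requires_grad_(False)
        d_in, d_out = base.in_features, base.out_features
        # Stacked low-rank factors of the N pre-trained LoRA adapters; in MeteoRA
        # these are loaded from existing checkpoints and kept frozen, so only the
        # gating network below is trained.
        self.lora_A = nn.Parameter(torch.zeros(n_adapters, d_in, rank), requires_grad=False)
        self.lora_B = nn.Parameter(torch.zeros(n_adapters, rank, d_out), requires_grad=False)
        # Per-layer gating network over the N embedded adapters.
        self.gate = nn.Linear(d_in, n_adapters, bias=False)
        self.top_k = top_k
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, seq, d_in)
        logits = self.gate(x)                              # (batch, seq, N)
        weights, idx = torch.topk(logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)               # renormalize over the chosen k
        out = self.base(x)
        # Gather and apply the k selected adapters per token; the loop-based and
        # bmm/Triton accelerated variants of this step are discussed below.
        for j in range(self.top_k):
            A = self.lora_A[idx[..., j]]                   # (batch, seq, d_in, rank)
            B = self.lora_B[idx[..., j]]                   # (batch, seq, rank, d_out)
            delta = torch.einsum('bsd,bsdr,bsro->bso', x, A, B)
            out = out + self.scaling * weights[..., j:j+1] * delta
        return out
```

In the full framework, a module like this would wrap each of the base linear layers (attention and MLP projections) in every decoder block, each with its own independent gate, as described above.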
The training of the MeteoRA module follows standard fine-tuning under the autoregressive language modeling objective, with both the base model weights and the pre-trained LoRA adapter parameters kept frozen. Since the MeteoRA module selects the several LoRA adapters with the highest gating weights, the team introduced a joint optimization scheme that combines the autoregressive language modeling loss with a cross-entropy loss on the LoRA classification made by all gating networks; optimizing this joint objective trains the gating networks.

The core of the MeteoRA module is the MoE architecture that integrates multiple LoRA adapters. The team first consolidates the LoRA weights into contiguously allocated tensors on HBM. Because, for every token of every sequence in the batch, the module must look up the index set of adapters chosen by the gating network, a MeteoRA module can hardly match the efficiency of a single LoRA module. A straightforward loop-based implementation [8] iterates over the LoRA adapters with a for-loop, applying a LoRA forward pass in each iteration to the set of tokens that selected that adapter. In effect, this partitions all tokens of the batch into as many groups as there are adapters and processes the groups sequentially. Because input tokens are mutually independent, this approach fails to exploit parallel GEMM operators [9], especially when some adapters are selected by only a few tokens or when the number of tokens to process is smaller than the number of adapters; in experiments it can be up to 10 times slower.

This work therefore adopts the forward acceleration strategy bmm-torch, which directly indexes the top-k adapters for all tokens in the batch and performs the LoRA computation with two bmm (batched matrix multiplication) calls. Compared with the loop-based baseline, bmm-torch parallelizes over the top-k adapters of all batch tokens using PyTorch's bmm operator [10], achieving roughly a 4x speedup; in experiments it is only about 2.5 times slower than its upper bound, a single LoRA module. However, owing to PyTorch's indexing semantics [11], bmm-torch has to allocate extra memory for the batched gather, which can become a bottleneck when the batch size or the input length is large. To address this, the work additionally developed a custom GPU kernel operator in Triton [12] that retains about 80% of bmm-torch's computational efficiency while keeping memory overhead as low as the loop-based implementation.
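As a rough illustration of the bmm-torch strategy just described, the following sketch gathers the top-k adapter factors for every token with PyTorch fancy indexing and applies them with two torch.bmm calls. The function name, tensor layout, and the final weighted sum over the k adapters are assumptions for illustration; the released implementation may differ.

```python
import torch

def bmm_topk_lora(x, lora_A, lora_B, idx, gate_w, scaling=1.0):
    """Apply the top-k selected LoRA adapters to every token with two batched
    matrix multiplications (a sketch of the bmm-torch idea described above).

    x:      (B, S, D_in)   token activations
    lora_A: (N, D_in, R)   stacked A factors of the N embedded adapters
    lora_B: (N, R, D_out)  stacked B factors
    idx:    (B, S, K)      indices of the top-k adapters chosen by the gate
    gate_w: (B, S, K)      normalized gating weights for those adapters
    """
    B, S, K = idx.shape
    D_in = x.shape[-1]
    flat_idx = idx.reshape(-1)                      # (B*S*K,)
    A = lora_A[flat_idx]                            # gather: (B*S*K, D_in, R)
    Bm = lora_B[flat_idx]                           # gather: (B*S*K, R, D_out)
    # Repeat each token's activation once per selected adapter.
    xk = x.unsqueeze(2).expand(B, S, K, D_in).reshape(-1, 1, D_in)
    h = torch.bmm(xk, A)                            # first bmm:  (B*S*K, 1, R)
    delta = torch.bmm(h, Bm)                        # second bmm: (B*S*K, 1, D_out)
    delta = delta.reshape(B, S, K, -1)
    # Weighted sum over the k adapters selected for each token.
    return scaling * (gate_w.unsqueeze(-1) * delta).sum(dim=2)  # (B, S, D_out)

# Example: 2 sequences, 5 tokens, 8 embedded adapters, rank 8, top-2 routing.
B, S, N, D, R, K = 2, 5, 8, 64, 8, 2
x = torch.randn(B, S, D)
lora_A = torch.randn(N, D, R) * 0.02
lora_B = torch.randn(N, R, D)
idx = torch.randint(0, N, (B, S, K))
w = torch.softmax(torch.randn(B, S, K), dim=-1)
print(bmm_topk_lora(x, lora_A, lora_B, idx, w).shape)  # torch.Size([2, 5, 64])
```

Note that the lora_A[flat_idx] and lora_B[flat_idx] gathers materialize per-token copies of the adapter factors; this is the extra memory cost attributed to bmm-torch above, whereas the custom Triton kernel keeps memory at the level of the loop-based implementation.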
Experimental Results

This work validates the proposed architecture and design extensively on both independent and composite tasks, using two well-known large language models, LlaMA2-13B and LlaMA3-8B. Figure 3 shows a radar chart comparing the prediction performance of the MeteoRA model with 28 integrated tasks against a single LoRA per task and the PEFT reference on each independent task. The evaluation shows that, with either base model, MeteoRA performs on par with PEFT while requiring no explicit activation of a specific LoRA. Table 1 reports the average scores of all methods across the tasks.

Figure 3: Evaluation performance of the MeteoRA model on the 28 selected tasks

Table 1: Average performance of each model on the 28 selected tasks

To test the model's ability to solve composite problems sequentially, this work constructed three datasets by concatenating independent tasks, combining 3, 5, and 10 tasks, and expects the model to solve the differently typed problems in the order in which they appear in a single input. The results in Table 2 show that as the number of composed tasks grows, the MeteoRA module outperforms the reference LoRA-B model in almost every respect.

Table 2: Evaluation results on composite tasks; left: the MeteoRA module, right: LoRA-B

To further examine the gating networks inside the MeteoRA module, this work visualizes the LoRA selection pattern under a top-2 strategy during composite-task inference. At the boundary between two adjacent tasks, the gating network correctly performs the LoRA switching operation.

Figure 4: Example of LoRA selection on a composite of 3 tasks

To verify the computational efficiency of the forward-pass design with the custom GPU operator, this work samples portions of the 28 tasks and compares the new forward strategy against its upper bound and the original implementation. The results in Figure 5 demonstrate the strong performance of the new acceleration strategy.

Figure 5: Overall running time of four different forward-pass strategies on the 28 tasks
References:

[1] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
[2] Chen Ling, Xujiang Zhao, Jiaying Lu, Chengyuan Deng, Can Zheng, Junxiang Wang, Tanmoy Chowdhury, Yun Li, Hejie Cui, Xuchao Zhang, Tianjiao Zhao, Amit Panalkar, Dhagash Mehta, Stefano Pasquali, Wei Cheng, Haoyu Wang, Yanchi Liu, Zhengzhang Chen, Haifeng Chen, Chris White, Quanquan Gu, Jian Pei, Carl Yang, and Liang Zhao. Domain specialization as the key to make large language models disruptive: A comprehensive survey, 2024.
[3] Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, Sayak Paul, and Benjamin Bossan. PEFT: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft, 2022.
[4] Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer, et al. S-LoRA: Serving thousands of concurrent LoRA adapters. arXiv preprint arXiv:2311.03285, 2023.
[5] Chengsong Huang, Qian Liu, Bill Yuchen Lin, Tianyu Pang, Chao Du, and Min Lin. LoraHub: Efficient cross-task generalization via dynamic LoRA composition. arXiv preprint arXiv:2307.13269, 2023.
[6] Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton. Adaptive mixtures of local experts. Neural Computation, 3(1):79–87, 1991.
[7] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
[8] Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mixtral of experts, 2024.
[9] Shixun Wu, Yujia Zhai, Jinyang Liu, Jiajun Huang, Zizhe Jian, Bryan Wong, and Zizhong Chen. Anatomy of high-performance GEMM with online fault tolerance on GPUs. In Proceedings of the 37th International Conference on Supercomputing, pages 360–372, 2023.
[10] PyTorch. torch.bmm — PyTorch 2.3 documentation. https://pytorch.org/docs/stable/generated/torch.bmm.html, 2024. Accessed: 2024-05-23.
[11] PyTorch. Tensor indexing API — PyTorch documentation. https://pytorch.org/cppdocs/notes/tensor_indexing.html, 2024. Accessed: 2024-05-23.
[12] Philippe Tillet, Hsiang-Tsung Kung, and David Cox. Triton: An intermediate language and compiler for tiled neural network computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, pages 10–19, 2019.
