Single-GPU Operation for Thousands of Large Models: UC Berkeley’s S-LoRA Method


Originally from PaperWeekly | Author: Dan Jiang | Affiliation: National University of Singapore

Generally speaking, the deployment of large language models follows a "pre-train, then fine-tune" paradigm. However, when a base model must be fine-tuned for a large number of tasks (such as personalized assistants), the training and serving costs can become very high. Low-Rank Adaptation (LoRA) is a parameter-efficient fine-tuning …

S-LoRA: Enabling Thousands of Large Models on a GPU


Machine Heart report | Editor: Danjiang

Generally, the deployment of large language models adopts a "pre-train, then fine-tune" approach. However, when the base model must be fine-tuned for a large number of tasks (such as personalized assistants), the training and serving costs can become extremely high. Low-Rank Adaptation (LoRA) is a parameter-efficient fine-tuning method, typically used to adapt the base …
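To make the parameter-efficiency point concrete, below is a minimal NumPy sketch of a LoRA-style update for a single linear layer. The dimensions, variable names, and initialization are illustrative assumptions, not details of S-LoRA's actual implementation; the point is only that each task's adapter is a tiny fraction of the frozen base weight, which is why many adapters can share one GPU alongside one base model.

```python
import numpy as np

# Hypothetical dimensions for one projection layer in a base model.
d_model, rank = 4096, 8

# Frozen base weight (shared across all tasks) and one task-specific LoRA adapter.
W = np.random.randn(d_model, d_model).astype(np.float32)      # base weight, frozen
A = np.random.randn(rank, d_model).astype(np.float32) * 0.01  # adapter down-projection
B = np.zeros((d_model, rank), dtype=np.float32)               # adapter up-projection

def forward(x: np.ndarray) -> np.ndarray:
    # LoRA: y = W x + B (A x); only A and B are trained per task.
    return W @ x + B @ (A @ x)

# The adapter holds 2 * rank * d_model parameters versus d_model^2 in the base weight.
adapter_params = A.size + B.size
base_params = W.size
print(f"adapter/base parameter ratio: {adapter_params / base_params:.4%}")  # ~0.39% here
```

With a small rank (8 in this sketch), each adapter costs well under one percent of the base layer's parameters, so serving thousands of fine-tuned variants reduces to swapping in small A and B matrices rather than loading thousands of full models.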