Transformers Articles

Opportunities and Challenges of Edge Deployment of GenAI: NPU as the Key to Breakthrough

2025-09-07 by boardor

In the past decade, artificial intelligence (AI) and machine learning (ML) have undergone significant transformations: convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are giving way to Transformers and generative artificial intelligence (GenAI). This transformation is driven by the industry’s urgent demand for models that are efficient, accurate, context-aware, and capable of handling complex tasks. Initially, AI … Read more

Fine-tuning CPU Lora ChatGLM2-6B

2025-06-12 by boardor

The open-source dataset I found contains fewer than 50,000 Q&A pairs, and over 200 GB of memory is recommended. My local setup with 60 GB of memory cannot run it. The LoRA implementation uses Hugging Face’s peft: https://github.com/huggingface/peft. Two versions of the training code were written. One follows the peft example: https://github.com/huggingface/peft/tree/main/examples. With 60 GB of memory and … Read more

LoRA: Low-Rank Adaptation for Large Models

2025-06-12 by boardor

Source: DeepHub IMBA. This article is approximately 1,000 words, with an estimated reading time of 5 minutes. Low-Rank Adaptation significantly reduces the number of trainable parameters for downstream tasks. For large models, it becomes impractical to fine-tune all model parameters. For example, GPT-3 has 175 billion parameters, making both full fine-tuning and deployment of each fine-tuned copy prohibitively expensive. … Read more

Streaming Output for Model Inference in Transformers

2025-05-08 by boardor

This article introduces how to implement streaming output for model inference with the transformers module. The transformers module provides a built-in streaming method for emitting output during model inference. Additionally, model deployment frameworks such as vLLM and TGI offer better support for streaming output. Below, we will detail how … Read more

ReLoRA: Efficient Large Model Training Through Low-Rank Updates

2025-05-03 by boardor

This article focuses on reducing the training costs of large Transformer language models. The author introduces a low-rank update-based method called ReLoRA. A core principle in the development of deep learning over the past decade has been to “stack more layers,” and the author aims to explore whether stacking can similarly enhance training efficiency for … Read more

Understanding the Principles of LoRA

2025-04-15 by boardor

Introduction: As model scale continues to grow, fine-tuning all of a model's parameters (so-called full fine-tuning) is becoming less and less feasible. Taking GPT-3 with 175 billion parameters as an example, each new domain requires fully fine-tuning a new copy of the model, which is very costly. Paper: LORA: LOW-RANK ADAPTATION OF LARGE LANGUAGE … Read more