Efficient Transformer for TinyML: Long-Short Range Attention

Hi everyone, I'm Lite. I recently shared Parts 1 to 19 of the Efficient Large Model Full-Stack Technology series, covering large model quantization, fine-tuning, efficient LLM inference, quantum computing, generative AI acceleration, and more. The content links are as follows:
Efficient Large Model Full-Stack Technology (19): Efficient Training and Inference Framework for Models | TorchSparse++: Efficient Training and Inference Framework for Sparse Convolutions on GPU
I have also recently shared Parts 1 to 7 of the TinyML Project series, covering efficient ML systems, efficient CNN algorithm & system co-design, efficient Transformers, and more. The content links are as follows:
TinyML Project (1): Efficient ML System | TinyChat: Visual Language Model and Edge AI 2.0
TinyML Project (2): Efficient ML System | TinyChat Engine: On-device LLM Inference Library
TinyML Project (3): Efficient CNN Algorithm & System Co-design | MCUNetV3: Device-side Training with 256KB Memory
TinyML Project (4): Efficient CNN Algorithm & System Co-design | TinyTL: Reducing Activations, Not Trainable Parameters, for Efficient On-Device Learning
TinyML Project (5): Efficient LLM Inference | Block Sparse Attention
TinyML Project (6): Efficient Intelligent Driving Applications | BEVFusion: Multi-task Multi-sensor Fusion with Unified Bird’s Eye View Representation
TinyML Project (7): Efficient Transformer | Flat Window Attention for Efficient Point Cloud Transformer

Article Overview

1. Background and Challenges

– Transformers are widely used in natural language processing, but their high computational cost makes them poorly suited to mobile devices.

– Automated neural architecture search (NAS) can design efficient models, but the search cost, both computational and environmental, is enormous.

2. Lite Transformer Design

– Proposes the Long-Short Range Attention (LSRA) mechanism, which splits attention into a global branch and a local branch.

– The global branch uses the attention mechanism, while the local branch uses convolution to capture local dependencies.

3. Experiments and Results

– In machine translation (WMT 2014 English-French), Lite Transformer exceeds the Transformer by 1.2/1.7 BLEU under 500M/100M MACs constraints.

– In language modeling tasks, Lite Transformer reduces perplexity by 1.8 compared to Transformer under 500M MACs.

– Combined with pruning and quantization, Lite Transformer's model size is compressed by 18.2×.

4. Comparison with AutoML

– Lite Transformer outperforms AutoML-based Evolved Transformer by 0.5 BLEU in mobile settings.

– The design cost of Lite Transformer is significantly reduced, resulting in a 20,000 times reduction in CO₂ emissions.

5. Conclusion

– Lite Transformer significantly outperforms Transformer in multiple language tasks and is suitable for mobile devices.

– Demonstrates that manual design combined with domain knowledge is more effective than AutoML for specific tasks.


Article link: https://arxiv.org/pdf/2004.11886

Project link: https://github.com/mit-han-lab/lite-transformer

TL;DR

Lite Transformer replaces the Transformer's self-attention with Long-Short Range Attention (LSRA), pairing a convolutional local branch with an attentional global branch. Under mobile compute budgets it beats the Transformer on translation, summarization, and language modeling, at a tiny fraction of the design cost of NAS-derived models.

Article Method

Lite Transformer is an efficient natural language processing (NLP) architecture designed for mobile devices.

1. Long-Short Range Attention (LSRA) Mechanism:

– The core of Lite Transformer is the Long-Short Range Attention (LSRA) mechanism, which splits attention into a global branch and a local branch.

– The global branch focuses on modeling long-distance dependencies using standard attention mechanisms.

– The local branch focuses on modeling local dependencies using convolution.

2. Model Architecture:

– Lite Transformer is based on a sequence-to-sequence (seq2seq) learning architecture, consisting of an encoder and a decoder.

– The self-attention layers in the encoder and decoder are replaced with LSRA modules.

– The LSRA module consists of two branches: an attention branch and a convolution branch. The input is split along the channel dimension, each half is fed to one branch, and the outputs are merged and mixed by a feed-forward network (FFN); a minimal sketch follows below.
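To make the split-branch design concrete, here is a minimal PyTorch sketch of an LSRA-style block. It illustrates the idea rather than reproducing the paper's exact implementation: the sizes (embed_dim=256, num_heads=4, kernel_size=7) are placeholder choices, a plain depthwise convolution stands in for the lightweight/dynamic convolutions used in the official repo, and normalization/residual placement is simplified.

```python
import torch
import torch.nn as nn

class LSRABlock(nn.Module):
    """LSRA-style block: split channels, run attention (global) and
    depthwise convolution (local) in parallel, then merge with an FFN."""

    def __init__(self, embed_dim=256, num_heads=4, kernel_size=7):
        super().__init__()
        half = embed_dim // 2
        # Global branch: multi-head self-attention on half the channels.
        self.attn = nn.MultiheadAttention(half, num_heads, batch_first=True)
        # Local branch: depthwise 1-D conv as a stand-in for the paper's
        # lightweight/dynamic convolutions.
        self.conv = nn.Conv1d(half, half, kernel_size,
                              padding=kernel_size // 2, groups=half)
        # Feed-forward network that mixes the merged halves.
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, 4 * embed_dim),
            nn.ReLU(),
            nn.Linear(4 * embed_dim, embed_dim),
        )

    def forward(self, x):                 # x: (batch, seq, embed_dim)
        g, l = x.chunk(2, dim=-1)         # split along the channel dim
        g, _ = self.attn(g, g, g)         # long-range dependencies
        l = self.conv(l.transpose(1, 2)).transpose(1, 2)  # local context
        return x + self.ffn(torch.cat([g, l], dim=-1))    # merge + residual

block = LSRABlock()
out = block(torch.randn(2, 16, 256))      # -> torch.Size([2, 16, 256])
```

The key efficiency point is that each branch processes only half the channels, so the block costs roughly half of a full self-attention layer plus a cheap convolution.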


Experimental Results

1. Machine Translation Task:

– IWSLT 2014 German-English:

– Under the constraint of 100M Mult-Adds, Lite Transformer outperforms Transformer by 3.1 BLEU.

– WMT 2014 English-German:

– Under the constraint of 500M Mult-Adds, Lite Transformer outperforms Transformer by 0.4 BLEU.

– Under the constraint of 100M Mult-Adds, Lite Transformer outperforms Transformer by 1.2 BLEU.

– WMT 2014 English-French:

– Under the constraint of 500M Mult-Adds, Lite Transformer outperforms Transformer by 1.2 BLEU.

– Under the constraint of 100M Mult-Adds, Lite Transformer outperforms Transformer by 1.7 BLEU.

2. Abstractive Summarization Task:

– On the CNN-DailyMail dataset, Lite Transformer maintains similar F1-Rouge scores compared to Transformer while reducing computational load by 2.4 times and model size by 2.5 times.

3. Language Modeling Task:

– On the WIKITEXT-103 dataset, Lite Transformer reduces perplexity by 1.8 under the constraint of 500M MACs compared to Transformer.

4. Model Compression:

– Combining pruning and quantization techniques, Lite Transformer compresses its model size by 18.2×; an illustrative sketch appears after this list.

5. Comparison with AutoML:

– In mobile settings, Lite Transformer outperforms AutoML-based Evolved Transformer by 0.5 BLEU, and the design cost is significantly reduced, leading to a 20,000 times reduction in CO₂ emissions.
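The paper does not publish an exact compression script; the following is a minimal PyTorch sketch of the general pruning-plus-quantization recipe on a toy stand-in model. The 50% pruning ratio and the use of dynamic 8-bit quantization are illustrative assumptions, not the paper's exact settings.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy stand-in for a trained Lite Transformer; the real recipe would be
# applied to the model from github.com/mit-han-lab/lite-transformer.
model = nn.Sequential(nn.Linear(256, 1024), nn.ReLU(), nn.Linear(1024, 256))

# Step 1: magnitude pruning, zeroing the smallest 50% of each weight
# matrix (the ratio here is illustrative, not the paper's schedule).
for module in model:
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")   # bake the zeros into the tensor

# Step 2: 8-bit dynamic quantization of the linear layers, storing
# weights as int8 (~4x smaller); sparse storage of the pruned zeros
# provides further compression toward the reported 18.2x.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
```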


Final Thoughts

Here I recommend the latest Fall 2024 course, 6.5940, from MIT HAN Lab (the course is ongoing).
Course link: https://efficientml.ai
Course materials: https://pan.quark.cn/s/1324c20d7efd
Your likes, views, and follows are my greatest motivation to continue!

Scan the QR code to add me, or add my WeChat (ID: LiteAI01), to discuss technology, careers, and career planning. Please note "Research direction + School/Region + Name".

