Efficient ML Systems: TinyChat Engine and On-Device LLM Inference

Hi everyone, I am Lite. I have shared nineteen articles so far on efficient large-model full-stack technology, covering topics such as large model quantization and fine-tuning, efficient LLM inference, quantum computing, and generative AI acceleration. The most recent is here:
Efficient Large Model Full-Stack Technology (Nineteen): Efficient Training and Inference Framework for Models | TorchSparse++: Efficient Training and Inference Framework for Sparse Convolutions on GPU
More installments are coming; stay tuned!
The first article in the TinyML project series, on TinyChat for visual language models and Edge AI 2.0, is here:
TinyML Project (One): Efficient ML Systems | TinyChat: Visual Language Models and Edge AI 2.0
Today I am sharing the second installment on efficient ML systems: TinyChat Engine, an on-device LLM inference library.

Efficient TinyChat Engine

  • LLM Deployment Steps

Transformer-based LLMs typically power online chatbots and code assistants; such models are usually built in frameworks like PyTorch and deployed to the cloud or to mobile devices. Here we introduce deploying LLMs on mobile devices with the TinyChat engine.
  • Introduction to TinyChat Engine
    • TinyChatEngine: An inference library designed for efficiently deploying quantized large language models (LLMs) on edge devices.
      • General Framework
      • No dependence on large libraries
      • High performance
      • Easy to use
    • TinyChat Engine includes:
      • Pure C/C++ implementation
      • LLM runtime control flow
      • Various device-specific kernels
      • Python toolchain for model conversion and quantization
    • Workflow for deploying models with TinyChat Engine, as shown in the figure below
[Figure: TinyChatEngine model deployment workflow]
  • Flexible backends and quantization methods:
    • W8A8 quantization process (SmoothQuant); see the sketch after this item

      – By default, activation values are stored as FP32 unless the device natively supports FP16

      – Uses int8 operations for most operator calculations

      – Provides dedicated kernels for different output precisions and activation functions

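To make the W8A8 flow concrete, below is a minimal C++ sketch of the idea, assuming symmetric per-tensor activation scales and per-output-channel weight scales; the function names are illustrative, not TinyChatEngine's actual API. Activations are quantized to int8 on the fly, the matmul accumulates int8 MACs in int32, and the FP32 output is recovered by multiplying the two scales.

```cpp
// Minimal W8A8 sketch (illustrative, not TinyChatEngine's actual API).
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Symmetric per-tensor quantization: x_q = round(x / scale), scale = max|x| / 127.
float quantize_act(const std::vector<float>& x, std::vector<int8_t>& xq) {
    float amax = 0.f;
    for (float v : x) amax = std::max(amax, std::fabs(v));
    float scale = amax / 127.f + 1e-12f;  // epsilon guards against all-zero input
    xq.resize(x.size());
    for (size_t i = 0; i < x.size(); ++i)
        xq[i] = (int8_t)std::lround(std::clamp(x[i] / scale, -127.f, 127.f));
    return scale;
}

// y[m] = act_scale * w_scale[m] * sum_k xq[k] * wq[m][k]  (int8 MACs, int32 accumulator)
void w8a8_linear(const std::vector<int8_t>& xq, float act_scale,
                 const std::vector<int8_t>& wq, const std::vector<float>& w_scale,
                 int in_dim, int out_dim, std::vector<float>& y) {
    y.assign(out_dim, 0.f);
    for (int m = 0; m < out_dim; ++m) {
        int32_t acc = 0;
        for (int k = 0; k < in_dim; ++k)
            acc += (int32_t)xq[k] * (int32_t)wq[m * in_dim + k];
        y[m] = act_scale * w_scale[m] * (float)acc;  // dequantize to FP32 output
    }
}
```
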
    • W4A16/W4A32 quantization process (AWQ); see the sketch after this item

      – Applies low-bit computation only to compute-intensive linear layers

      – Keeps the input and output of each operator as FP16/FP32

      – Provides high-performance int4 linear kernels for CPU and GPU

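The sketch below illustrates the W4A32 idea under the same caveat (illustrative names, not the library's real kernels): 4-bit weights are packed two per byte with one FP32 scale per group, activations stay in FP32, and only this compute-heavy linear operator runs in low bit.

```cpp
// Sketch of an AWQ-style W4A32 linear layer (illustrative, not the real kernels).
#include <cstdint>
#include <vector>

constexpr int kGroupSize = 128;  // a typical AWQ quantization group size (assumption)

// Assumes in_dim is even and divisible by kGroupSize.
// y[m] = sum_k x[k] * scale[m][k / kGroupSize] * (w4 - 8), w4 unpacked from bytes.
void w4a32_linear(const std::vector<float>& x,           // FP32 activations [in_dim]
                  const std::vector<uint8_t>& packed_w,  // [out_dim * in_dim / 2]
                  const std::vector<float>& scales,      // [out_dim * in_dim / kGroupSize]
                  int in_dim, int out_dim, std::vector<float>& y) {
    y.assign(out_dim, 0.f);
    int groups_per_row = in_dim / kGroupSize;
    for (int m = 0; m < out_dim; ++m) {
        float acc = 0.f;
        for (int k = 0; k < in_dim; ++k) {
            uint8_t byte = packed_w[(m * in_dim + k) / 2];
            int w4 = (k & 1) ? (byte >> 4) : (byte & 0x0F);  // two weights per byte
            float scale = scales[m * groups_per_row + k / kGroupSize];
            acc += x[k] * scale * (float)(w4 - 8);  // zero point 8 for unsigned int4
        }
        y[m] = acc;
    }
}
```
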
    • High-performance int4 linear operators

      – Utilizes int8 SIMD MAC operations to enhance performance

      – Achieves a 1.3× speedup on an Intel i7-9750H and a 3× speedup on an Apple M1 Pro

      – Improves efficiency through weight decoding and activation quantization

    • Device-specific weight reordering

      – Reorders weights at model-conversion time to match the device's SIMD bit width and kernel implementation

      – Eliminates runtime reordering to enhance performance

      – The figure and code sketch below demonstrate weight decoding with ARM NEON (128-bit-wide SIMD)

      [Figure: weight decoding with ARM NEON 128-bit SIMD]
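
A hedged sketch of that decoding step: because the converter has already laid the nibbles out in the order the kernel consumes them, runtime decoding reduces to one mask and one shift per 16 bytes. The NEON intrinsics below are real; the surrounding weight-layout assumptions are illustrative.

```cpp
// Decode 32 packed int4 weights (16 bytes) into two int8x16_t vectors.
// Assumes the offline converter placed low/high nibbles in the lane order
// this kernel expects, so no runtime shuffling is needed.
#include <arm_neon.h>
#include <cstdint>

inline void decode_int4_neon(const uint8_t* packed, int8x16_t& lo, int8x16_t& hi) {
    uint8x16_t raw  = vld1q_u8(packed);            // 16 bytes = 32 int4 weights
    uint8x16_t mask = vdupq_n_u8(0x0F);
    uint8x16_t lo_u = vandq_u8(raw, mask);         // low nibbles
    uint8x16_t hi_u = vshrq_n_u8(raw, 4);          // high nibbles
    int8x16_t  zp   = vdupq_n_s8(8);               // int4 zero point
    lo = vsubq_s8(vreinterpretq_s8_u8(lo_u), zp);  // center to [-8, 7]
    hi = vsubq_s8(vreinterpretq_s8_u8(hi_u), zp);
}

// The decoded int8 lanes can then feed int8 MAC instructions (e.g. vdotq_s32
// on CPUs with the dot-product extension) against int8-quantized activations,
// which is where the speedups cited above come from.
```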

  • Efficient memory management

    – Pre-allocates runtime buffers during initialization

    – Reuses memory buffers to reduce memory footprint

    – Utilizes unified memory on edge GPUs to reduce peak memory (e.g., Jetson Orin and Apple GPU)

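A minimal sketch of the pre-allocate-and-reuse pattern (a hypothetical helper, not TinyChatEngine's actual allocator): one arena is sized at initialization, buffers are bump-allocated from it during a forward pass, and a single reset reclaims everything between passes, so decoding steps never touch malloc/free.

```cpp
#include <cstddef>
#include <vector>

class BufferPool {
public:
    // Pre-allocate one arena sized for the worst-case activation footprint.
    explicit BufferPool(size_t bytes) : arena_(bytes), offset_(0) {}

    // Bump-allocate from the arena; no per-token heap traffic.
    void* alloc(size_t bytes) {
        size_t aligned = (bytes + 63) & ~size_t(63);  // round up to 64-byte multiples
        if (offset_ + aligned > arena_.size()) return nullptr;  // arena sized too small
        void* p = arena_.data() + offset_;
        offset_ += aligned;
        return p;
    }

    // Reset between forward passes: every buffer is "freed" at once and reused.
    void reset() { offset_ = 0; }

private:
    std::vector<unsigned char> arena_;
    size_t offset_;
};
```
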
  • Efficient LLM deployment on edge devices

      – TinyChatEngine achieves fast text generation with LLaMA2-7B on a variety of devices

      – Compared to PyTorch (FP32/FP16), TinyChatEngine delivers significant speedups

[Figure: LLaMA2-7B generation speed, TinyChatEngine vs. PyTorch FP32/FP16]
  • Demo for deploying LLaMA2 chatbot

    – Provides ready-to-use quantized models for download and deployment

    – Shows, step by step, how to download a model, compile the program, and launch the chatbot

Reference link: https://github.com/mit-han-lab/TinyChatEngine/blob/main/assets/slides.pdf

In Conclusion

To conclude, I recommend MIT HAN Lab's latest course, 6.5940, for Fall 2024 (the course is currently running).
Course link: https://efficientml.ai
Course materials: https://pan.quark.cn/s/1324c20d7efd
Your likes, views, and follows are my greatest motivation to continue!

Scan the QR code to add me, or add me on WeChat (ID: LiteAI01), to discuss technology, careers, and professional planning; please include a note with "Research Direction + School/Region + Name".

