Ten Influential Papers in the Field of Embedded AI
Abstract
Embedded AI is growing rapidly: between 2022 and 2025, key directions such as model compression, edge computing, neural architecture search, and low-power algorithms all saw significant breakthroughs. This article selects ten influential papers, several of which won best paper awards at top conferences and which have also shaped practical deployments, spanning the range from 70B-parameter large models down to microcontrollers. The papers cover AWQ quantization for deploying large models on mobile devices, MCUNetV2 pushing the limits of visual recognition on microcontrollers, and the application of spiking neural networks to ultra-low-power computing.
1. Revolutionary Breakthroughs in Model Compression and Quantization Technologies
1. AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
Title (Chinese): AWQ: Activation-aware Weight Quantization for Large Language Model Compression and Acceleration
Authors and Year: Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, Song Han (2024)
Published Conference: MLSys 2024 (Best Paper Award)
Main Contributions and Innovations: This work proposes an activation-aware weight quantization method that significantly reduces quantization error by protecting only about 1% of the weights, the salient ones. It identifies the salient channels from the activation distribution rather than from the weights themselves, and protects them through a mathematically equivalent per-channel scaling, so that 4-bit weight quantization is achieved without any backpropagation. AWQ also reports, for the first time, successful quantization of multimodal large language models.
Importance Analysis: AWQ enables the 70B-parameter Llama-2 model to run on a mobile GPU (NVIDIA Jetson Orin) and achieves roughly a 3x speedup over an FP16 baseline. The technique has been adopted by NVIDIA TensorRT-LLM, AMD, Google Vertex AI, and Amazon SageMaker, and AWQ-quantized models have over 6 million downloads on HuggingFace, making it a key technology for deploying large models to edge devices.
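The activation-aware scaling described above can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names (quantize_rtn, output_error, search_scales), the toy data, and the per-row quantization are assumptions made for illustration. Each input channel of the weight matrix is multiplied by a scale derived from the average activation magnitude, the activation is divided by the same scale (so the full-precision output is unchanged), and an exponent alpha is chosen by grid search on calibration inputs, with no gradients involved.

```python
import random

def quantize_rtn(row, bits=4):
    """Symmetric round-to-nearest quantization of one weight row, returned dequantized."""
    qmax = 2 ** (bits - 1) - 1
    step = max(abs(w) for w in row) / qmax
    if step == 0.0:
        return list(row)
    return [round(w / step) * step for w in row]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def output_error(W, X, scales, bits=4):
    """Mean squared error between W x and Q(W * s) (x / s) over calibration inputs."""
    Wq = [quantize_rtn([w * s for w, s in zip(row, scales)], bits) for row in W]
    err = 0.0
    for x in X:
        ref = matvec(W, x)
        out = matvec(Wq, [xi / s for xi, s in zip(x, scales)])
        err += sum((a - b) ** 2 for a, b in zip(ref, out))
    return err / len(X)

def search_scales(W, X, bits=4, grid=20):
    """Grid-search alpha in [0, 1]; scales are (mean |activation|) ** alpha per channel."""
    n_in = len(X[0])
    act = [sum(abs(x[j]) for x in X) / len(X) for j in range(n_in)]
    best = None
    for k in range(grid + 1):
        alpha = k / grid
        scales = [max(a, 1e-8) ** alpha for a in act]
        err = output_error(W, X, scales, bits)
        if best is None or err < best[0]:
            best = (err, alpha, scales)
    return best

random.seed(0)
n_in, n_out = 8, 4
W = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
# Toy calibration set: channel 0 carries activations about 20x larger than the others.
X = [[random.gauss(0, 20.0 if j == 0 else 1.0) for j in range(n_in)] for _ in range(32)]

rtn_err = output_error(W, X, [1.0] * n_in)    # plain round-to-nearest, no scaling
awq_err, alpha, scales = search_scales(W, X)  # activation-aware scaling
print(f"RTN error {rtn_err:.4f}, scaled error {awq_err:.4f}, alpha {alpha}")
```

Because alpha = 0 reproduces plain round-to-nearest, the searched result can never be worse than the unscaled baseline on the calibration data; how much it improves depends on how strongly a few channels dominate the activations.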
2. SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
Title (Chinese): SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
Authors and Year: Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, Song Han (2023)
Published Conference: ICML 2023
Main Contributions and Innovations: A training-free method that achieves W8A8 (8-bit weights, 8-bit activations) quantization of large models while maintaining accuracy. The core observation is that weights are easy to quantize while activations are not. SmoothQuant therefore migrates the quantization difficulty from activations to weights by smoothing activation outliers with a mathematically equivalent transformation, which enables INT8 quantization of all matrix multiplications.
Importance Analysis: This technology allows a 530B-parameter model to be served on a single node, with up to a 1.56x speedup and a 2x reduction in memory. It has been applied successfully to mainstream model families such as OPT, BLOOM, and LLaMA and integrated into production systems like FasterTransformer, laying the foundation for subsequent quantization methods.
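The smoothing step described above can be sketched in a few lines. The per-channel factor s_j = max|X_j|^alpha / max|W_j|^(1 - alpha) follows the formula given in the SmoothQuant paper; the function names and the toy matrices below are assumptions for illustration only, and the sketch leaves out the INT8 kernels entirely.

```python
def smooth_scales(X, W, alpha=0.5):
    """Per-input-channel factors s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)."""
    scales = []
    for j in range(len(W)):
        act_max = max(abs(x[j]) for x in X)
        w_max = max(abs(w) for w in W[j])
        scales.append(act_max ** alpha / w_max ** (1 - alpha))
    return scales

def matmul(X, W):
    return [[sum(x[j] * W[j][k] for j in range(len(W))) for k in range(len(W[0]))]
            for x in X]

# Toy data: input channel 1 has activation outliers and small weights.
X = [[0.5, 60.0, -0.2], [-0.4, -80.0, 0.3]]    # tokens x input channels
W = [[0.4, -0.3], [0.02, 0.05], [-0.5, 0.2]]   # input channels x output channels

s = smooth_scales(X, W)
X_s = [[x[j] / s[j] for j in range(len(s))] for x in X]   # smoothed activations
W_s = [[w * s[j] for w in W[j]] for j in range(len(s))]   # compensated weights

Y, Y_s = matmul(X, W), matmul(X_s, W_s)
act_range = max(abs(v) for x in X for v in x)
act_range_s = max(abs(v) for x in X_s for v in x)
print(f"activation range {act_range} -> {act_range_s:.3f}")
```

The product is unchanged up to floating-point rounding, while the largest activation magnitude shrinks from 80 to about 2. The weight range grows in exchange, which is the intended trade: weights are quantized offline and tolerate the extra range better.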
3. QLoRA: Efficient Finetuning of Quantized LLMs
Title (Chinese): QLoRA: Efficient Finetuning of Quantized Large Language Models
Authors and Year: Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer (2023)
Published Conference: NeurIPS 2023
Main Contributions and Innovations: Combines a 4-bit quantized pre-trained model with low-rank adapters (LoRA) and proposes three key techniques: the 4-bit NormalFloat (NF4) data type, double quantization (quantizing the quantization constants themselves), and paged optimizers. NF4 is an information-theoretically optimal quantization format for normally distributed weights. Together, these techniques enable a 65B-parameter model to be fine-tuned on a single 48GB GPU.
Importance Analysis: QLoRA democratizes large-model fine-tuning, bringing customization of large models within reach of a single GPU. The authors fine-tuned over 1,000 models across 8 instruction datasets, and their best model family, Guanaco, reaches 99.3% of ChatGPT’s performance level on the Vicuna benchmark. The method is now widely used for efficient fine-tuning in both research and industry.
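A back-of-the-envelope calculation shows why a 65B-parameter base model fits on a 48GB GPU. The sketch below assumes the block sizes reported in the QLoRA paper (64 weights per quantization block, 256 constants per second-level block, 8-bit second-level constants); the function names are hypothetical, and the figures cover the frozen base weights only, not the adapters, activations, or optimizer state.

```python
def bits_per_param(weight_bits=4, block=64, const_bits=32,
                   double_quant=False, dq_block=256, dq_const_bits=8):
    """Storage cost per weight, including the per-block quantization constants."""
    if not double_quant:
        # One 32-bit constant shared by every `block` weights.
        return weight_bits + const_bits / block
    # Constants stored in 8 bits, plus one 32-bit constant per `dq_block` constants.
    return weight_bits + dq_const_bits / block + const_bits / (block * dq_block)

def weight_memory_gb(n_params, bits):
    return n_params * bits / 8 / 1e9

n_params = 65e9
fp16_gb = weight_memory_gb(n_params, 16)
nf4_gb = weight_memory_gb(n_params, bits_per_param())
nf4_dq_gb = weight_memory_gb(n_params, bits_per_param(double_quant=True))

print(f"FP16: {fp16_gb:.1f} GB")
print(f"4-bit, FP32 constants: {nf4_gb:.1f} GB")
print(f"4-bit, double quantization: {nf4_dq_gb:.1f} GB")
```

Under these assumptions double quantization lowers the overhead from 0.5 to about 0.127 bits per parameter, which saves roughly 3 GB at 65B parameters and leaves more headroom for adapters and activations.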
2. Breakthroughs in TinyML and Edge Computing
4. MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning
Title (Chinese): MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning
Authors and Year: Ji Lin, Wei-Ming Chen, Han Cai, Chuang Gan, Song Han (Published in 2021, ongoing development in 2022-2023)
Published Conference: NeurIPS 2021
Main Contributions and Innovations: Introduces patch-based inference to overcome the memory bottleneck of deep learning on microcontrollers, reducing the peak memory usage of existing networks by 4-8 times. Through the co-design of neural architecture search (TinyNAS) and a memory-efficient inference engine (TinyEngine), ImageNet-scale inference is achieved on microcontrollers with only 256KB of memory.
Importance Analysis: This is the first system to achieve over 70% top-1 accuracy on ImageNet on commercial microcontrollers, being 2.4-3.4 times faster than MobileNetV2 in visual wake word detection. This work establishes a new benchmark for AI inference on microcontrollers, making “always-on” AI applications possible.
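The idea behind patch-based inference can be sketched with a single convolution layer. This is a simplified illustration, not TinyEngine: the function names are hypothetical, only one layer is shown (MCUNetV2 applies the schedule to the memory-heavy initial stage of the network), and the input is assumed to be readable region by region. Each output tile is computed from an input crop that includes a halo of kernel size minus one, so only a small crop needs to be resident at any time.

```python
def conv2d_valid(img, kernel):
    """Plain 'valid' 2D convolution (cross-correlation) on nested lists."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(img) - kh + 1, len(img[0]) - kw + 1
    return [[sum(img[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)] for i in range(out_h)]

def conv2d_patched(img, kernel, patch=2):
    """Same result, computed one output tile at a time from a small input crop."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(img) - kh + 1, len(img[0]) - kw + 1
    out = [[0] * out_w for _ in range(out_h)]
    peak_input_values = 0
    for i0 in range(0, out_h, patch):
        for j0 in range(0, out_w, patch):
            i1, j1 = min(i0 + patch, out_h), min(j0 + patch, out_w)
            # The crop covers the tile plus a halo of (kernel size - 1) rows and columns.
            crop = [row[j0:j1 + kw - 1] for row in img[i0:i1 + kh - 1]]
            peak_input_values = max(peak_input_values, len(crop) * len(crop[0]))
            tile = conv2d_valid(crop, kernel)
            for di, tile_row in enumerate(tile):
                out[i0 + di][j0:j0 + len(tile_row)] = tile_row
    return out, peak_input_values

image = [[(3 * i + 5 * j) % 7 for j in range(6)] for i in range(6)]
kernel = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]

full = conv2d_valid(image, kernel)
patched, peak = conv2d_patched(image, kernel, patch=2)
print(patched == full, peak, len(image) * len(image[0]))
```

The halo means neighbouring patches recompute some overlapping input, so the memory saving is paid for with extra computation; the paper addresses this overhead by redistributing the receptive field across the network.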
5. Edge Impulse: An MLOps Platform for Tiny Machine Learning
Title (Chinese): Edge Impulse: An MLOps Platform for Tiny Machine Learning
Authors and Year: Colby R. Banbury et al. (2022)
Published Journal: arXiv preprint, TinyML Community Conference
Main Contributions and Innovations: The first complete MLOps platform designed for large-scale TinyML development, addressing the challenges of fragmented software stacks and heterogeneous deployments in embedded ML. It provides full TinyML design cycle support from data collection to deployment, enabling scalable deployment across hardware platforms.
Importance Analysis: As of October 2022, the platform hosted 118,185 projects from 50,953 developers worldwide, used for education by the TinyML4D academic network in 48 countries. This platform democratizes TinyML development, eliminating the need for specialized embedded ML knowledge and becoming a key infrastructure driving the widespread application of TinyML.
3. Innovations in Neural Architecture Search and Hardware Acceleration
6. H4H-NAS: Hybrid CNN-Transformer Architecture Search for NPU-CIM Heterogeneous Systems
Title (Chinese): H4H-NAS: Hybrid CNN-Transformer Architecture Search for NPU-CIM Heterogeneous Systems
Authors and Year: Multiple researchers from AR/VR hardware and AI teams (2024)
Published Conference: DAC 2024 (61st Design Automation Conference)
Main Contributions and Innovations: The first neural architecture search framework designed for the co-design of hybrid CNN/Vision Transformer models for heterogeneous NPU-CIM (Neural Processing Unit + Compute-in-Memory) systems. It builds performance estimators using real ARM Ethos-U55 silicon data and MRAM macro units, achieving algorithm/hardware co-design.
Importance Analysis: Compared to pure NPU systems, it achieves up to 56.08% latency reduction and 41.72% energy efficiency improvement. This work systematically addresses the optimization problem of hybrid CNN/ViT in heterogeneous edge computing, providing a practical framework for ultra-low-latency inference in AR/VR applications.
7. MicroNAS: Zero-Shot Neural Architecture Search for MCUs
Title (Chinese): MicroNAS: Zero-Shot Neural Architecture Search for Microcontrollers
Authors and Year: Multiple research teams (2024)
Published Conference/Journal: IEEE Conference, Scientific Reports, arXiv
Main Contributions and Innovations: A hardware-aware zero-shot NAS framework that does not require expensive architecture training during the search process. By integrating dedicated performance metrics and dynamic convolution, it achieves a 1104x improvement in search efficiency compared to traditional methods. It can obtain state-of-the-art results within 3.6 hours on a standard laptop.
Importance Analysis: This technology improves MCU inference speed by 3.23 times and has been successfully deployed on ultra-low-power MCUs (STM32 L series). MicroNAS democratizes NAS technology for resource-constrained IoT and wearable devices, bridging the critical gap between powerful NAS technology and the limitations of embedded devices.
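The overall shape of a hardware-aware zero-shot search can be sketched as follows. Everything here is a placeholder assumption rather than the MicroNAS method: the cost model in estimate is a toy formula, the budgets are arbitrary, and the training-free score (log of parameter count) merely stands in for the dedicated proxy metrics the paper uses. The point is only the control flow: candidates are filtered by MCU constraints and ranked without training any of them.

```python
from itertools import product
from math import log

def estimate(width, depth, resolution=32):
    """Toy cost model for a stack of 3x3 convolutions (hypothetical, not measured)."""
    params = depth * 9 * width * width            # int8 weights, one byte each
    flops = params * resolution * resolution
    peak_activation_bytes = 2 * width * resolution * resolution  # input + output map
    return params, flops, peak_activation_bytes

def zero_shot_search(widths, depths, sram_budget, flash_budget):
    """Return (score, width, depth) of the best candidate that fits the MCU budgets."""
    best = None
    for width, depth in product(widths, depths):
        params, _, peak = estimate(width, depth)
        if peak > sram_budget or params > flash_budget:
            continue  # does not fit on the target device
        score = log(params)  # placeholder proxy; no candidate is ever trained
        if best is None or score > best[0]:
            best = (score, width, depth)
    return best

best = zero_shot_search(widths=[8, 16, 32, 64], depths=[2, 4, 8],
                        sram_budget=64 * 1024, flash_budget=256 * 1024)
print(best)
```

Because scoring a candidate costs only a formula evaluation instead of a training run, the whole search space can be swept in moments, which is where the large search-efficiency gains of zero-shot approaches come from.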
4. Cutting-edge Exploration of Low-Power AI Algorithms
8. Examining the Robustness of Spiking Neural Networks on Non-ideal Memristive Crossbars
Title (Chinese): Examining the Robustness of Spiking Neural Networks on Non-ideal Memristive Crossbars
Authors and Year: Abhiroop Bhattacharjee, Youngeun Kim, Abhishek Moitra, Priyadarshini Panda (2022)
Published Conference: ISLPED 2022 (Best Paper Award)
Main Contributions and Innovations: The first comprehensive analysis of the robustness of spiking neural networks (SNNs) on non-ideal memristive crossbars. It finds that repeated crossbar computations can lead to error accumulation during SNN inference, demonstrating that SNNs with fewer time steps achieve better accuracy on memristive crossbars.
Importance Analysis: This work addresses the practical challenges of deploying SNNs on emerging hardware, which is crucial for ultra-low-power edge applications. SNNs are expected to be 10-100 times more energy-efficient than traditional neural networks, making them a key technology for achieving sub-milliwatt AI inference.
9. SEENN: Towards Temporal Spiking Early Exit Neural Networks
Title (Chinese): SEENN: Towards Temporal Spiking Early Exit Neural Networks
Authors and Year: Yuhang Li, Tamar Geller, Youngeun Kim, Priyadarshini Panda (2023)
Published Conference: NeurIPS 2023
Main Contributions and Innovations: The first combination of early exit mechanisms with spiking neural networks, proposing a temporal early exit strategy based on spiking dynamics. It dynamically allocates computational resources in the time domain based on input complexity, pioneering a new paradigm of temporal adaptive computation.
Importance Analysis: This work combines two major energy-saving paradigms (early exit and spiking networks), so that energy consumption adapts to input complexity. Simple inputs can be classified in under a millisecond, while complex inputs receive more time steps and still maintain high accuracy, opening new directions for edge applications that require adaptive computation.
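A temporal early exit can be sketched as a loop over time steps that stops as soon as the accumulated output is confident enough. The sketch below assumes a simple confidence-threshold rule on the softmax of the time-averaged output; the function names, the threshold, and the hand-written per-step outputs are illustrative assumptions, and the exit criteria in the paper may differ in detail.

```python
from math import exp

def softmax(z):
    m = max(z)
    e = [exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]

def early_exit_predict(step_outputs, threshold=0.9):
    """Accumulate per-time-step outputs; stop once the prediction is confident.

    Returns (predicted class, number of time steps actually used).
    """
    acc = [0.0] * len(step_outputs[0])
    for t, out in enumerate(step_outputs, start=1):
        acc = [a + o for a, o in zip(acc, out)]
        probs = softmax([a / t for a in acc])  # time-averaged output so far
        conf = max(probs)
        if conf >= threshold or t == len(step_outputs):
            return probs.index(conf), t

# Hypothetical per-step outputs of a 3-class SNN with at most 4 time steps.
easy = [[6.0, 0.0, 0.0]] * 4
hard = [[1.0, 0.8, 0.0], [0.2, 3.0, 0.0], [0.0, 7.0, 0.0], [0.0, 8.0, 0.0]]

print(early_exit_predict(easy))
print(early_exit_predict(hard))
```

With these inputs the easy sample exits after the first time step, while the ambiguous one needs three. Since SNN energy scales with the number of time steps simulated, the average cost tracks the difficulty of the input stream.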
10. SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression
Title (Chinese): SpQR: A Sparse-Quantized Representation for Near-Lossless Large Language Model Weight Compression
Authors and Year: Tim Dettmers, Ruslan Svirschevski et al. (2024)
Published Conference: ICLR 2024
Main Contributions and Innovations: Presented as the first method to achieve near-lossless compression of large language models to 3-4 bits per parameter. Its sparse-quantized representation identifies and isolates outlier weights, which cause disproportionately large quantization errors, and stores them at higher precision while compressing all other weights to 3-4 bits. On LLaMA and Falcon models, the relative loss in perplexity is less than 1%.
Importance Analysis: This technology enables a 33B-parameter model to run on a single 24GB consumer-grade GPU with a 15% speedup and no performance degradation. It achieves over 4 times memory compression relative to the 16-bit baseline, making powerful large language models accessible on consumer-grade hardware.
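The effect of isolating outliers can be seen in a small sketch. It is a simplification under stated assumptions: outliers are picked by magnitude here, whereas SpQR selects them by their impact on quantization error, and the function names and the eight-weight group are made up for illustration. The outliers go into a sparse index-to-value map at full precision, and the remaining weights share a 3-bit min-max grid.

```python
def quantize_minmax(values, bits=3):
    """Uniform min-max quantization of a group, returned dequantized."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / (2 ** bits - 1)
    if step == 0.0:
        return list(values)
    return [lo + round((v - lo) / step) * step for v in values]

def sparse_quantize(weights, bits=3, n_outliers=1):
    """Keep the largest-magnitude weights exact (sparse), quantize the rest (dense)."""
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    outliers = {i: weights[i] for i in order[:n_outliers]}
    dense_idx = [i for i in range(len(weights)) if i not in outliers]
    dense_q = quantize_minmax([weights[i] for i in dense_idx], bits)
    recon = [0.0] * len(weights)
    for i, q in zip(dense_idx, dense_q):
        recon[i] = q
    for i, w in outliers.items():
        recon[i] = w
    return recon, outliers

def max_error(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

weights = [0.1, -0.2, 0.15, 8.0, -0.05, 0.3, -0.25, 0.12]

dense_only = quantize_minmax(weights, bits=3)
recon, outliers = sparse_quantize(weights, bits=3, n_outliers=1)
err_dense = max_error(weights, dense_only)
err_sparse = max_error(weights, recon)
print(f"max error without outlier isolation {err_dense:.3f}, with {err_sparse:.3f}")
```

A single large weight stretches the quantization grid so far that every small weight in the group collapses onto one or two levels; removing it from the dense group restores a fine grid for the rest at the cost of storing one extra index and value.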
Technological Integration and Future Prospects of Embedded AI
Research in embedded AI between 2022 and 2025 shows three core trends. First, hardware-aware design has become mainstream: co-design of algorithms and hardware is now the standard paradigm, from AWQ’s mobile GPU optimization to MicroNAS’s MCU customization. Second, extreme compression has pushed the boundaries of model deployment, with SpQR’s near-lossless 3-4 bit quantization bringing 33B-parameter models to a single consumer GPU and MCUNetV2’s patch-based inference bringing ImageNet-scale vision to 256KB microcontrollers. Finally, the ecosystem is maturing, as evidenced by the more than 110,000 projects hosted on the Edge Impulse platform, which signals TinyML’s transition from the lab toward large-scale industrial use.