A trend in artificial intelligence is rapidly moving from the “cloud” to the “edge”. TinyML brings machine learning to the microcontrollers inside the enormous fleet of IoT devices and is expected to become an important new industrial application of AI in the coming years. Edge devices are limited in computational resources and extremely sensitive to power consumption, so implementing AI models on them raises new challenges and opens up new applications. This article is the first in a series on TinyML, introducing the concept, the technology, and its future potential.
Led by NASA, the trend toward miniaturization has swept the entire consumer electronics industry. Now all of Beethoven’s works can fit on a lapel pin and be listened to through headphones. — Astrophysicist and science communicator Neil deGrasse Tyson
… The proliferation of ultra-low-power embedded devices, along with the introduction of embedded machine learning frameworks like TensorFlow Lite for Microcontrollers, means that AI-driven IoT devices will become widely adopted. — Harvard University Associate Professor Vijay Janapa Reddi
Figure 1 Overview of TinyML on embedded devices
Bigger models are not always better.
This article is the first in the TinyML series, aimed at introducing readers to the concept of TinyML and its future potential. Subsequent articles in this series will delve into specific applications, implementations, and tutorials.
Introduction
In the past decade, due to improvements in processor speeds and the emergence of big data, we have witnessed an exponential growth in the scale of machine learning algorithms. Initially, models were not large and ran on local computers using one or more CPU cores.
Shortly thereafter, GPU computing allowed for processing larger datasets, and through cloud-based services such as Google Colaboratory and Amazon EC2 Instances, GPU technology became more accessible. Meanwhile, algorithms could still run on a single machine.
Recently, dedicated ASICs and tensor processing units (TPUs) have provided processing power roughly equivalent to eight GPUs. These devices, together with the ability to distribute learning algorithms across multiple systems, have kept pace with the demands of increasingly large models.
The GPT-3 model released in May 2020 pushed model scale to an unprecedented level. Its network contains an astonishing 175 billion parameters, more than twice the roughly 85 billion neurons in the human brain and more than ten times the size of Turing-NLG. Turing-NLG, released in February 2020 and now the second-largest neural network, contains about 17 billion parameters. Training GPT-3 is estimated to have cost about 10 million dollars and consumed roughly 3 GWh of electricity, equivalent to the output of three nuclear power plants over one hour.
While the achievements of GPT-3 and Turing-NLG are impressive, they have naturally drawn criticism from industry insiders over the AI industry’s growing carbon footprint. At the same time, they have heightened interest in more energy-efficient computing. Ideas such as more efficient algorithms, data representations, and computation have for years been the focus of a seemingly unrelated field: TinyML.
TinyML is an intersection of machine learning and embedded IoT devices, representing an emerging engineering discipline with the potential to revolutionize many industries.
The primary beneficiaries of TinyML are edge computing and energy-efficient computing. TinyML grew out of the Internet of Things (IoT), where the traditional approach is to send data from local devices to the cloud for processing. This approach raises concerns about energy efficiency, privacy, storage, and latency:
- Energy Efficiency. Data transmission, whether wired or wireless, is very energy-hungry, roughly an order of magnitude more so than on-device computation using multiply-accumulate units (MAUs). The most energy-efficient approach is therefore an IoT system that processes data locally. This “data-centric” approach to computing, as opposed to the “compute-centric” cloud model, has been explored by AI pioneers and is now being put into practice.
- Privacy. Transmitting data carries privacy risks. Data can be intercepted by malicious actors, and storing it in a single location such as the cloud makes it inherently less secure. By keeping most of the data on the device, communication can be minimized, which improves security and privacy.
- Storage. Much of the data collected by IoT devices is useless. Imagine a security camera recording a building entrance 24 hours a day: for most of the day the footage captures nothing, because nothing unusual happens. A smarter system that activates only when necessary reduces the storage capacity required and, in turn, the amount of data that must be sent to the cloud.
- Latency. Standard IoT devices such as Amazon Alexa send data to the cloud for processing and then respond based on the algorithm’s output. In that sense the device is merely a convenient gateway to a cloud model, a carrier pigeon shuttling between the device and Amazon’s servers. The device itself is not intelligent, and response time depends entirely on internet performance: if the connection is slow, Alexa is slow. A smart IoT device with built-in automatic speech recognition reduces, or even eliminates, dependence on external communication and thus lowers latency.
The above issues are driving the development of edge computing. The concept of edge computing is to implement data processing functions on devices deployed at the “edge” of the cloud. These edge devices are highly constrained in memory, computation, and functionality, necessitating the development of more efficient algorithms, data structures, and computational methods.
Such improvements also apply to larger models and can deliver several orders of magnitude gains in machine learning efficiency without sacrificing accuracy. For example, Microsoft’s Bonsai algorithm can be as small as 2 KB yet outperform a typical 40 MB kNN model or a 4 MB neural network. This may sound unremarkable, but achieving the same accuracy with a model one ten-thousandth of the size is quite impressive. A model this small can run on an Arduino Uno, which has only 2 KB of RAM. In short, such machine learning models can now be built on a $5 microcontroller.
Machine learning is at a crossroads, with two computing paradigms advancing side by side: compute-centric computing and data-centric computing. In the compute-centric paradigm, data is stored and analyzed on instances in data centers; in the data-centric paradigm, processing happens where the data originates. Although the compute-centric paradigm seems to be nearing its limits, the data-centric paradigm is just getting started.
Currently, IoT devices and embedded machine learning models are becoming increasingly common; by the end of 2020 there were expected to be over 20 billion active devices. Many of these devices go unnoticed, such as smart doorbells, smart thermostats, and smartphones that can be “woken up” simply by speaking to them or picking them up. This article will delve into how TinyML works, as well as its current and future applications.
Figure 2 Hierarchical structure of the cloud.
TinyML Examples
Previously, various operations performed by devices had to rely on complex integrated circuits. Now, the hardware “intelligence” of machine learning is gradually being abstracted into software, making embedded devices simpler, lighter, and more flexible.
Implementing machine learning on embedded devices is hugely challenging, but significant progress has been made. The key obstacles to deploying neural networks on microcontrollers are the devices’ limited memory, power budget, and compute capability.
Smartphones are the most familiar example of TinyML. A phone is constantly listening for “wake words”, such as “Hey Google” on Android smartphones and “Hey Siri” on iPhones. Running the wake word service on the smartphone’s main CPU (a mainstream iPhone CPU runs at around 1.85 GHz) would drain the battery in just a few hours. That level of power consumption is unacceptable, especially since most people use the wake word service only a few times a day.
To solve this problem, developers have created dedicated low-power hardware that can be powered by small batteries (e.g., CR2032 button batteries). Even when the CPU is not running (usually indicated by the screen being off), the integrated circuit remains active.
Such a circuit can consume as little as 1 mW and run for up to a year on a standard CR2032 battery.
While this may not sound remarkable, it is a significant advance. Energy is the bottleneck for many electronic devices. Any device that requires mains power is tied to the location of its wiring, and deploying several devices in the same spot can quickly overload the supply. Mains power is also inefficient and costly: converting mains voltage (e.g., 120 V in the US) to the typical circuit voltage range (around 5 V) wastes a lot of energy. Anyone who has unplugged a warm laptop charger knows this: the heat from the charger’s internal transformer is energy wasted during voltage conversion.
Even if a device has its own battery, battery life is limited and requires frequent recharging. Many consumer electronic device batteries are designed to last for one workday. Some TinyML devices can operate continuously for a year on a coin-sized battery, allowing them to be deployed in remote environments and communicate only when necessary to save power.
In a smartphone, the wake word service is not the only seamlessly integrated TinyML application. Accelerometer data can be used to determine whether the user has just picked up the phone, thereby waking the CPU and lighting up the screen.
Clearly, these are not the only applications for TinyML. In fact, TinyML offers a plethora of exciting applications for product enthusiasts and businesses to create smarter IoT devices. In an era when data is becoming increasingly important, the ability to distribute machine learning resources to remote, memory-constrained devices presents tremendous opportunities in data-intensive industries such as agriculture, weather forecasting, and earthquake detection.
Undoubtedly, empowering edge devices to perform data-driven processing will transform the computing paradigm in industrial processes. For example, if it becomes possible to monitor crops and detect characteristics such as soil moisture, specific gases (e.g., ethylene released when apples ripen), or specific atmospheric conditions (e.g., strong winds, low temperatures, or high humidity), it would greatly enhance crop growth and increase yields.
Another example is installing cameras in smart doorbells to use facial recognition to identify visitors. This would enable security features and even allow the doorbell camera output to be displayed on the indoor television screen when someone arrives, so the homeowner can see who is at the door.
Currently, the two main focus areas for TinyML applications are:
- Keyword Spotting. Most people are already familiar with this application through keywords such as “Hey Siri” and “Hey Google”, often called “hot words” or “wake words”. The device continuously listens to audio from its microphone and is trained to respond only to specific sound sequences that match the learned keywords. These models are simpler and use fewer resources than full automatic speech recognition (ASR). Devices such as Google smartphones also use a cascade architecture with speaker verification to ensure security. A sketch of a keyword-spotting-sized model follows this list.
- Visual Wake Words. Visual wake words are the image-based analogue of wake words: a binary classification that indicates whether something is present or absent. For example, a smart lighting system can be designed to turn on when it detects a person and off when the person leaves. Likewise, wildlife photographers can use visual wake functionality to trigger the camera when a particular animal appears, and security cameras can start recording when they detect human activity.
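To make the scale concrete, the sketch below builds a keyword-spotting-sized convolutional network in Keras. It is a minimal illustration, not the architecture of any particular product: the 49x10 MFCC input shape, the four output classes, and the layer widths are all assumptions chosen to keep the parameter count tiny.

```python
# A minimal keyword-spotting-sized model; input shape, class count, and layer
# sizes are illustrative assumptions, not taken from the article.
import tensorflow as tf

def build_tiny_kws_model(input_shape=(49, 10, 1), num_classes=4):
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=input_shape),
        # Depthwise-separable convolutions keep the parameter count small,
        # which matters when the whole model must fit in a few tens of KB.
        tf.keras.layers.SeparableConv2D(8, 3, padding="same", activation="relu"),
        tf.keras.layers.MaxPooling2D(2),
        tf.keras.layers.SeparableConv2D(16, 3, padding="same", activation="relu"),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])

model = build_tiny_kws_model()
model.summary()  # only a few hundred parameters, small enough to quantize to a few KB
```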
The following diagram provides a comprehensive overview of current TinyML machine learning applications.
Figure 3 Machine learning use cases for TinyML. Image source: NXP
How TinyML Works
TinyML algorithms work in much the same way as traditional machine learning models, and training is typically done in the usual way on a workstation or in the cloud. The real TinyML work begins after training, in a process often referred to as “deep compression”.
Figure 4 Deep compression illustration, source: [ArXiv paper](https://arxiv.org/pdf/1510.00149.pdf).
Model Distillation
After training, models need to be modified to create more compact representations. The main techniques for achieving this process include pruning and knowledge distillation.
The basic idea of knowledge distillation is to exploit the sparsity or redundancy present in larger networks. A large network has high representational capacity, but if that capacity is not saturated, a smaller network (i.e., one with fewer neurons) can represent the same function. In the work published by Hinton et al. in 2015, the embedded information transferred from the teacher model to the student model is referred to as “dark knowledge”.
The following diagram illustrates the process of knowledge distillation:
Figure 5 Illustration of the knowledge distillation process
The teacher model in the diagram is a trained convolutional neural network model tasked with transferring its “knowledge” to a smaller convolutional network model called the student model, which has fewer parameters. This process, known as “knowledge distillation”, compresses the same knowledge into a smaller network for use on more memory-constrained devices.
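As a concrete illustration of this transfer, the snippet below sketches a standard distillation loss in TensorFlow. It assumes teacher and student logits for a batch are already available; the temperature of 4.0 and the 0.9 weighting are illustrative hyperparameters, not values taken from the article.

```python
# A minimal sketch of a knowledge distillation loss; temperature and alpha
# are illustrative hyperparameters.
import tensorflow as tf

def distillation_loss(teacher_logits, student_logits, labels,
                      temperature=4.0, alpha=0.9):
    # Soft targets: the teacher's temperature-softened class probabilities
    # carry the "dark knowledge" being transferred to the student.
    soft_teacher = tf.nn.softmax(teacher_logits / temperature)
    soft_student = tf.nn.log_softmax(student_logits / temperature)
    soft_loss = -tf.reduce_mean(tf.reduce_sum(soft_teacher * soft_student, axis=-1))

    # Hard targets: the usual cross-entropy against the true labels.
    hard_loss = tf.reduce_mean(tf.keras.losses.sparse_categorical_crossentropy(
        labels, student_logits, from_logits=True))

    # The temperature**2 factor keeps gradient magnitudes comparable,
    # as in Hinton et al. (2015).
    return alpha * (temperature ** 2) * soft_loss + (1.0 - alpha) * hard_loss
```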
Similarly, pruning helps produce a more compact model representation. Broadly speaking, pruning removes neurons that contribute little to the output predictions. It typically targets small-magnitude weights, while larger weights are retained because they matter more during inference. The network can then be retrained on the pruned architecture to fine-tune the output.
Figure 6 Illustration of pruning knowledge representations in distilled models
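The sketch below shows the simplest form of pruning, magnitude-based pruning, applied layer by layer to a Keras model. The 50% sparsity target is an arbitrary example value; in practice one would typically use a purpose-built tool such as the pruning utilities in the TensorFlow Model Optimization Toolkit and then retrain.

```python
# A minimal sketch of magnitude-based pruning on a Keras model; the sparsity
# level is an illustrative choice.
import numpy as np

def prune_smallest_weights(model, sparsity=0.5):
    for layer in model.layers:
        weights = layer.get_weights()
        if not weights:
            continue  # skip layers without trainable weights (pooling, etc.)
        kernel = weights[0]
        # Zero out the smallest-magnitude weights; they contribute least to inference.
        threshold = np.quantile(np.abs(kernel), sparsity)
        weights[0] = np.where(np.abs(kernel) < threshold, 0.0, kernel)
        layer.set_weights(weights)
    return model
```

After pruning, the network is usually fine-tuned for a few epochs so the remaining weights can compensate for the removed ones, as described above.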
Quantization
After distillation, models need to undergo quantization to form a format compatible with embedded device architectures.
Why quantize? Imagine an Arduino Uno, which is built around the ATmega328P microcontroller and uses 8-bit integer arithmetic. To run a model on the Uno, the model weights must be stored as 8-bit integers, unlike on most desktops and laptops, which use 32-bit or 64-bit floating-point representations. Quantizing the model cuts the storage size of the weights to a quarter (from 32 bits down to 8 bits), with minimal impact on accuracy, usually a drop of only 1-3%.
Figure 7 Illustration of the error introduced when 8-bit encodings are used to reconstruct 32-bit floating-point numbers. Image source: [TinyML](https://tinymlbook.com/)
Some information is inevitably lost in quantization: on an integer-only platform, for instance, a floating-point value of 3.42 may be truncated to 3. To mitigate this, quantization-aware training (QAT) has been proposed as an alternative. QAT essentially constrains the network during training to use only the values that will be available on the quantized device (see the TensorFlow documentation for examples).
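For reference, the sketch below shows what post-training full-integer quantization looks like with the TensorFlow Lite converter. It assumes a trained Keras `model` and a small array `calibration_samples` used as the representative dataset for choosing quantization ranges; both names are placeholders.

```python
# A minimal sketch of post-training int8 quantization with the TF Lite
# converter; `model` and `calibration_samples` are assumed to exist.
import tensorflow as tf

def representative_dataset():
    # A small, representative slice of input data lets the converter pick
    # per-tensor scales and zero points for the int8 encoding.
    for sample in calibration_samples[:100]:
        yield [sample[None, ...].astype("float32")]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Force full integer quantization so weights and activations are stored and
# computed in 8 bits, roughly a 4x reduction versus 32-bit floats.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
with open("model_int8.tflite", "wb") as f:
    f.write(tflite_model)
```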
Huffman Coding
Encoding is an optional step that stores the data in a maximally efficient way, further reducing model size. Huffman coding is commonly used for this purpose.
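The toy sketch below shows why this helps: after quantization, a few weight values occur far more often than others, so Huffman coding assigns them shorter bit codes. The example weight array is made up for illustration.

```python
# A toy sketch of Huffman coding over quantized weight values; the weight
# array is illustrative.
import heapq
from collections import Counter

def huffman_code_lengths(values):
    # Build a Huffman tree over value frequencies; return each value's code length.
    heap = [(count, [value]) for value, count in Counter(values).items()]
    heapq.heapify(heap)
    lengths = {value: 0 for value in set(values)}
    while len(heap) > 1:
        count1, group1 = heapq.heappop(heap)
        count2, group2 = heapq.heappop(heap)
        for value in group1 + group2:
            lengths[value] += 1  # every merge adds one bit to the code
        heapq.heappush(heap, (count1 + count2, group1 + group2))
    return lengths

weights = [0, 0, 0, 0, 0, 1, 1, 1, 2, 3]  # e.g. 8-bit quantized values
lengths = huffman_code_lengths(weights)
total_bits = sum(lengths[w] for w in weights)
print(f"{total_bits} bits with Huffman coding vs {8 * len(weights)} bits at 8 bits/weight")
```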
Compilation
After quantization and encoding, the model is converted into a format that a lightweight network interpreter can load; the most widely used are TF Lite (about 500 KB) and TF Lite Micro (about 20 KB). The model is then compiled into C or C++ code, the languages most microcontrollers can use with efficient memory handling, and is run on the device by the interpreter.
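One common way to get the model into firmware is to embed the quantized `.tflite` file as a C array, mimicking what the widely used `xxd -i model.tflite` command produces. The sketch below does this in Python; file and symbol names are illustrative.

```python
# A minimal sketch that embeds a .tflite model as a C array for firmware
# builds; file and symbol names are illustrative.
with open("model_int8.tflite", "rb") as f:
    data = f.read()

lines = ["alignas(16) const unsigned char g_model_data[] = {"]
for i in range(0, len(data), 12):
    chunk = ", ".join(f"0x{b:02x}" for b in data[i:i + 12])
    lines.append(f"  {chunk},")
lines.append("};")
lines.append(f"const unsigned int g_model_data_len = {len(data)};")

with open("model_data.cc", "w") as f:
    f.write("\n".join(lines) + "\n")
```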
Figure 8 Workflow diagram for TinyML applications. Source: [TinyML](https://tinymlbook.com/)
Most TinyML technologies are aimed at addressing the complexities of processing on microcontrollers. TF Lite and TF Lite Micro are very small because they remove all non-essential features. Unfortunately, they also remove some useful features, such as debugging and visualization. This means that if errors occur during deployment, it may be difficult to determine the cause.
Additionally, the model must not only be stored locally on the device but also support running inference there. This means the microcontroller must have enough memory to hold (1) the operating system and software libraries; (2) the neural network interpreter (such as TF Lite); (3) the stored neural network weights and architecture; and (4) intermediate results produced during inference. This is why TinyML research papers usually report peak memory usage for their quantization algorithms alongside overall memory usage, the number of multiply-accumulate operations (MACs), accuracy, and other metrics.
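As a rough illustration of that budgeting exercise, the sketch below checks whether a model fits a hypothetical microcontroller. All the numbers (256 KB flash, 64 KB RAM, a 20 KB interpreter, a 30 KB tensor arena for intermediate results) are assumptions chosen for the example, not measurements.

```python
# An illustrative flash/RAM budget check for an MCU deployment; every number
# here is an assumption for the sketch.
import os

FLASH_BYTES = 256 * 1024          # total on-chip flash
RAM_BYTES = 64 * 1024             # total on-chip RAM

interpreter_flash = 20 * 1024     # e.g. a TF Lite Micro style interpreter
model_flash = os.path.getsize("model_int8.tflite")  # stored weights + graph
tensor_arena_ram = 30 * 1024      # working memory for intermediate activations

print("fits in flash:", interpreter_flash + model_flash <= FLASH_BYTES)
print("fits in RAM:", tensor_arena_ram <= RAM_BYTES)
```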
Why Not Train on the Device?
Training on the device introduces additional complexities. With reduced numerical precision, it is extremely difficult to guarantee the accuracy needed to train a network. At the precision of a standard desktop computer, automatic differentiation is essentially exact: derivatives can be computed to an impressive precision of about 10⁻¹⁶. Performing automatic differentiation with 8-bit integers yields much poorer results. During backpropagation these derivatives are combined and ultimately used to update the network parameters, and at such low numerical precision the model’s accuracy can be severely degraded.
Despite these issues, some neural networks have been trained using 16-bit and 8-bit floating-point numbers.
The first paper to study reducing numerical precision in deep learning was published by Suyog Gupta and colleagues in 2015, titled “Deep Learning with Limited Numerical Precision”. The results presented in this paper are intriguing, as they demonstrate that it is possible to reduce 32-bit floating-point representations to 16-bit fixed-point representations with almost no loss in accuracy. However, this result only applies when using stochastic rounding, as it produces unbiased results on average.
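The sketch below illustrates what stochastic rounding does and why it is unbiased: each value is rounded up with probability equal to its fractional part, so rounding errors average out to zero over many operations. The example values are arbitrary.

```python
# A minimal sketch of stochastic rounding; example values are arbitrary.
import numpy as np

def stochastic_round(x, rng):
    floor = np.floor(x)
    frac = x - floor
    # Round up with probability equal to the fractional part, else round down.
    return floor + (rng.random(x.shape) < frac)

rng = np.random.default_rng(0)
x = np.full(100_000, 0.3)
print(stochastic_round(x, rng).mean())  # ~0.3 on average; plain rounding would give 0.0
```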
In a paper published by Naigang Wang and colleagues in 2018, titled “Training Deep Neural Networks with 8-bit Floating Point Numbers”, neural networks were trained using 8-bit floating-point numbers. Training using 8-bit numbers is significantly more challenging than inference since it requires maintaining fidelity during backpropagation to achieve machine precision during automatic differentiation.
How is Computational Efficiency Achieved?
Model computational efficiency can also be improved by tailoring the model itself. A good example is MobileNet V1 and V2, model architectures widely deployed on mobile devices. They are essentially convolutional neural networks that recast the convolution operation into a more computationally efficient form known as depthwise separable convolution. Techniques such as hardware-based profiling and neural architecture search can also be used to optimize for latency on a target architecture, but this article will not go into them.
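The sketch below contrasts a standard convolution with its depthwise separable counterpart in Keras to show where the savings come from; the 96x96x32 input and 64 output channels are illustrative.

```python
# A minimal comparison of standard vs depthwise separable convolution;
# the shapes are illustrative.
import tensorflow as tf

inputs = tf.keras.Input(shape=(96, 96, 32))

standard = tf.keras.Model(
    inputs, tf.keras.layers.Conv2D(64, 3, padding="same")(inputs))
separable = tf.keras.Model(
    inputs, tf.keras.layers.SeparableConv2D(64, 3, padding="same")(inputs))

# The separable version factors the convolution into a per-channel 3x3 spatial
# filter plus a 1x1 pointwise mix of channels.
print("standard parameters: ", standard.count_params())   # 3*3*32*64 + 64 = 18,496
print("separable parameters:", separable.count_params())  # 3*3*32 + 32*64 + 64 = 2,400
```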
A New Wave of AI Revolution
Running machine learning models on resource-constrained devices opens up many new applications. Advances in making standard machine learning more energy-efficient will help alleviate some concerns about the environmental impact of data science. Additionally, TinyML supports embedded devices equipped with data-driven algorithms, which can be applied in various scenarios ranging from preventive maintenance to detecting bird calls in forests.
While continuing to scale up models is a firm direction for some machine learning practitioners, the development of machine learning algorithms that are more memory, computation, and energy-efficient is also a new trend. TinyML is still in its infancy, with few experts in this direction. The references in this article list some important papers in the field of TinyML, and interested readers are encouraged to read them. This field is rapidly growing and is expected to become an important new application of AI in the industrial sector in the coming years. Stay tuned.
Author Bio
Matthew Stewart, PhD student in Environmental and Data Science at Harvard University, Machine Learning Advisor at Critical Future, personal blog: https://mpstewart.net
References
[1] Hinton, Geoffrey & Vinyals, Oriol & Dean, Jeff. (2015). Distilling the Knowledge in a Neural Network.
[2] D. Bankman, L. Yang, B. Moons, M. Verhelst and B. Murmann, “An always-on 3.8μJ/86% CIFAR-10 mixed-signal binary CNN processor with all memory on chip in 28nm CMOS,” 2018 IEEE International Solid-State Circuits Conference — (ISSCC), San Francisco, CA, 2018, pp. 222–224, doi: 10.1109/ISSCC.2018.8310264.
[3] Warden, P. (2018). Why the Future of Machine Learning is Tiny. Pete Warden’s Blog.
[4] Ward-Foxton, S. (2020). AI Sound Recognition on a Cortex-M0: Data is King. EE Times.
[5] Levy, M. (2020). Deep Learning on MCUs is the Future of Edge Computing. EE Times.
[6] Gruenstein, Alexander & Alvarez, Raziel & Thornton, Chris & Ghodrat, Mohammadali. (2017). A Cascade Architecture for Keyword Spotting on Mobile Devices.
[7] Kumar, A., Saurabh Goyal, and M. Varma. (2017). Resource-efficient Machine Learning in 2 KB RAM for the Internet of Things.
[8] Zhang, Yundong & Suda, Naveen & Lai, Liangzhen & Chandra, Vikas. (2017). Hello Edge: Keyword Spotting on Microcontrollers.
[9] Fedorov, Igor & Stamenovic, Marko & Jensen, Carl & Yang, Li-Chia & Mandell, Ari & Gan, Yiming & Mattina, Matthew & Whatmough, Paul. (2020). TinyLSTMs: Efficient Neural Speech Enhancement for Hearing Aids.
[10] Lin, Ji & Chen, Wei-Ming & Lin, Yujun & Cohn, John & Gan, Chuang & Han, Song. (2020). MCUNet: Tiny Deep Learning on IoT Devices.
[11] Chen, Tianqi & Moreau, Thierry. (2020). TVM: An Automated End-to-End Optimizing Compiler for Deep Learning.
[12] Weber, Logan, and Reusch, Andrew (2020). TinyML — How TVM is Taming Tiny.
[13] Krishnamoorthi, Raghuraman. (2018). Quantizing deep convolutional networks for efficient inference: A whitepaper.
[14] Yosinski, Jason & Clune, Jeff & Bengio, Y. & Lipson, Hod. (2014). How transferable are features in deep neural networks?.
[15] Lai, Liangzhen & Suda, Naveen & Chandra, Vikas. (2018). CMSIS-NN: Efficient Neural Network Kernels for Arm Cortex-M CPUs.
[16] Chowdhery, Aakanksha & Warden, Pete & Shlens, Jonathon & Howard, Andrew. (2019). Visual Wake Words Dataset.
Original link:
https://towardsdatascience.com/tiny-machine-learning-the-next-ai-revolution-495c26463868