Source: AI Frontline
The trend of miniaturization pioneered by NASA swept the entire consumer electronics industry. Now the complete works of Beethoven can be stored on a pin and listened to through headphones. — Astrophysicist and science commentator Neil deGrasse Tyson
… The proliferation of ultra-low-power embedded devices, along with the introduction of embedded machine learning frameworks like TensorFlow Lite for microcontrollers, means that AI-driven IoT devices will become widespread. — Harvard University Associate Professor Vijay Janapa Reddi

Figure 1 Overview of TinyML on embedded devices
Bigger models are not always better.
This article is the first in a series on TinyML, aimed at introducing readers to the concept of TinyML and its future potential. Subsequent articles in this series will delve into specific applications, implementations, and tutorials.
Introduction
Over the past decade, due to advances in processor speeds and the emergence of big data, we have witnessed an exponential growth in the scale of machine learning algorithms. Initially, the models were not large, running on local computers using one or more CPU cores.
Shortly thereafter, GPU computing allowed for the processing of larger datasets, and cloud-based services like Google Colaboratory and Amazon EC2 Instances made GPU technology more accessible. Meanwhile, algorithms could still run on a single machine.
Recently, dedicated ASICs and TPUs have provided processing power equivalent to about 8 GPUs. The development of these devices has enhanced the ability to distribute learning algorithms across multiple systems to meet the increasing demands of larger models.
The release of GPT-3 in May 2020 pushed model scale to unprecedented levels. The GPT-3 network contains an astonishing 175 billion parameters, more than double the roughly 85 billion neurons in the human brain (a loose comparison, since parameters and neurons are different quantities), and more than ten times the parameter count of Turing-NLG. Turing-NLG, released in February 2020 and now the second-largest neural network ever built, contains about 17 billion parameters. Training GPT-3 is estimated to have cost about 10 million dollars and consumed about 3 GWh of electricity, roughly the output of three nuclear power plants running for one hour.
Despite their achievements, GPT-3 and Turing-NLG have naturally drawn criticism from industry insiders over the growing carbon footprint of AI. They have also fueled interest in more energy-efficient computing within the artificial intelligence community. In recent years, ideas such as more efficient algorithms, data representations, and computation have been the focus of a seemingly unrelated field: TinyML.
TinyML is an interdisciplinary field that intersects machine learning and embedded IoT devices, representing an emerging engineering discipline with the potential to revolutionize many industries.
The primary beneficiaries of TinyML are the fields of edge computing and energy-efficient computing. TinyML grew out of the Internet of Things (IoT). The traditional IoT approach is to send data from local devices to the cloud for processing, an approach that raises concerns around energy efficiency, privacy, storage, and latency:
- Energy efficiency. Data transmission is very energy-intensive, whether wired or wireless, consuming roughly an order of magnitude more energy than local computation using multiply-accumulate units (MAUs). The most energy-efficient approach is to build IoT systems with local data-processing capability. This "data-centric" computing paradigm, long explored by AI pioneers, is now being put into practice.
- Privacy. Data transmission carries privacy risks. Data may be intercepted by malicious actors, and storing it in a single location such as the cloud also weakens its inherent security. Keeping most of the data on the device minimizes communication, improving both security and privacy.
- Storage. Much of the data collected by IoT devices is useless. Imagine a security camera recording a building entrance 24 hours a day: for most of the day nothing anomalous happens and the footage serves no purpose. Smarter systems that activate only when necessary reduce storage requirements and cut the amount of data that must be sent to the cloud.
- Latency. Standard IoT devices such as Amazon Alexa transmit data to the cloud for processing and then relay the algorithm's output as a response. In this sense the device is merely a convenient gateway to the cloud, a carrier pigeon between the user and Amazon's servers. The device itself is not intelligent, and response time depends entirely on internet performance: if the connection is slow, Amazon Alexa slows down too. Smart IoT devices with built-in automatic speech recognition reduce or even eliminate this dependence on external communication, thereby lowering latency.
These issues are driving the development of edge computing. The idea of edge computing is to implement data processing capabilities on devices deployed at the “edge” of the cloud. These edge devices are highly constrained in terms of memory, computation, and functionality, necessitating the development of more efficient algorithms, data structures, and computational methods.
Such improvements also apply to larger models, where machine learning efficiency can improve by several orders of magnitude with no loss in model accuracy. For example, Microsoft's Bonsai algorithm can be as small as 2 KB yet outperform a typical 40 MB kNN model or a 4 MB neural network. This result may sound insignificant, but comparable accuracy from a model thousands of times smaller is quite impressive. A model this small can run on an Arduino Uno, which has only 2 KB of RAM; in short, such machine learning models can now be built on a $5 microcontroller.
Machine learning is at a crossroads, with two computing paradigms advancing side by side: compute-centric computing and data-centric computing. In the compute-centric paradigm, data is stored and analyzed on instances in data centers; in the data-centric paradigm, processing happens where the data originates. While the compute-centric paradigm appears to be approaching its limits, the data-centric paradigm is only getting started.
Currently, IoT devices and embedded machine learning models are becoming increasingly prevalent; by the end of 2020 there were expected to be over 20 billion active devices. Many of them go unnoticed, such as smart doorbells, smart thermostats, and smartphones that "wake up" when spoken to or picked up. The following sections look at how TinyML works and at its current and future applications.

Figure 2 Hierarchical structure of the cloud.
Previously, various operations performed by devices required complex integrated circuits. Now, the “intelligence” of machine learning hardware is gradually being abstracted into software, making embedded devices simpler, lighter, and more flexible.
Implementing machine learning on embedded devices presents huge challenges, but significant progress has been made. When deploying neural networks on microcontrollers, the key challenges are limited memory, power budgets, and compute.
Smartphones are the most typical example of TinyML. Phones are always in an active listening state for “wake words,” such as “Hey Google” for Android smartphones and “Hey Siri” for iPhones. If the voice wake-up service were run on the smartphone’s CPU (the mainstream iPhone CPU has reached 1.85 GHz), the battery would deplete within just a few hours. Such power consumption is unacceptable, as most people use the voice wake-up service only a few times a day.
To address this issue, developers created dedicated low-power hardware that can run from a small battery (such as a CR2032 coin cell). The circuit stays active even when the CPU is not running, which is typically the case when the screen is off. It draws only about 1 mW and can run for up to a year on a standard CR2032 battery.
While this may not sound impressive, it represents significant progress. Energy is the bottleneck for many electronic devices. Any device that needs mains power is tied to the location of the wiring, and a dozen devices deployed in the same place can quickly overload the supply. Mains power is also inefficient and costly: converting the supply voltage (e.g., 120 V in the U.S.) down to typical circuit voltages (usually around 5 V) wastes a great deal of energy. Laptop users know this well from their chargers; the heat given off by the charger's internal transformer is energy wasted in the voltage conversion.
Even when devices come with batteries, battery life is limited, requiring frequent recharging. Many consumer electronic devices are designed to last for one workday. Some TinyML devices can run continuously for a year on a coin-sized battery, meaning such devices can be deployed in remote environments and communicate only when necessary to save power.
In a smartphone, the wake word service is not the only seamlessly embedded TinyML application. Accelerometer data can be used to determine whether the user has just picked up the phone, thereby waking the CPU and lighting up the screen.
Clearly, these are not the only applications for TinyML. In fact, TinyML offers a wealth of exciting applications for product enthusiasts and businesses to achieve smarter IoT devices. In an era where data is becoming increasingly important, the ability to distribute machine learning resources to remote memory-constrained devices presents vast opportunities for data-intensive industries like agriculture, weather forecasting, or earthquake detection.
Undoubtedly, empowering edge devices to perform data-driven processing will transform the computational paradigm in industrial processes. For example, if it becomes possible to monitor crops and detect characteristics like soil moisture, specific gases (like ethylene released when apples ripen), or specific atmospheric conditions (like strong winds, low temperatures, or high humidity), it will greatly facilitate crop growth and increase yields.
Another example is installing cameras in smart doorbells to use facial recognition to identify visitors. This will enable security functions and can even output the doorbell camera feed to the indoor television screen when someone arrives, allowing the homeowner to see who is at the door.
Currently, the two main focus areas of TinyML are:
- Keyword spotting. Most people are already familiar with this application through "Hey Siri" and "Hey Google," commonly referred to as "hotwords" or "wake words." The device continuously listens to audio from the microphone and is trained to respond only to specific sound sequences that match the learned keywords. These models are simpler and use far fewer resources than full automatic speech recognition (ASR). Devices such as Google smartphones additionally use a cascade architecture with speaker verification for security. (A minimal model sketch follows this list.)
- Visual wake words. Visual wake words are the image-based analogue of wake words: a binary image classification signals presence or absence. For example, a smart lighting system can be designed to turn on when it detects a person and turn off when they leave. Similarly, wildlife photographers can use visual wake functionality to start capturing when a particular animal appears, and security cameras can start recording when human activity is detected.
The following image provides a comprehensive overview of current TinyML machine learning applications.

Figure 3 Machine learning use cases for TinyML. Image source: NXP.
TinyML algorithms work in much the same way as traditional machine learning models and are typically trained as usual on a workstation or in the cloud. The real TinyML work comes after training, in a post-processing stage commonly referred to as "deep compression."

Figure 4 Illustration of deep compression. Source: [ArXiv Paper](https://arxiv.org/pdf/1510.00149.pdf).
Models need to be modified after training to create a more compact representation. The main techniques for this process include pruning and knowledge distillation.
The basic idea of knowledge distillation is to exploit the sparsity or redundancy present in larger networks. Large networks have high representational capacity, but if that capacity is not saturated, a smaller network with fewer neurons and lower representational capacity can encode the same information. In the work published by Hinton et al. in 2015, the embedded information transferred from the teacher model to the student model is referred to as "dark knowledge."
The following image illustrates the process of knowledge distillation:

Figure 5 Illustration of model knowledge distillation.
In the image, the Teacher model is a trained convolutional neural network model, tasked with transferring its “knowledge” to a smaller convolutional network model called the Student model, which has fewer parameters. This process, known as “knowledge distillation,” is used to compress the same knowledge into a smaller network for use on more memory-constrained devices.
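As a concrete and deliberately minimal sketch of how this transfer can be implemented, the snippet below combines a Hinton-style soft-target loss with the usual hard-label loss in TensorFlow. The teacher and student models, the temperature T, and the mixing weight ALPHA are placeholders chosen for illustration.

```python
# A minimal sketch of a Hinton-style distillation loss in TensorFlow.
# `teacher` and `student` are assumed to be Keras models that output raw
# logits; the temperature T and mixing weight ALPHA are illustrative.
import tensorflow as tf

T, ALPHA = 4.0, 0.9  # softening temperature and soft-loss weight (assumptions)

kl_div = tf.keras.losses.KLDivergence()
sparse_ce = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

def distillation_loss(x, y_true, teacher, student):
    teacher_logits = teacher(x, training=False)
    student_logits = student(x, training=True)

    # Soft targets: match the teacher's temperature-softened distribution.
    soft = kl_div(tf.nn.softmax(teacher_logits / T),
                  tf.nn.softmax(student_logits / T)) * (T ** 2)

    # Hard targets: ordinary cross-entropy against the ground-truth labels.
    hard = sparse_ce(y_true, student_logits)

    return ALPHA * soft + (1.0 - ALPHA) * hard
```

The T² factor keeps the gradient magnitudes of the soft loss comparable to the hard loss, as suggested in the original distillation paper.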
Similarly, pruning helps produce a more compact model representation. Broadly speaking, pruning removes neurons that contribute little to the output prediction. In practice this usually means discarding weights with small magnitudes, while larger weights, which matter more during inference, are retained. The network is then retrained on the pruned architecture to fine-tune the output.

Figure 6 Illustration of pruning the knowledge representation of the distilled model.
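A bare-bones version of the magnitude pruning described above can be written directly against a layer's weight matrix: weights whose absolute value falls below a chosen percentile are zeroed out, and the network is then fine-tuned with those weights held at zero. The 80% sparsity target in the sketch is an arbitrary example.

```python
# A minimal sketch of magnitude-based weight pruning with NumPy.
# The 80% sparsity target is an arbitrary illustrative choice.
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float = 0.8) -> np.ndarray:
    """Zero out the smallest-magnitude weights so that `sparsity` of them are removed."""
    threshold = np.percentile(np.abs(weights), sparsity * 100)
    mask = np.abs(weights) >= threshold
    return weights * mask

rng = np.random.default_rng(0)
w = rng.normal(size=(128, 64)).astype(np.float32)   # a toy dense layer's weights
w_pruned = magnitude_prune(w, sparsity=0.8)
print(f"nonzero weights: {np.count_nonzero(w_pruned)} / {w.size}")
# In practice the pruned network is retrained (fine-tuned) with the mask held fixed.
```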
After distillation, the model needs to be quantized to form a format compatible with embedded device architectures.
Why quantize? Consider the Arduino Uno, whose ATmega328P microcontroller performs 8-bit arithmetic. To run a model on the Uno, the weights ideally must be stored as 8-bit integers, unlike the 32-bit or 64-bit floating-point representations used on most desktop and laptop computers. Quantizing the model cuts the storage footprint of the weights to a quarter (from 32 bits to 8 bits), usually with little impact on accuracy (typically 1-3%).
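In its simplest per-tensor form, this mapping is an affine transform: a scale and zero point are derived from the observed value range, values are rounded into int8, and dequantization multiplies back. The NumPy sketch below is a simplified illustration rather than TensorFlow's exact scheme, and it makes the rounding error discussed next easy to see.

```python
# A simplified sketch of per-tensor affine quantization to int8 with NumPy.
import numpy as np

def quantize_int8(x: np.ndarray):
    """Map float32 values to int8 using a scale and zero point derived from the range."""
    qmin, qmax = -128, 127
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

w = np.random.default_rng(0).normal(size=1000).astype(np.float32)  # toy weights
q, scale, zp = quantize_int8(w)
err = np.abs(w - dequantize(q, scale, zp)).max()
print(f"storage: {w.nbytes} bytes -> {q.nbytes} bytes, max rounding error: {err:.4f}")
```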

Figure 7 Illustration of the quantization error introduced by 8-bit encoding and the subsequent reconstruction of 32-bit floating-point numbers. Image source: "[TinyML](https://tinymlbook.com/)".
Some information is inevitably lost to quantization error. For example, on an integer-only platform the floating-point value 3.42 may be truncated to 3. To address this, quantization-aware (QA) training has been proposed as an alternative. QA training essentially restricts the network during training to the values that will be available on the quantized device (see the TensorFlow documentation for examples).
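A minimal quantization-aware training sketch using the TensorFlow Model Optimization toolkit is shown below; the existing Keras model and the training data are assumed placeholders. Fake-quantization ops are inserted so that, during fine-tuning, the weights adapt to the 8-bit constraints they will face on the device.

```python
# A minimal quantization-aware training sketch, assuming the
# tensorflow_model_optimization package and an existing trained Keras
# `model` plus (x_train, y_train) data -- all placeholders for illustration.
import tensorflow_model_optimization as tfmot

q_aware_model = tfmot.quantization.keras.quantize_model(model)  # inserts fake-quant ops
q_aware_model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
# Fine-tune so the weights adapt to the quantization they will undergo on-device.
q_aware_model.fit(x_train, y_train, epochs=2, validation_split=0.1)
```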
Encoding is an optional step that stores the data in a more compact form, further reducing model size; Huffman coding is commonly used.
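Because quantized weights take only a small number of distinct values, they compress well with an entropy code. The sketch below computes Huffman code lengths for a toy array of int8-like weights and estimates the average bits per weight; the weight values and their distribution are fabricated purely for illustration.

```python
# A minimal sketch of estimating Huffman-coded size for quantized weights.
# Only code lengths are computed (enough to estimate compression).
import heapq
from collections import Counter

import numpy as np

def huffman_code_lengths(symbols):
    """Return a {symbol: code_length} dict built from symbol frequencies."""
    freq = Counter(symbols)
    # Heap entries: (total_frequency, unique_tie_breaker, {symbol: depth_so_far}).
    heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

q_weights = np.random.default_rng(0).integers(-8, 8, size=10_000)  # toy int8-like weights
lengths = huffman_code_lengths(q_weights.tolist())
counts = Counter(q_weights.tolist())
avg_bits = sum(lengths[s] * c for s, c in counts.items()) / q_weights.size
print(f"average bits per weight after Huffman coding: {avg_bits:.2f} (vs. 8)")
```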
After quantization and encoding, the model is converted into a format that a lightweight network interpreter can read; the most widely used are TF Lite (about 500 KB) and TF Lite Micro (about 20 KB). The model is then compiled into C or C++ code, the languages most microcontrollers work in for efficient memory use, and executed on the device by the interpreter.
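Under TensorFlow, this conversion step might look like the sketch below: a trained Keras model is converted into a fully int8-quantized TF Lite flatbuffer and then written out as a C array that can be compiled into microcontroller firmware. The model object, representative samples, and file names are placeholders.

```python
# A sketch of converting a trained Keras `model` into an int8 TF Lite
# flatbuffer and emitting it as a C array. `model` and
# `representative_samples` are assumed placeholders.
import tensorflow as tf

def representative_data_gen():
    # Yield a handful of real input examples so the converter can
    # calibrate activation ranges for full-integer quantization.
    for sample in representative_samples:
        yield [sample]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
tflite_model = converter.convert()

# Emit the flatbuffer as a C array (the same effect as `xxd -i model.tflite`).
hex_bytes = ", ".join(f"0x{b:02x}" for b in tflite_model)
with open("model_data.cc", "w") as f:
    f.write(f"const unsigned char g_model[] = {{{hex_bytes}}};\n")
    f.write(f"const int g_model_len = {len(tflite_model)};\n")
```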

Figure 8 Workflow diagram of TinyML applications. Source: “[TinyML](https://tinymlbook.com/)” by Pete Warden and Daniel Situnayake.
Most TinyML technologies address the complexities caused by processing on microcontrollers. TF Lite and TF Lite Micro are very small because they remove all non-essential features. Unfortunately, they also remove some useful features, such as debugging and visualization. This means that if an error occurs during deployment, it may be challenging to determine the cause.
In addition, the model must not only be stored locally on the device; the device must also be able to run inference with it. This means the microcontroller needs enough memory to hold (1) the operating system and software libraries; (2) a neural network interpreter such as TF Lite; (3) the stored network weights and architecture; and (4) the intermediate results produced during inference. TinyML research papers therefore typically report the peak memory usage of a quantized algorithm alongside metrics such as memory footprint, multiply-accumulate (MAC) count, and accuracy.
Training on the device introduces additional complexity. Because of the reduced numerical precision, it is extremely difficult to maintain accuracy sufficient for training. At standard desktop precision, automatic differentiation is essentially exact, with derivatives computed to within roughly 10⁻¹⁶ of machine precision; performing automatic differentiation on 8-bit numbers, however, yields poor results. During backpropagation these derivatives are combined and ultimately used to update the network parameters, and at such low numerical precision the resulting model accuracy can be very poor.
Despite these issues, some neural networks have been trained using 16-bit and 8-bit floating-point numbers.
The first research paper on reduced numerical precision in deep learning, "Deep Learning with Limited Numerical Precision," was published by Suyog Gupta and colleagues in 2015. Its results are quite interesting: they show that 32-bit floating-point representations can be reduced to 16-bit fixed-point representations with almost no loss in accuracy. However, this holds only when stochastic rounding is used, because it produces an unbiased result on average.
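Stochastic rounding rounds a value up or down with probability proportional to its distance from each neighboring grid point, so the rounded value is correct in expectation. The sketch below illustrates this on a fixed-point grid with an arbitrarily chosen 2⁻⁸ step size: a small gradient-like value vanishes under round-to-nearest but survives on average under stochastic rounding.

```python
# A minimal sketch of stochastic rounding to a fixed-point grid, in the
# spirit of Gupta et al. (2015). The 2**-8 step size is an illustrative choice.
import numpy as np

def stochastic_round(x: np.ndarray, step: float = 2.0 ** -8, rng=None) -> np.ndarray:
    """Round to a multiple of `step`, up or down at random so the result is unbiased."""
    if rng is None:
        rng = np.random.default_rng()
    scaled = x / step
    floor = np.floor(scaled)
    # Round up with probability equal to the fractional part.
    rounded = floor + (rng.random(x.shape) < (scaled - floor))
    return rounded * step

rng = np.random.default_rng(0)
x = np.full(100_000, 0.00123)  # a small gradient-like value, below half the step size
print("stochastic rounding mean:", stochastic_round(x, rng=rng).mean())  # ~0.00123 (unbiased)
print("round-to-nearest:        ", np.round(0.00123 * 256) / 256)        # 0.0 -- the update vanishes
```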
In a 2018 paper titled "Training Deep Neural Networks with 8-bit Floating Point Numbers," Naigang Wang and colleagues trained neural networks using 8-bit floating-point numbers. Training with 8-bit values is significantly harder than running inference with them, because the fidelity of gradients must be preserved during backpropagation (where automatic differentiation can otherwise reach machine precision).
Model efficiency can also be improved with custom model architectures. A good example is MobileNet V1 and MobileNet V2, which have been widely deployed on mobile devices. These architectures essentially recast the convolution operation to make it more computationally efficient; the resulting, more efficient form is known as the depthwise separable convolution. Architecture latency can be optimized further through hardware-aware profiling and neural architecture search, but this article will not elaborate on those techniques.
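The savings from depthwise separable convolutions are easy to see by counting parameters in Keras: the sketch below compares a standard 3x3 convolution with a depthwise 3x3 followed by a pointwise 1x1, using illustrative channel counts.

```python
# A sketch comparing a standard convolution with a depthwise separable one
# in Keras; the 64 -> 128 channel counts and 56x56 input are illustrative.
import tensorflow as tf

standard = tf.keras.Sequential([
    tf.keras.Input(shape=(56, 56, 64)),
    tf.keras.layers.Conv2D(128, 3, padding="same"),
])

separable = tf.keras.Sequential([
    tf.keras.Input(shape=(56, 56, 64)),
    tf.keras.layers.DepthwiseConv2D(3, padding="same"),  # one 3x3 filter per input channel
    tf.keras.layers.Conv2D(128, 1, padding="same"),      # pointwise 1x1 mixes channels
])

print("standard conv parameters:  ", standard.count_params())   # 3*3*64*128 + 128 = 73,856
print("separable conv parameters: ", separable.count_params())  # (3*3*64 + 64) + (64*128 + 128) = 8,960
```

For these channel counts the separable form uses roughly eight times fewer parameters (and proportionally fewer multiply-accumulates), which is the effect MobileNet exploits throughout its architecture.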
Running machine learning models on resource-constrained devices opens the door to many new applications. Advances that make standard machine learning more energy-efficient will also help ease concerns about the environmental impact of data science. Moreover, TinyML lets embedded devices host data-driven algorithms, bringing new intelligence to scenarios ranging from preventive maintenance to detecting bird calls in forests.
While continuing to scale up models is a firm direction for some machine learning practitioners, a trend toward more memory-, compute-, and energy-efficient machine learning algorithms is also emerging. TinyML is still in its infancy, and experts in the area are few. The references in this article list some important papers in the field of TinyML, and interested readers are encouraged to read them. This direction is growing rapidly and is expected to become an important new industrial application of artificial intelligence in the coming years. Stay tuned.
Matthew Stewart, Ph.D. student in Environmental and Data Science at Harvard University, Machine Learning Consultant at Critical Future, personal blog: https://mpstewart.net
References
[1] Hinton, Geoffrey & Vinyals, Oriol & Dean, Jeff. (2015). Distilling the Knowledge in a Neural Network.
[2] D. Bankman, L. Yang, B. Moons, M. Verhelst and B. Murmann, “An always-on 3.8μJ/86% CIFAR-10 mixed-signal binary CNN processor with all memory on chip in 28nm CMOS,” 2018 IEEE International Solid-State Circuits Conference — (ISSCC), San Francisco, CA, 2018, pp. 222–224, doi: 10.1109/ISSCC.2018.8310264.
[3] Warden, P. (2018). Why the Future of Machine Learning is Tiny. Pete Warden’s Blog.
[4] Ward-Foxton, S. (2020). AI Sound Recognition on a Cortex-M0: Data is King. EE Times.
[5] Levy, M. (2020). Deep Learning on MCUs is the Future of Edge Computing. EE Times.
[6] Gruenstein, Alexander & Alvarez, Raziel & Thornton, Chris & Ghodrat, Mohammadali. (2017). A Cascade Architecture for Keyword Spotting on Mobile Devices.
[7] Kumar, Ashish & Goyal, Saurabh & Varma, Manik. (2017). Resource-efficient Machine Learning in 2 KB RAM for the Internet of Things.
[8] Zhang, Yundong & Suda, Naveen & Lai, Liangzhen & Chandra, Vikas. (2017). Hello Edge: Keyword Spotting on Microcontrollers.
[9] Fedorov, Igor & Stamenovic, Marko & Jensen, Carl & Yang, Li-Chia & Mandell, Ari & Gan, Yiming & Mattina, Matthew & Whatmough, Paul. (2020). TinyLSTMs: Efficient Neural Speech Enhancement for Hearing Aids.
[10] Lin, Ji & Chen, Wei-Ming & Lin, Yujun & Cohn, John & Gan, Chuang & Han, Song. (2020). MCUNet: Tiny Deep Learning on IoT Devices.
[11] Chen, Tianqi & Moreau, Thierry. (2020). TVM: An Automated End-to-End Optimizing Compiler for Deep Learning.
[12] Weber, Logan, and Reusch, Andrew (2020). TinyML — How TVM is Taming Tiny.
[13] Krishnamoorthi, Raghuraman. (2018). Quantizing deep convolutional networks for efficient inference: A whitepaper.
[14] Yosinski, Jason & Clune, Jeff & Bengio, Y. & Lipson, Hod. (2014). How transferable are features in deep neural networks?.
[15] Lai, Liangzhen & Suda, Naveen & Chandra, Vikas. (2018). CMSIS-NN: Efficient Neural Network Kernels for Arm Cortex-M CPUs.
[16] Chowdhery, Aakanksha & Warden, Pete & Shlens, Jonathon & Howard, Andrew & Rhodes, Rocky. (2019). Visual Wake Words Dataset.
[17] Warden, Pete. (2018). Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition.
[18] Zemlyanikin, Maxim & Smorkalov, Alexander & Khanova, Tatiana & Petrovicheva, Anna & Serebryakov, Grigory. (2019). 512KiB RAM Is Enough! Live Camera Face Recognition DNN on MCU. 2493–2500. 10.1109/ICCVW.2019.00305.
Original link:
https://towardsdatascience.com/tiny-machine-learning-the-next-ai-revolution-495c26463868
