TinyML: The Next Wave of Artificial Intelligence Revolution

This article is adapted from: 51cto, Author: Matthew Stewart
Artificial Intelligence (AI) is rapidly moving from the “cloud” to the “edge”, making its way into ever smaller IoT devices. Machine learning performed on microprocessors at the endpoint and edge is known as Tiny Machine Learning, or TinyML. More precisely, TinyML refers to the methods, tools, and techniques engineers use to implement machine learning on devices operating at milliwatt (mW) power levels and below.


Figure 1 Overview of TinyML on Embedded Devices

Over the past decade, we have witnessed an exponential growth in the scale of machine learning algorithms due to increased processor speeds and the emergence of big data. At the TinyML 2020 Summit, companies like NVIDIA, ARM, Qualcomm, Google, Microsoft, and Samsung showcased the latest achievements in micro machine learning. The summit reached several important conclusions:
  • For many application scenarios, TinyML technology and hardware have evolved to a practical stage;

  • Significant breakthroughs have been made in algorithms, networks, and ML models below 100 KB;

  • There is a rapid growth in low-power demands in the visual and audio fields.

TinyML is an interdisciplinary field of machine learning and embedded IoT devices, representing an emerging engineering discipline with the potential to revolutionize many industries.
The main beneficiaries of TinyML are edge computing and energy-efficient computing. TinyML grew out of the Internet of Things (IoT). The traditional IoT approach is to send data from local devices to the cloud for processing, which raises concerns around privacy, latency, storage, and energy efficiency.
  • Energy Efficiency: Data transmission, whether wired or wireless, is highly energy-consuming, approximately an order of magnitude more than local computation using multiply-accumulate units (MAUs). The most energy-efficient approach is to develop IoT systems with local data processing capabilities. The “data-centric” computing philosophy has been explored by some AI pioneers and is now being applied.

  • Privacy: There are privacy risks in data transmission. Data can be intercepted by malicious actors, and when stored in a single location like the cloud, the inherent security of the data decreases. By keeping most data on the device, communication needs can be minimized, thereby enhancing security and privacy.

  • Storage: Much of the data collected by IoT devices is useless. Imagine a security camera recording the entrance to a building 24/7. For most of the day, the camera is not performing any useful function as no anomalies occur. Implementing smarter systems that activate only when necessary can reduce storage capacity needs and consequently the amount of data that needs to be transmitted to the cloud.

  • Latency: Standard IoT devices, such as Amazon Alexa, need to transmit data to the cloud for processing and then respond based on the algorithm’s output. In this sense, the device is merely a convenient gateway to the cloud model, akin to a carrier pigeon between the device and Amazon servers. The device itself is not intelligent, and the response speed entirely depends on internet performance. If the internet speed is slow, then Amazon Alexa’s response will also be slow. Smart IoT devices with built-in automatic speech recognition reduce or even eliminate dependence on external communication, thus lowering latency.

The issues above are driving the development of edge computing. The concept of edge computing is to implement data processing functions on devices deployed at the “edge” of the cloud. These edge devices are highly limited in terms of memory, computation, and functionality, necessitating the development of more efficient algorithms, data structures, and computational methods.
Such improvements also apply to larger models, yielding several orders of magnitude gains in machine learning model efficiency without reducing accuracy. For example, Microsoft’s Bonsai algorithm can be as small as 2 KB yet outperform a typical 40 MB kNN model or a 4 MB neural network. This result may sound insignificant, but put differently, achieving the same accuracy with a model ten thousand times smaller is quite impressive. Models this small can run on an Arduino Uno with 2 KB of RAM; in short, such machine learning models can now be built on microcontrollers that cost $5.
Machine learning is at a crossroads, with two computing paradigms advancing simultaneously: computation-centric computing and data-centric computing.
In the computation-centric paradigm, data is stored and analyzed on instances in data centers; whereas in the data-centric paradigm, processing is executed at the raw location of the data.
While the current computation-centric paradigm seems to be approaching its limits, the data-centric computing paradigm is only just beginning.

Currently, IoT devices and embedded machine learning models are becoming increasingly common, though many of them go largely unnoticed, such as smart doorbells, smart thermostats, and smartphones that “wake up” when spoken to or picked up.

Figure 2 Hierarchical Structure of the Cloud
The following will delve into how TinyML works and its current and future applications.
Application Areas
Currently, the two main application areas of TinyML are:
Keyword Detection: Most people are familiar with this application from keywords such as “Hey, Siri” and “Hey, Google”, often called “hot words” or “wake words”. The device continuously listens to audio from the microphone and is trained to respond only to specific sound sequences matching the learned keyword. These models are simpler and consume far fewer resources than full automatic speech recognition (ASR). Devices such as Google smartphones also use a cascade architecture to provide speaker verification for security.
Visual Wake Words: Visual wake words use images as a substitute for wake words, indicating presence or absence through binary classification of images. For example, designing a smart lighting system that activates upon detecting a person’s presence and turns off when they leave. Similarly, wildlife photographers can use visual wake functions to start capturing when specific animals appear, and security cameras can start recording upon detecting human activity.

The following diagram comprehensively displays the current applications of TinyML machine learning.

Figure 3 Overview of TinyML Machine Learning Use Cases (Image Source: NXP)
How TinyML Works
The working mechanism of TinyML algorithms is almost identical to traditional machine learning models, typically involving model training on the user’s computer or in the cloud. The real power of TinyML comes into play during post-processing, commonly referred to as “deep compression”.


Figure 4 Illustration of Deep Compression (Source: [ArXiv Paper](https://arxiv.org/pdf/1510.00149.pdf))
Model Distillation
After training, models need to be modified to create a more compact representation. The main techniques for achieving this include pruning and knowledge distillation.
The basic idea of knowledge distillation is to exploit the sparsity or redundancy present within a larger network. While large-scale networks have higher representational capacity, if that capacity is not saturated, a smaller network (with fewer neurons) can represent the same function. In the research published by Hinton et al. in 2015, the embedded information transferred from the Teacher model to the Student model is referred to as “dark knowledge”.

The following diagram illustrates the process of knowledge distillation:

Figure 5 Illustration of Knowledge Representation in Distilled Models
In the diagram, the Teacher model is a trained convolutional neural network model tasked with transferring its “knowledge” to a smaller convolutional network model called the Student model, which has fewer parameters. This process is known as “knowledge distillation”, which is used to encapsulate the same knowledge in a smaller network, thereby achieving a form of network compression suitable for memory-constrained devices.
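As a rough illustration (a sketch, not code from the original article), the distillation objective of Hinton et al. can be written as a weighted sum of a softened cross-entropy against the Teacher's outputs and a standard cross-entropy against the hard label. The temperature `T`, the weight `alpha`, and the example logits below are illustrative choices:

```python
import numpy as np

def softmax(z, T=1.0):
    """Softmax with temperature T; higher T softens the distribution."""
    z = np.asarray(z, dtype=np.float64) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, true_label, T=4.0, alpha=0.7):
    """Hinton-style distillation loss: cross-entropy against the Teacher's
    softened outputs ("dark knowledge") plus cross-entropy against the
    hard ground-truth label."""
    p_teacher = softmax(teacher_logits, T)
    p_student_T = softmax(student_logits, T)
    soft_ce = -np.sum(p_teacher * np.log(p_student_T))
    hard_ce = -np.log(softmax(student_logits)[true_label])
    # The T**2 factor keeps gradient magnitudes comparable across temperatures.
    return alpha * (T ** 2) * soft_ce + (1 - alpha) * hard_ce

teacher = np.array([8.0, 2.0, 1.0])   # confident Teacher logits (made up)
student = np.array([5.0, 1.5, 1.0])   # smaller Student with a similar ranking
print(distillation_loss(student, teacher, true_label=0))
```

A Student whose ranking disagrees with the Teacher incurs a visibly larger loss, which is the signal that transfers the Teacher's knowledge during training.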

Similarly, pruning helps achieve a more compact model representation. Broadly speaking, pruning aims to remove neurons that contribute little to output predictions. It typically targets connections with small weights, while larger weights are retained because they matter more during inference. The network is then retrained on the pruned architecture to fine-tune the output.

Figure 6 Illustration of Pruning Knowledge Representation in Distilled Models
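A minimal sketch of magnitude-based pruning (an illustration of the idea, not the exact procedure of any particular framework):

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.5):
    """Zero out the smallest-magnitude fraction of weights.
    (If several weights tie at the threshold, all of them are pruned.)
    In practice the network is then fine-tuned on the surviving weights."""
    flat = np.abs(weights).flatten()
    k = int(len(flat) * sparsity)
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask

w = np.array([[0.9, -0.05], [0.02, -1.2]])
print(magnitude_prune(w, sparsity=0.5))  # the two small weights become 0
```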
Quantization
After distillation, the model is quantized into a format compatible with the embedded device’s architecture.

Why perform quantization? Consider the Arduino Uno, which uses the ATmega328P microcontroller and performs 8-bit arithmetic. To run a model on the Uno, its weights must be stored as 8-bit integer values, whereas most desktop and laptop computers use 32-bit or 64-bit floating-point representations. Quantizing the model cuts weight storage to one quarter, from 32 bits down to 8 bits, typically with only a minor impact on accuracy, around 1–3%.
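As an illustrative sketch (simplified relative to what production toolchains such as TF Lite actually do), asymmetric int8 quantization can be expressed with a scale and a zero point; the weight values below are made up:

```python
import numpy as np

def quantize_int8(weights):
    """Affine (asymmetric) quantization of float weights to int8.
    scale and zero_point map the observed float range onto [-128, 127];
    storage drops from 4 bytes to 1 byte per weight.
    Assumes the weights are not all identical (scale would be zero)."""
    w_min, w_max = float(weights.min()), float(weights.max())
    scale = (w_max - w_min) / 255.0
    zero_point = int(round(-128 - w_min / scale))
    q = np.clip(np.round(weights / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate float weights from the int8 representation."""
    return (q.astype(np.float32) - zero_point) * scale

w = np.array([-1.0, -0.2, 0.0, 0.5, 1.0], dtype=np.float32)
q, s, z = quantize_int8(w)
print(q)
print(np.abs(dequantize(q, s, z) - w).max())  # round-trip error stays below one step
```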

Figure 7 Illustration of Quantization Error during 8-bit Encoding (Image Source: [TinyML](https://tinymlbook.com/))
Because of quantization error, some information is lost during quantization. For example, on an integer-only platform, the floating-point value 3.42 may be truncated to 3. To address this, quantization-aware (QA) training has been proposed as an alternative. QA training essentially restricts the network, during training, to the values that will be available on the quantized device, so the learned weights are robust to the precision loss.
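QA training is commonly implemented by inserting a “fake quantization” (quantize-then-dequantize) operation into the forward pass. A minimal NumPy sketch of that operation, with an illustrative input (including the 3.42 from the example above):

```python
import numpy as np

def fake_quantize(x, num_bits=8):
    """Quantize-then-dequantize ("fake quantization"): the rounding an
    8-bit device would apply is simulated in floating point, so a network
    trained with this op in its forward pass learns weights that survive
    the precision loss. zero_point is kept as a float for simplicity;
    real integer kernels would round it as well."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = qmin - x.min() / scale
    q = np.clip(np.round(x / scale + zero_point), qmin, qmax)
    return (q - zero_point) * scale

x = np.array([3.42, -1.7, 0.0, 2.1])
print(fake_quantize(x))  # each value snapped to the nearest 8-bit grid point
```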
Huffman Coding
Coding is an optional step that stores the data in the most efficient way possible, further reducing model size; Huffman coding is commonly used.
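For illustration, here is a compact Huffman coder over a stream of hypothetical quantized weight values; frequent symbols, such as the zeros left behind by pruning, receive the shortest bit strings:

```python
import heapq
from collections import Counter

def huffman_codes(symbols):
    """Build a Huffman code table for a symbol stream.
    Each heap entry is [total_count, tiebreak_id, {symbol: bitstring}]."""
    freq = Counter(symbols)
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    heap = [[count, i, {sym: ""}] for i, (sym, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)
        hi = heapq.heappop(heap)
        # Prepend a bit to every code in each merged subtree.
        lo[2] = {s: "0" + c for s, c in lo[2].items()}
        hi[2] = {s: "1" + c for s, c in hi[2].items()}
        heapq.heappush(heap, [lo[0] + hi[0], next_id, {**lo[2], **hi[2]}])
        next_id += 1
    return heap[0][2]

weights = [0, 0, 0, 0, 0, 0, 3, 3, -2, 5]   # made-up quantized weights
codes = huffman_codes(weights)
bits = sum(len(codes[w]) for w in weights)
print(codes, bits)  # zero gets the shortest code; far fewer bits than 8 per symbol
```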
Compilation

After quantization and coding, the model is converted into a format that a lightweight network interpreter can load; the most widely used are TF Lite (approximately 500 KB) and TF Lite Micro (approximately 20 KB). The model is compiled into C or C++ code, which most microcontrollers can execute efficiently within their memory limits, and is run on the device by the interpreter.

Figure 8 Workflow Diagram of TinyML Applications (Source: [TinyML](https://tinymlbook.com/))
Most TinyML technologies address the complexities caused by processing on microcontrollers. TF Lite and TF Lite Micro are very small because they remove all non-essential features. Unfortunately, they also remove some useful features, such as debugging and visualization. This means that if errors occur during deployment, it may be difficult to determine the cause.
Why Not Train on the Device?
Training on the device introduces additional complexity. Because of the reduced numerical precision, it is extremely difficult to guarantee the accuracy needed to train a network. At the precision of a standard desktop computer, automatic differentiation is accurate to roughly 10^{-16}, but performing automatic differentiation on 8-bit numbers yields poor results. During backpropagation, derivatives are chained together and ultimately used to update the network’s parameters; at such low numerical precision, the resulting model accuracy may be poor.
Despite these issues, some neural networks have been trained using 16-bit and 8-bit floating-point numbers.
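The precision problem can be seen in a toy experiment (my own sketch, using a central finite difference as a stand-in for automatic differentiation): on an 8-bit value grid, the two nearby function evaluations often round to the same level, so the gradient signal vanishes entirely:

```python
import numpy as np

def quantize_to_grid(x, num_bits=8, lo=-4.0, hi=4.0):
    """Round x to the nearest of 2**num_bits evenly spaced values in [lo, hi],
    mimicking low-precision storage (the grid range here is illustrative)."""
    step = (hi - lo) / (2 ** num_bits - 1)
    return lo + np.round((x - lo) / step) * step

f = np.sin
x, h = 1.0, 1e-3

exact = np.cos(x)
fd64 = (f(x + h) - f(x - h)) / (2 * h)                                # float64
fd8 = (quantize_to_grid(f(x + h)) - quantize_to_grid(f(x - h))) / (2 * h)  # 8-bit grid

print(abs(fd64 - exact))  # tiny error
print(abs(fd8 - exact))   # large error: rounding swamps the gradient signal
```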
How Is Computational Efficiency?
Customizing model architectures can improve computational efficiency. Good examples are MobileNet V1 and MobileNet V2, architectures widely deployed on mobile devices. They are essentially convolutional neural networks that achieve higher computational efficiency by recasting standard convolutions as depthwise separable convolutions. Techniques such as hardware-based profiling and neural architecture search can also be used to optimize for latency on the target architecture.
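The saving is easy to quantify. A standard k×k convolution with c_in input and c_out output channels has k·k·c_in·c_out weights, while the depthwise separable version uses one k×k filter per input channel followed by a 1×1 pointwise convolution. The channel counts below are illustrative:

```python
def conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (biases omitted)."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Depthwise (k x k per input channel) + pointwise (1 x 1) convolution."""
    return k * k * c_in + c_in * c_out

std = conv_params(3, 64, 128)                 # 73,728 weights
sep = depthwise_separable_params(3, 64, 128)  # 8,768 weights
print(std, sep, round(std / sep, 1))          # roughly an 8x reduction
```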
A New Wave of Artificial Intelligence Revolution
Running machine learning models on resource-constrained devices opens doors to many new applications. Technological advancements that make standard machine learning more energy-efficient will help alleviate some concerns about the environmental impact of data science. Additionally, TinyML supports embedded devices to incorporate new intelligence based on data-driven algorithms, thus applying it in various scenarios from preventive maintenance to detecting bird calls in forests.
While continuing to scale up models is a firm direction for some machine learning practitioners, the development of machine learning algorithms that are more memory, computation, and energy-efficient is also a new trend. TinyML is still in its infancy, with few experts in this direction. This field is growing rapidly and will become an important new application of AI in the industrial sector in the coming years.
Arduino Technology Exchange Group
Why introduce TinyML? Because TinyML brings artificial intelligence to the microcontrollers inside vast numbers of IoT devices and is expected to become an important new industrial application of AI in the coming years. Moreover, the Arduino boards we will be using next are suitable for TinyML project development, so it helps to understand TinyML first; project walkthroughs will follow.



Hard He Academy

The Hard He team is dedicated to providing standardized core skill courses for electronic engineers and related students, helping everyone effectively enhance their professional capabilities at various stages of learning and work.

