
Currently, most mainstream smart voice solutions on the market combine online voice recognition with platform content to build a rich smart home ecosystem. However, the online, connected approach also introduces many uncertainties: privacy and security, network latency, interaction speed, and connection stability. Offline, localized voice control at the edge has therefore become another option for users.
Modern voice recognition can be traced back to 1952, when Davis and colleagues developed the world's first experimental system capable of recognizing 10 English digits, officially starting the field of voice recognition. Voice recognition has developed over more than 70 years; from a technical perspective, its development can generally be divided into three stages.
In terms of the scope of recognition, voice recognition can be divided into “open domain recognition” and “closed domain recognition”:
| Open Domain Recognition | Closed Domain Recognition |
|---|---|
| No pre-set recognition word set required | Requires a pre-set, limited word set |
| Models are generally larger, with heavy engine computation | Recognition engine requires fewer hardware resources |
| Relies on cloud-based "online" recognition | "Offline" voice recognition deployed on embedded devices |
| Mainly aimed at multi-turn dialogue interaction | Mainly aimed at simple device-control scenarios |
Edge voice recognition falls under the category of closed domain recognition.
The workflow of voice recognition technology is shown in the following diagram:
- The device microphone receives the raw voice, and the ADC converts the analog signal into a digital signal.
- The acoustic front-end module performs echo cancellation, noise suppression, and voice activity detection to remove non-speech segments, then extracts MFCC features from the remaining speech signal (a feature-extraction sketch follows this list). For detailed steps, refer to: 【TinyML】Tflite-micro implements offline command recognition on ESP32
- The back-end takes the voice features as input to the acoustic model for inference, combines the result with the language model to score each candidate command, looks up the command set using the highest score, and outputs the recognition result (a command-scoring sketch follows this list). The back-end stage can be understood as decoding the feature vectors into text, which involves two models:
  - Acoustic Model: models the sounds themselves, converting voice input into an acoustic representation; more precisely, it gives the probability that the speech belongs to a given acoustic unit. In English this unit can be a syllable or a smaller phoneme; in Chinese it can be an initial/final or, as in English, a phoneme.
  - Language Model: resolves homophones by finding the most probable word sequence among the candidates produced from the pronunciation sequence output by the acoustic model. It also constrains and re-scores the acoustic decoding so that the final result conforms to grammatical rules. The most common choices are N-gram language models and RNN-based language models.
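As a rough illustration of the front-end feature extraction step, the sketch below computes MFCC features from a mono 16 kHz recording. It assumes the `librosa` package and a local file `command.wav`; both are illustrative choices, not part of the original text.

```python
# Minimal MFCC front-end sketch (assumes: pip install librosa soundfile)
import librosa
import numpy as np

# Load a mono recording at 16 kHz; "command.wav" is a placeholder file name.
audio, sr = librosa.load("command.wav", sr=16000, mono=True)

# 13 MFCCs per frame, 25 ms window, 10 ms hop -- typical keyword-spotting settings.
mfcc = librosa.feature.mfcc(
    y=audio,
    sr=sr,
    n_mfcc=13,
    n_fft=int(0.025 * sr),
    hop_length=int(0.010 * sr),
)

# Shape is (n_mfcc, n_frames); most acoustic models expect (n_frames, n_mfcc).
features = mfcc.T.astype(np.float32)
print(features.shape)
```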
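The back-end scoring step is easier to picture with code. The toy sketch below assumes a closed-domain setup in which the acoustic model already yields a log-probability per candidate command, and a simple usage prior stands in for the language model; the command list, the scores, and the `lm_weight` parameter are all made up for illustration.

```python
# Toy closed-domain decoding: combine acoustic and language-model scores.
import math

# Hypothetical command set with a usage-frequency prior (stands in for the LM).
command_prior = {"turn on the light": 0.4, "turn off the light": 0.4, "open the curtain": 0.2}

# Hypothetical acoustic-model output: log P(audio | command) for each candidate.
acoustic_logprob = {"turn on the light": -4.1, "turn off the light": -3.8, "open the curtain": -9.5}

LM_WEIGHT = 0.5  # how much the prior is allowed to re-rank the acoustic result

def decode(acoustic_logprob, command_prior, lm_weight=LM_WEIGHT):
    """Return the command with the best combined score."""
    best_cmd, best_score = None, -math.inf
    for cmd, am_score in acoustic_logprob.items():
        score = am_score + lm_weight * math.log(command_prior[cmd])
        if score > best_score:
            best_cmd, best_score = cmd, score
    return best_cmd, best_score

print(decode(acoustic_logprob, command_prior))  # -> ('turn off the light', ...)
```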
Running recognition on resource-constrained edge devices also requires work in several areas:
- Model Optimization and Compression: optimization approaches such as hyperdimensional computing, memory-swapping mechanisms, and constrained neural architecture search reduce model size and computational demand through techniques like quantization, pruning, and knowledge distillation while maintaining high accuracy (a quantization sketch follows this list).
- Low-Latency Real-Time Processing: low-latency acoustic feature extraction, improved model inference methods, and streaming recognition techniques keep accuracy high while meeting real-time requirements (a streaming sketch follows this list).
- Edge Acoustic Environment Adaptation: noise-suppression algorithms, data augmentation so the acoustic model is trained across varied noise environments, and multi-channel audio processing (a noise-augmentation sketch follows this list).
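As one concrete example of model compression, the sketch below applies TensorFlow Lite post-training int8 quantization to a trained keyword-spotting model. The SavedModel path, input shape, and representative-data generator are assumptions for illustration; the conversion API itself is standard TensorFlow Lite.

```python
# Post-training int8 quantization sketch (assumes TensorFlow 2.x is installed).
import numpy as np
import tensorflow as tf

def representative_data_gen():
    # Yield a few feature batches shaped like the model input.
    # Random data is a stand-in for real MFCC frames from the training set.
    for _ in range(100):
        yield [np.random.rand(1, 49, 13, 1).astype(np.float32)]

# "kws_saved_model" is a hypothetical path to a trained keyword-spotting model.
converter = tf.lite.TFLiteConverter.from_saved_model("kws_saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
with open("kws_int8.tflite", "wb") as f:
    f.write(tflite_model)
print(f"Quantized model size: {len(tflite_model)} bytes")
```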
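To illustrate the streaming idea, the sketch below feeds fixed-size hops of feature frames into a sliding window and runs inference only when the window is full. The hop and window sizes are illustrative, and `run_inference` is a hypothetical placeholder for whatever acoustic model is actually deployed.

```python
# Sliding-window streaming sketch; run_inference() is a hypothetical model call.
from collections import deque
import numpy as np

HOP_FRAMES = 10        # new feature frames arriving per audio hop
WINDOW_FRAMES = 49     # frames the model expects per inference
N_MFCC = 13

window = deque(maxlen=WINDOW_FRAMES)

def run_inference(features):
    # Placeholder: a real system would call the deployed acoustic model here.
    return {"command": "noise", "score": float(features.mean())}

def on_new_hop(hop_features):
    """Called whenever HOP_FRAMES new MFCC frames become available."""
    window.extend(hop_features)
    if len(window) == WINDOW_FRAMES:
        features = np.stack(window)          # shape (WINDOW_FRAMES, N_MFCC)
        return run_inference(features)
    return None  # not enough context yet

# Simulated stream: random frames stand in for live MFCCs.
for _ in range(8):
    result = on_new_hop(np.random.rand(HOP_FRAMES, N_MFCC).astype(np.float32))
    if result:
        print(result)
```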
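For acoustic environment adaptation, one common data-augmentation trick is to mix background noise into clean training clips at a chosen signal-to-noise ratio. The sketch below does exactly that with NumPy; the SNR value and the way the clean/noise arrays are obtained are illustrative assumptions.

```python
# Mix noise into a clean clip at a target SNR (dB) -- a simple augmentation sketch.
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Return clean speech with noise added at the requested SNR."""
    # Loop or trim the noise so it matches the clean clip's length.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[: len(clean)]

    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that 10*log10(clean_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

# Illustrative 1-second clips at 16 kHz (random data stands in for real audio).
sr = 16000
clean = np.random.randn(sr).astype(np.float32) * 0.1
noise = np.random.randn(sr).astype(np.float32)
augmented = mix_at_snr(clean, noise, snr_db=10)
```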
Open Source Tools
| Tool | Description | Programming Language | Link |
|---|---|---|---|
| Kaldi | Kaldi is a powerful speech recognition toolkit that supports various acoustic modeling and decoding techniques. It provides a series of tools and libraries for acoustic model training, feature extraction, decoding, etc. | C++ / Shell / Python | https://github.com/kaldi-asr/kaldi |
| vosk-api | Vosk is an offline open-source speech recognition tool. It can recognize 16 languages, including Chinese. | Python | https://github.com/alphacep/vosk-api |
| PocketSphinx | A small speech recognition engine developed by Carnegie Mellon University, suitable for embedded systems and mobile applications. | Python / C/C++ | https://github.com/cmusphinx/pocketsphinx |
| DeepSpeech | An open-source speech recognition engine developed by Mozilla, using RNNs and CNNs to process acoustic features, with pre-trained models available. | Python | https://github.com/mozilla/DeepSpeech |
| Julius | An open-source large-vocabulary continuous speech recognition engine supporting multiple languages and models. | C/C++ | https://github.com/julius-speech/julius |
| HTK | HTK is a toolkit for building hidden Markov models (HMMs). | C | https://htk.eng.cam.ac.uk/ |
| ESPnet | An end-to-end speech processing toolkit covering tasks such as speech recognition and synthesis. | Python / Shell | https://github.com/espnet/espnet |
Open Source Inference Frameworks
| Platform Framework | Description | Programming Language |
|---|---|---|
| TensorFlow Lite | A lightweight machine learning inference framework developed by Google specifically for mobile devices, embedded systems, and edge devices. | C++ / Python |
| uTVM | uTVM is a branch of TVM focused on low-latency, efficient deep learning model inference on embedded systems, edge devices, and IoT devices. | Python / C++ |
| Edge Impulse | Edge Impulse is an end-to-end platform for developing and deploying machine learning models for IoT devices, supporting rich sensor and data integration, model development tools, model deployment, and inference. | C/C++ |
| NCNN | ncnn is an optimized neural network computing library focused on high-performance, efficient deep learning model inference on resource-constrained devices. | C++ |
Open Source Databases
| Database | Description | Link |
|---|---|---|
| TIMIT | TIMIT is a widely used dataset for speech recognition research, containing read sentences recorded by American English speakers of various accents, genders, and ages, used for training and testing speech recognition systems. | https://catalog.ldc.upenn.edu/LDC93s1 |
| LibriSpeech | LibriSpeech is a large speech recognition dataset containing audio and text from public-domain English readings. | http://www.openslr.org/94/ |
| Speech Ocean | Speech Ocean provides multi-language, cross-domain, and cross-modal AI data and related data services to the industry. | https://www.speechocean.com/dsvoice/catid-52.htm |
| Data Tang | Data Tang is a Chinese AI data service company providing training datasets, data collection, and annotation customization services. | https://www.datatang.com/ |
| Supplier | Chip Module | Wake-up and Command Customization | Iteration Method |
|---|---|---|---|
| Qiying Tailun | CI120 Series, CI130 Series, CI230 Series | https://aiplatform.chipintelli.com/home/index.html | Serial Port Burning / OTA |
| Hailin Technology | HLK-V20 | http://voice.hlktech.com/yunSound/public/toWebLogin | Serial Port Burning |
| Anxinke | VC Series Modules | https://udp.hivoice.cn/solutionai | Serial Port Burning |
| Jixin Intelligent | SU-03T, SU-1X Series, SU-3X Series | https://udp.hivoice.cn/solutionai | Serial Port Burning |
| Weichuang Zhiyin | WTK6900 | Offline Customization Service | – |
| Jiuzhou Electronics | NRK330x Series | Offline Customization Service | – |
Voice chip modules usually do not require external circuits; they can work with just power, a microphone, and a speaker connected.
| Manufacturer | Framework | Related Link |
|---|---|---|
| Espressif | esp-adf voice development framework, based on the ESP-IDF infrastructure and the esp-sr voice recognition algorithm library. | https://github.com/espressif/esp-adf/tree/master |
| STM32 | Provides an end-to-end solution allowing developers to quickly deploy various AI models on STM32 microcontrollers. | https://stm32ai.st.com/stm32-cube-ai/ |
| Silicon Labs | Provides the MLTK toolset. | https://siliconlabs.github.io/mltk/ |
The hardware framework used is shown in the figure above. Since neural network inference is required, models such as LSTM or Seq2Seq are generally introduced to ensure recognition accuracy, which makes the final model files larger and imposes certain memory requirements at runtime, so external Flash or an SD card is usually needed.

The biggest advantage of neural-network-based open-source frameworks is full control over the entire process, including model training and deployment; the downside is that the process is longer, and designing and tuning the network model is a significant challenge. The main process is shown in the figure above, and detailed steps can be found in 【TinyML】Tflite-micro implements offline command recognition on ESP32.
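One step in that deployment flow that is easy to gloss over is embedding the converted model into the firmware. The sketch below, a minimal stand-in for `xxd -i`, turns a `.tflite` flatbuffer into a C array that tflite-micro can consume; the file and symbol names are placeholders, not from the original article.

```python
# Turn a .tflite flatbuffer into a C header for tflite-micro (file names are placeholders).
def tflite_to_c_array(tflite_path="kws_int8.tflite", header_path="kws_model_data.h",
                      var_name="g_kws_model_data"):
    with open(tflite_path, "rb") as f:
        data = f.read()

    lines = [f"// Auto-generated from {tflite_path}",
             f"const unsigned char {var_name}[] = {{"]
    # Emit 12 bytes per line as hex literals.
    for i in range(0, len(data), 12):
        chunk = ", ".join(f"0x{b:02x}" for b in data[i:i + 12])
        lines.append(f"  {chunk},")
    lines.append("};")
    lines.append(f"const unsigned int {var_name}_len = {len(data)};")

    with open(header_path, "w") as f:
        f.write("\n".join(lines) + "\n")

tflite_to_c_array()
```

On the command line, `xxd -i kws_int8.tflite > kws_model_data.h` produces an equivalent header.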
Neural network chips usually have high computing power and can meet complex functional requirements, but they are also more expensive; they are mainly used in smart speakers and intelligent voice control scenarios.
| Manufacturer | Model |
|---|---|
| Allwinner Technology | R328, R58, R16, H6, F1C600 |
| Amlogic | A113X, A112, S905D |
| BEKEN | BK3260 |
| Intel | Atom x5-Z8350 |
| MTK | MT7668, MT7658, MT8167A, MT8765V, MT7688AN, MT8516, MT2601 |
| Rockchip | RK3308, RK3229, RK3326, OS1000RK |
| iFlytek | CSK4002 |