TinyML: Edge Voice Recognition Technology

1. Introduction
Voice recognition, or Automatic Speech Recognition (ASR), converts human speech into text. Voice assistants such as “Hey Siri” and “Hi Alexa” are familiar applications: through them, users can control home appliances such as air conditioners, TVs, curtains, and lights directly by voice, making device control more convenient and natural.

Currently, most mainstream smart voice solutions on the market combine online voice recognition with platform content to create a rich smart home ecosystem. However, the connected approach also brings uncertainties: privacy risks, network latency, interaction speed, and connection stability. Offline, localized voice control on the edge has therefore become another choice for users.

2. Edge Voice Recognition
2.1 Overview

Modern voice recognition can be traced back to 1952, when Davis and colleagues built the world’s first experimental system capable of recognizing the 10 English digits, marking the start of the field. Over more than 70 years of development, the technology can, from a technical perspective, be divided into three broad stages:

(Figure: the three technical development stages of voice recognition)

In terms of the scope of recognition, voice recognition can be divided into “open domain recognition” and “closed domain recognition”:

| Open Domain Recognition | Closed Domain Recognition |
| --- | --- |
| No pre-set recognition word set required | Requires a pre-set, limited word set |
| Generally larger models with high engine computation | Recognition engine needs fewer hardware resources |
| Relies on cloud-based “online” recognition | “Offline” recognition deployed on embedded devices |
| Mainly aimed at multi-turn dialogue interactions | Mainly for simple device-control scenarios |

Edge voice recognition falls under the category of closed domain recognition.

2.2 Principle

The workflow of voice recognition technology is shown in the following diagram:

(Figure: voice recognition workflow)

  • The device microphone captures the raw speech, and an ADC converts the analog signal into a digital signal.

  • The acoustic front-end performs echo cancellation, noise suppression, and voice activity detection to strip out non-speech signals, then extracts MFCC features from the remaining speech. For detailed steps, refer to: 【TinyML】Tflite-micro implements offline command recognition on ESP32

  • The back-end takes the voice features as input to the acoustic model for inference, combines the result with the language model to score each recognition candidate, looks up the command set by highest score, and outputs the recognition result.
    The back-end step can be understood as decoding the feature vector into text, which passes through two models (a toy scoring sketch follows this list):
    Acoustic Model: models the sounds themselves, converting the voice input into an acoustic representation; more precisely, it gives the probability that a stretch of speech corresponds to a given acoustic symbol. In English this symbol can be a syllable or a smaller phoneme; in Chinese it can be a tonal syllable or, likewise, a phoneme.
    Language Model: resolves homophones by finding the most probable word sequence among the candidates once the acoustic model has produced a pronunciation sequence. It also constrains and re-scores the acoustic decoding so the final result conforms to grammatical rules. The most common are N-gram and RNN-based language models.
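
To make the two-model back end concrete, here is a toy scoring sketch in Python. Everything in it is made up for illustration: the candidate words, the acoustic log-scores, the bigram table, and the `lm_weight` parameter; a real decoder would search a lattice rather than enumerate every path.

```python
import math
from itertools import product

# Toy acoustic scores: log-probabilities of each candidate word per position,
# as an acoustic model might emit for homophones ("light" vs "lite", etc.).
acoustic = [
    {"turn": math.log(0.6), "tern": math.log(0.4)},
    {"on": math.log(0.5), "own": math.log(0.5)},
    {"light": math.log(0.7), "lite": math.log(0.3)},
]

# Toy bigram language model: log P(word | previous word).
# Unseen bigrams fall back to a small floor probability.
bigram = {
    ("<s>", "turn"): math.log(0.3),
    ("turn", "on"): math.log(0.4),
    ("on", "light"): math.log(0.2),
}
FLOOR = math.log(1e-4)

def path_score(words, lm_weight=1.0):
    """Combined acoustic + weighted language-model log score of one path."""
    score, prev = 0.0, "<s>"
    for pos, w in enumerate(words):
        score += acoustic[pos][w]
        score += lm_weight * bigram.get((prev, w), FLOOR)
        prev = w
    return score

# Exhaustively score every homophone combination and keep the best one.
candidates = product(*(frame.keys() for frame in acoustic))
best = max(candidates, key=path_score)
print(" ".join(best))  # -> "turn on light"
```

The language model’s role shows up in the bigram table: paths built from homophones like “tern” or “lite” only ever receive the floor probability, so the grammatically sensible word sequence wins even when the acoustic scores are close.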

2.3 Trends

Currently, open-domain voice recognition is moving toward large generative AI models, while edge voice recognition research focuses mainly on three directions, summarized as smaller models, faster responses, and stronger robustness:
  • Model Optimization and Compression: optimization algorithms such as hyperdimensional computing, memory-swapping mechanisms, and constrained neural architecture search reduce model size and computational demands through techniques like quantization, pruning, and knowledge distillation while maintaining high accuracy (see the quantization sketch after this list).

  • Low-Latency Real-time Processing: low-latency acoustic feature extraction, improved model inference methods, and streaming recognition techniques maintain high accuracy while meeting real-time requirements.

  • Edge Acoustic Environment Adaptation: noise suppression algorithms, data augmentation so acoustic models train well across varied noise environments, and multi-channel audio processing.
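
As a concrete instance of the compression techniques in the first point, below is a minimal post-training int8 quantization sketch using TensorFlow Lite. The SavedModel path, input shape, and representative-data generator are placeholders; real code would feed genuine MFCC batches for calibration.

```python
import numpy as np
import tensorflow as tf

# Placeholder: path to a trained keyword-spotting SavedModel.
converter = tf.lite.TFLiteConverter.from_saved_model("kws_saved_model")

# Ask the converter to quantize weights and activations to int8.
converter.optimizations = [tf.lite.Optimize.DEFAULT]

def representative_data():
    # Placeholder: yield ~100 real MFCC feature batches so the converter
    # can calibrate activation ranges. Shape must match the model input.
    for _ in range(100):
        yield [np.random.rand(1, 49, 10, 1).astype(np.float32)]

converter.representative_dataset = representative_data
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
with open("kws_int8.tflite", "wb") as f:
    f.write(tflite_model)
print(f"quantized model size: {len(tflite_model)} bytes")
```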

2.4 Open Source Tools and Datasets

  • Open Source Tools

| Tool | Description | Language | Link |
| --- | --- | --- | --- |
| Kaldi | A powerful speech recognition toolkit supporting a wide range of acoustic modeling and decoding techniques, with tools and libraries for acoustic model training, feature extraction, decoding, and more. | C++, Shell, Python | https://github.com/kaldi-asr/kaldi |
| vosk-api | An offline, open-source speech recognition tool that recognizes 16 languages, including Chinese. | Python | https://github.com/alphacep/vosk-api |
| PocketSphinx | A small speech recognition engine developed by Carnegie Mellon University, suited to embedded systems and mobile applications. | Python, C/C++ | https://github.com/cmusphinx/pocketsphinx |
| DeepSpeech | An open-source speech recognition engine developed by Mozilla that uses RNNs and CNNs to process acoustic features, with pre-trained models available. | Python | https://github.com/mozilla/DeepSpeech |
| Julius | An open-source large-vocabulary continuous speech recognition engine supporting multiple languages and models. | C/C++ | https://github.com/julius-speech/julius |
| HTK | A toolkit for building hidden Markov models (HMMs). | C | https://htk.eng.cam.ac.uk/ |
| ESPnet | An end-to-end speech processing toolkit covering speech recognition, synthesis, and other tasks. | Python, Shell | https://github.com/espnet/espnet |
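
As a quick taste of one of these tools, the snippet below follows vosk-api's documented usage pattern to transcribe a WAV file fully offline. The model directory and audio file names are placeholders; Vosk expects an unpacked model folder downloaded from its site and mono PCM audio.

```python
import wave
from vosk import Model, KaldiRecognizer

# Placeholder paths: an unpacked Vosk model directory and a 16 kHz mono WAV.
model = Model("vosk-model-small-en-us-0.15")
wf = wave.open("command.wav", "rb")

rec = KaldiRecognizer(model, wf.getframerate())
while True:
    data = wf.readframes(4000)   # feed audio in small chunks
    if len(data) == 0:
        break
    rec.AcceptWaveform(data)
print(rec.FinalResult())         # JSON string with the recognized text
```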

  • Open Source Inference Frameworks

| Framework | Description | Language |
| --- | --- | --- |
| TensorFlow Lite | A lightweight machine learning inference framework developed by Google for mobile devices, embedded systems, and edge devices. | C++, Python |
| uTVM | A branch of TVM focused on low-latency, efficient deep learning model inference on embedded systems, edge devices, and IoT devices. | C++, Python |
| Edge Impulse | An end-to-end platform for developing and deploying machine learning models on IoT devices, with rich sensor and data integration, model development tools, deployment, and inference. | C/C++ |
| NCNN | An optimized neural network computing library focused on high-performance, efficient inference on resource-constrained devices. | C++, Python |
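
To show what inference through the first framework in this table looks like, here is a minimal sketch using TensorFlow Lite's Python interpreter. The model file name and the zero-filled input are placeholders; on-device deployments would use the C++ runtime or tflite-micro instead.

```python
import numpy as np
import tensorflow as tf

# Placeholder: any converted .tflite model, e.g. a keyword-spotting net.
interpreter = tf.lite.Interpreter(model_path="kws_int8.tflite")
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# Dummy input with the right shape and dtype; real code would pass
# MFCC features extracted from microphone audio.
dummy = np.zeros(inp["shape"], dtype=inp["dtype"])
interpreter.set_tensor(inp["index"], dummy)
interpreter.invoke()

scores = interpreter.get_tensor(out["index"])[0]
print("predicted command index:", int(np.argmax(scores)))
```
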
  • Open Source Databases

| Database | Description | Link |
| --- | --- | --- |
| TIMIT | A widely used speech recognition research corpus of read sentences from American English speakers of varied accents, genders, and ages, used to train and test recognition systems. | https://catalog.ldc.upenn.edu/LDC93s1 |
| LibriSpeech | A large speech recognition corpus of audio and text from public-domain English readings. | http://www.openslr.org/94/ |
| Speech Ocean | Provides multilingual, cross-domain, and cross-modal AI data and related data services to the industry. | https://www.speechocean.com/dsvoice/catid-52.htm |
| DataTang | A Chinese AI data service company offering training datasets, data collection, and annotation customization. | https://www.datatang.com/ |

More open-source speech databases: http://www.openslr.org/resources.php
3. Technical Solutions
3.1 Based on Voice Recognition Chip Modules
Voice recognition chips usually build in the DSP instruction sets needed for signal processing and recognition, FPUs for floating-point math, and FFT accelerators, and run neural networks on the audio signal to improve recognition. The table below lists some suppliers of voice recognition chip modules.
| Supplier | Chip Modules | Wake-up and Command Customization | Iteration Method |
| --- | --- | --- | --- |
| Qiying Tailun (Chipintelli) | CI120, CI130, CI230 series | https://aiplatform.chipintelli.com/home/index.html | Serial port burning, OTA |
| Hailin Technology | HLK-v20 | http://voice.hlktech.com/yunSound/public/toWebLogin | Serial port burning |
| Anxinke (Ai-Thinker) | VC series modules | https://udp.hivoice.cn/solutionai | Serial port burning |
| Jixin Intelligent | SU-03T, SU-1X series, SU-3X series | https://udp.hivoice.cn/solutionai | Serial port burning |
| Weichuang Zhiyin | WTK6900 | Offline customization service | — |
| Jiuzhou Electronics | NRK330x series | Offline customization service | — |

Voice chip modules usually require no external circuits: they work once connected to power, a microphone, and a speaker.


Taking the HLK-v20 as an example, its default firmware ships with a set of default wake-up words and command entries. After recognizing a voice command, it outputs the command over its serial port in a fixed protocol format, so users can attach a main control chip for wireless communication and other functions as needed, as sketched below. Wake-up words and command words can also be customized through Hailin Technology’s voice product management system: after the command words are configured, a firmware SDK is generated online, which users burn to the chip module to update it.
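
To illustrate how a host controller might consume such serial output, here is a hedged pyserial sketch. The frame layout (0xAA 0x55 header, command byte, checksum) and the command table are entirely hypothetical, not the actual HLK-v20 protocol; consult the vendor documentation for the real format.

```python
import serial

# Placeholder port/baud; check the module's datasheet for real values.
ser = serial.Serial("/dev/ttyUSB0", 115200, timeout=1)

# Hypothetical frame: 0xAA 0x55 <command id> <checksum>. NOT the real
# HLK-v20 protocol -- it only illustrates the general pattern.
COMMANDS = {0x01: "light on", 0x02: "light off", 0x03: "ac on"}

def read_frame():
    """Scan the byte stream for one hypothetical 4-byte frame."""
    while True:
        if ser.read(1) != b"\xaa":
            continue
        if ser.read(1) != b"\x55":
            continue
        body = ser.read(2)
        if len(body) < 2:
            continue
        cmd, checksum = body[0], body[1]
        if checksum == (0xAA + 0x55 + cmd) & 0xFF:  # toy checksum rule
            return cmd

while True:
    cmd = read_frame()
    print("recognized:", COMMANDS.get(cmd, f"unknown 0x{cmd:02x}"))
```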
3.2 Based on Chip Manufacturer Development Frameworks
Chip manufacturers provide their own AI application development frameworks, with voice recognition models deeply optimized for their chip architectures, which greatly improves inference speed.
| Manufacturer | Framework | Link |
| --- | --- | --- |
| Espressif (ESP32) | esp-adf audio development framework, built on the ESP-IDF infrastructure and the esp-sr speech recognition algorithm library | https://github.com/espressif/esp-adf/tree/master |
| ST (STM32) | STM32Cube.AI, an end-to-end solution for quickly deploying AI models on STM32 microcontrollers | https://stm32ai.st.com/stm32-cube-ai/ |
| Silicon Labs | MLTK toolset | https://siliconlabs.github.io/mltk/ |

Taking Espressif’s ESP-ADF framework as an example: it is a set of audio application components built on top of ESP-IDF, with the ESP-SR speech recognition algorithm library at its core, plus various codec drivers and transport protocols, which makes audio and video application development much easier.

(Figure: ESP-ADF hardware framework)

The hardware framework is shown in the figure above. Because neural network inference is required, models such as LSTMs or Seq2Seq networks are generally introduced to keep recognition accurate; the resulting model files are fairly large and need a certain amount of memory at runtime, so external flash or an SD card is usually required.

3.3 Based on Open Source Frameworks
There are now many open-source neural network frameworks that cover both training voice recognition models and deploying them industrially, such as the very popular TensorFlow Lite and the TVM neural network compiler, which supports automatic optimization of board-level operators.
(Figure: development and deployment process based on open-source frameworks)

The biggest advantage of neural-network-based open-source frameworks is full control over the whole process, from model training to deployment; the downside is that the pipeline is longer, and designing and tuning the network model is a significant challenge. The main process is shown in the figure above, and detailed steps can be found in 【TinyML】Tflite-micro implements offline command recognition on ESP32; a sketch of the final conversion step follows.
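
A minimal sketch of the tail end of that process under TensorFlow Lite: define a deliberately tiny convolutional keyword-spotting model over MFCC features, convert it, and dump it as a C array for tflite-micro, equivalent to running `xxd -i` on the .tflite file. The shapes, label count, and commented-out training call are placeholders; real code would first train on a labeled command dataset.

```python
import tensorflow as tf

# Placeholder shapes: 49 MFCC frames x 10 coefficients, 4 command labels.
NUM_FRAMES, NUM_MFCC, NUM_LABELS = 49, 10, 4

# A deliberately tiny CNN, sized for a microcontroller.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(NUM_FRAMES, NUM_MFCC, 1)),
    tf.keras.layers.Conv2D(8, (3, 3), activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(NUM_LABELS, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# model.fit(train_features, train_labels, epochs=10)  # train on real data

# Convert to a flat TFLite buffer...
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()

# ...and emit it as a C array that tflite-micro can link against.
with open("kws_model_data.h", "w") as f:
    f.write("const unsigned char g_kws_model[] = {\n")
    f.write(",".join(str(b) for b in tflite_model))
    f.write("\n};\n")
    f.write(f"const unsigned int g_kws_model_len = {len(tflite_model)};\n")
```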

3.4 Based on Neural Network Chips

Neural network chips usually offer high compute capable of meeting complex functional requirements, but they are also more expensive; they are mainly used in smart speakers and intelligent voice control scenarios.

| Manufacturer | Models |
| --- | --- |
| Allwinner Technology | R328, R58, R16, H6, F1C600 |
| Amlogic | A113X, A112, S905D |
| BEKEN | BK3260 |
| Intel | Atom x5-Z8350 |
| MTK | MT7668, MT7658, MT8167A, MT8765V, MT7688AN, MT8516, MT2601 |
| Rockchip | RK3308, RK3229, RK3326, OS1000RK |
| iFlytek | CSK4002 |

4. Conclusion
This article introduced the basic concepts and working principles of voice recognition technology, and described four technical solutions for implementing voice recognition on the edge.

Previous Recommendations

【TinyML】Tflite-micro implements offline command recognition on ESP32
【TinyML】Introduction to TinyML
