Offline Voice Control Without Cloud - ESP32 + INMP441 On/Off Recognition

If we were to identify the biggest change in smart voice assistants over the past few years, it would undoubtedly be that an increasing number of “ears” are starting to listen locally instead of relying entirely on the cloud for processing.

This project exemplifies a typical and valuable engineering task: using an ESP32 and an INMP441 digital microphone, along with a model trained by Edge Impulse, to create a fully offline “mini voice assistant”— just say the wake word “marvin” and then say “on / off” to control an LED with your voice.

Project Overview: ESP32 Voice Assistant Without Cloud

Hardware:

ESP32 (240 MHz) as the main controller
INMP441 digital MEMS microphone for audio capture
Two LEDs: one for voice status indication and one for controlling the “load”

Offline Voice Control Without Cloud - ESP32 + INMP441 On/Off Recognition

Model:

Trained using a subset of the Google Speech Commands V2 dataset
Recognizes four classes:noise / marvin / on / off

Interaction Logic:

Wake word: say “marvin” → system enters “listening mode”
Say “on” or “off” within 10 seconds → control LED to turn on / off
Timeout or command execution → automatically exit listening mode

Operation Mode:

Fully offline, all inference is done locally on the ESP32
Latency is approximately 200–500 ms, providing a user experience close to that of household voice switches

This solution is very suitable for offline home control, small robot control, local voice operation for industrial equipment, or as a TinyML case for courses or graduation projects.

Main Features

Local voice recognition based on ESP32 + INMP441, fully offline operation, without relying on cloud services.
Model trained using the Edge Impulse platform, with one-click export to Arduino library for deployment on ESP32.
Voice interaction uses a wake word + command design: wake word “marvin”, command words “on / off” to control the LED.
Audio sampling rate of 16 kHz, based on I²S digital microphone capture, with simple front-end hardware.
Low latency, local inference, fast response, suitable for voice control of IoT devices.
Scalable architecture: easily add more custom wake words and commands to expand into more application scenarios.

Hardware Design

Component	Quantity	Function
ESP32 Development Board	1	Main processor for offline voice recognition
INMP441 MEMS Microphone	1	High-precision audio input for ESP32 voice-to-text
LED (Indicator)	1	Visual feedback for wake word detection
LED (Control)	1	Output controlled by voice command
Resistors (220Ω)	2	LED current limiting
Breadboard and Jumper Wires	As needed	Prototype connections

INMP441 → ESP32 Connection (Pins Used in the Project):

INMP441 Pin	ESP32 GPIO	Signal Description
Left/Right	3V3	Channel selection (3.3V = Right channel)
WS (Word Select)	GPIO 25	I²S audio left-right clock
SCK (Serial Clock)	GPIO 26	Bit clock for audio data
SD (Serial Data)	GPIO 33	Audio data input for ESP32
VDD	3V3	Power supply (3.3V)
Ground	Ground	Ground reference

The article also highlights a detail:

The default LRCLK pin used in the Edge Impulse example code is GPIO32; if you connect according to the default pins, you won’t need to modify the example code at all;
The schematic in this project shows GPIO25, but the author notes that you just need to synchronize the I²S configuration accordingly.

For the LED part:

Control LED connected to GPIO22;
Indicator LED connected to GPIO23;
Both are in series with a 220 Ω resistor for current limiting.

The overall wiring is minimal, making it perfect for quick prototyping on a breadboard.

System Architecture

From an architectural perspective, this project can be divided into three layers:

Data and Model Layer (PC + Edge Impulse)

Prepare the dataset (using Google Speech Commands V2)
Perform feature extraction, network training, and model evaluation on Edge Impulse
Finally export as an Arduino library for ESP32 to use

Inference and Control Layer (ESP32)

Run the classifier locally using the inference library generated by Edge Impulse
Capture I²S audio → features → inference → obtain confidence levels for each class
The upper state machine is responsible for the two-phase logic of “wake word + command”

Hardware Interaction Layer (Microphone + LED)

INMP441 provides a digital audio stream sampled at 16 kHz
One LED serves as an indicator, showing different states such as listening, waking, executing, etc.
Another LED acts as the controlled object, simulating an actual load (light, relay, etc.)

The overall idea is to package the AI component as much as possible into Edge Impulse, with the ESP32 only responsible for capturing, invoking, and executing, significantly reducing engineering costs.

Data and Model: Feeding Commands to ESP32 Using Public Datasets

The author did not record a large number of voice samples but instead chose to use established public datasets:

Recommended common voice datasets:

Google Speech Commands (keyword wake words, on/off/yes/no, etc.)
Mozilla Common Voice (multilingual large dataset)
Other voice data collections from platforms (Kaggle, Hugging Face, etc.)

Actual usage in this project:

noise (background noise)
marvin (wake word)
on
off

Classes from Google Speech Commands V2: include:

The general process in Edge Impulse:

1. Register an account, create a new project, select the target device as Espressif ESP-EYE (ESP32 240 MHz) to ensure model budget fits ESP32.

Offline Voice Control Without Cloud - ESP32 + INMP441 On/Off Recognition

2. On the Data Acquisition page, upload the dataset, labeling folders/tags as noise, marvin, on, off.

Offline Voice Control Without Cloud - ESP32 + INMP441 On/Off Recognition

3. In Impulse Design, add audio feature extraction (e.g., MFCC) + neural network classifier.

Offline Voice Control Without Cloud - ESP32 + INMP441 On/Off Recognition

4. Use Autotune to automatically adjust feature parameters, then generate features (you can see the distribution of various classes in feature space).

Offline Voice Control Without Cloud - ESP32 + INMP441 On/Off Recognition

5. Train the model using default network parameters, check accuracy and confusion matrix (the author believes that over 85% is sufficient for prototyping).

Offline Voice Control Without Cloud - ESP32 + INMP441 On/Off Recognition

6. Use Test Model → Classify All to evaluate the performance of the test set, confirm no significant deviations, and export as an Arduino Library.

Offline Voice Control Without Cloud - ESP32 + INMP441 On/Off Recognition

For engineers, this is equivalent to outsourcing “data + model training” to the platform, allowing them to focus on ensuring clean labels and sufficient data volume.

Code Structure: FreeRTOS + I²S + Edge Impulse Inference Library

On the software side, the author did not simply copy the example but created a relatively complete engineering encapsulation:

1. Key Libraries

Edge Impulse Inference Library:

For example, ESP32_STT_Recognition_inferencing.h (your project name will differ)
Encapsulates the model, feature processing, and run_classifier interface

FreeRTOS related header files:

Used to run background collection tasks, achieving non-blocking audio capture

driver/i2s.h:

Configures and drives the I²S interface, responsible for communication with INMP441

Offline Voice Control Without Cloud - ESP32 + INMP441 On/Off Recognition

2. Important Configuration and State Variables

There are several critical parameters in the code:

const float COMMAND_CONFIDENCE_THRESHOLD = 0.80; const float RECOGNITION_CONFIDENCE_THRESHOLD = 0.50;
const unsigned long LISTENING_DURATION_MS = 10000;

const int CONTROL_LED_PIN = 22;   const int INDICATOR_LED_PIN = 23;  
static const uint32_t sample_buffer_size = 2048;

static bool wake_word_detected = false; static unsigned long wake_word_timestamp = 0;static bool listening_mode = false;      static bool led_state = false;         static inference_t inference;       static signed short sampleBuffer[2048];         static bool debug_nn = false;               static bool record_status = true;

Explanation of the logic:

0.5 confidence: used as a prompt for “what was heard”, e.g., blink the indicator light;
0.8 confidence: only above this will the command be executed (to prevent false triggers);
10-second listening window: after the wake word is triggered, the system only accepts on/off commands in the next 10 seconds;
A set of state variables tracks: whether it has been awakened, whether it is in listening mode, the current LED state, wake timestamp, etc.

3. Setup: Initialize Serial, LED, and Audio System

In setup(), the main tasks are three:

Open serial debugging (115200)

Serial.begin(115200);while (!Serial); // optional

Configure the two LED pins as outputs and pull them low

pinMode(CONTROL_LED_PIN, OUTPUT);digitalWrite(CONTROL_LED_PIN, LOW);pinMode(INDICATOR_LED_PIN, OUTPUT);digitalWrite(INDICATOR_LED_PIN, LOW);

Call microphone_inference_start():

Allocate audio buffer
Initialize I²S (16 kHz sampling, DMA buffer)
Start background collection task

if (microphone_inference_start(EI_CLASSIFIER_RAW_SAMPLE_COUNT) == false) {   ei_printf("ERR: Could not allocate audio buffer\n");   Return;}

If the buffer or I²S initialization fails, an error will be printed to the serial port for easy troubleshooting.

4. Loop: Sampling → Inference → Command Processing

The main loop logic can be condensed into four steps:

Capture a frame of audio

bool m = microphone_inference_record();if (!m) {   ei_printf("ERR: Failed to record audio...\n");   return;}

Construct signal and call classifier

signal_t signal;signal.total_length = EI_CLASSIFIER_RAW_SAMPLE_COUNT;signal.get_data = &amp;microphone_audio_signal_get_data;EI_IMPULSE_ERROR r = run_classifier(&amp;signal, &amp;result, debug_nn);

Process recognition results according to business logic

handle_wake_word_and_commands(result);

Print debugging information to serial

Print DSP and inference time
Print confidence levels for each label to observe model performance

ei_printf("Predictions (DSP: %d ms., Classification: %d ms.): \n", ...);for (size_t ix = 0; ix &lt; EI_CLASSIFIER_LABEL_COUNT; ix++) {   ei_printf("    %s: %.2f\n", result.classification[ix].label, ...);}

5. LED “Language”: Visualizing the State Machine

The author has written a set of LED animation functions:

Recognizing the wake word “marvin”: the indicator LED blinks twice
Entering listening mode: the indicator LED stays on
Recognizing “on”: control the LED to turn on + indicator LED blinks to confirm
Recognizing “off”: control the LED to turn off + LED performs a “fade out” effect
Recognizing other words (but confidence does not reach 0.8): the indicator LED blinks once
Listening timeout: the indicator LED turns off, exiting listening mode

show_wake_word_pattern()show_listening_mode()show_turning_on_pattern()show_turning_off_pattern()show_general_recognition_pattern()stop_listening_mode()

This is a highly recommended approach in embedded projects without screens—using lights to convey the state machine.

6. Audio Capture and I²S Driver

There is a dedicated set of functions responsible for audio capture:

The background task continuously reads audio via I²S, writing to the DMA buffer;
Once a frame is full, it copies the data to the inference buffer via a callback;
Performs necessary scaling/conversion on samples, passing them to the Edge Impulse model;
i2s_init() / i2s_deinit() are responsible for configuring the sampling rate (16 kHz), DMA depth, and GPIO pins.

audio_inference_callback(uint32_t n_bytes)capture_samples(void* arg)microphone_inference_start(uint32_t n_samples)microphone_inference_record(void)microphone_audio_signal_get_data(size_t offset, size_t length, float *out_ptr)microphone_inference_end(void)

i2s_init(uint32_t sampling_rate)i2s_deinit(void)

This part is very valuable for future reuse of “ESP32 + I²S microphone”.

Testing Experience: Say “marvin”, then say on/off to light up

According to the author’s description, the overall system experience is roughly as follows:

Flash the model library and project to the ESP32;
Open the serial monitor to see the confidence levels for each inference;
Say “marvin” into the microphone:

The indicator LED blinks twice, indicating successful recognition of the wake word;
The system enters a 10-second listening window (indicator LED stays on);

Then say “on”:

Control the LED to turn on;
The indicator LED blinks once to confirm;

Repeat wake + say “off”: control the LED to turn off.

Open Source Code: https://github.com/Circuit-Digest/ESP32-Voice-Control-Using-Edge-Impulse

Conclusion

This project is actually a complete “ESP32 Voice Recognition from 0 to 1” tutorial:

Upper layer uses Edge Impulse to handle data and models;
Middle layer uses TinyML library for real-time inference on ESP32;
Lower layer connects the audio stream using I²S microphone and FreeRTOS;
Finally, a few LEDs visualize the states and results.

If you happen to have an ESP32 and an INMP441, this project is very suitable as a “first offline voice assistant” to replicate. Start by using it to turn on lights, establish the link, and then connect the backend to the devices you truly want to control, giving your next IoT project a pair of “ears that can understand human speech”.

Note: The above content has been summarized and generated by AI. Feel free to click “Read the original” to replicate it yourself.

Click to read the original for more

Offline Voice Control Without Cloud – ESP32 + INMP441 On/Off Recognition

1. Key Libraries

4. Loop: Sampling → Inference → Command Processing

5. LED “Language”: Visualizing the State Machine

6. Audio Capture and I²S Driver

Open Source Code: https://github.com/Circuit-Digest/ESP32-Voice-Control-Using-Edge-Impulse

Leave a Comment Cancel reply

1. Key Libraries

4. Loop: Sampling → Inference → Command Processing

5. LED “Language”: Visualizing the State Machine

6. Audio Capture and I²S Driver

Open Source Code: https://github.com/Circuit-Digest/ESP32-Voice-Control-Using-Edge-Impulse

Related posts

Leave a Comment Cancel reply