Creating Your Own Smart Assistant 'JARVIS' with ESP32 and ChatGPT

Yesterday, I shared the design plan for the smart assistant ‘JARVIS’. What did everyone think after reading it? If the plan were actually built, how well would it work?

Today, I am excited to present the project showcase. Let’s see how to create our own smart voice assistant!

The phase two project report has been uploaded to the Electronic Forest:

https://www.eetree.cn/project/detail/2026

Project Introduction

This project uses a PC as the main computing platform. The program is portable and can run directly on a Raspberry Pi or another Linux board, which would make the system wearable. An ESP32-S2 network camera module provides the visual input for the whole system. The hardware goal is an ESP32-S2 vision development board with an onboard camera and display and some room for expansion.

Workflow:

First, speech_recognition records the voice input; once recording finishes, the audio is passed to the Whisper model for speech recognition.
Meanwhile, as soon as voice input is complete, the PC captures the latest frame from the ESP32-S2 video stream and runs an object detection network on it, such as YOLO or RetinaNet (which is built on an FPN).
The image recognition and speech recognition results are then merged: a prepared GPT prompt combines all of the information into one piece of text, which is sent to ChatGPT.
Finally, Edge-TTS reads out the reply returned by ChatGPT, which gives voice interaction with a ChatGPT that can see.
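
The four steps above can be sketched as one conversational turn. This is only an outline under stated assumptions: the stage names (`listen`, `grab_frame`, `detect`, `build_prompt`, `ask`, `speak`) are hypothetical placeholders for the real Whisper, camera, detection, chatbot, and Edge-TTS calls, and are not taken from the project code.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Stages:
    """Hypothetical stage callables; the real project plugs in Whisper, ImageAI, etc."""
    listen: Callable[[], str]                      # speech -> recognized text
    grab_frame: Callable[[], Any]                  # latest frame from the ESP32-S2 stream
    detect: Callable[[Any], List[str]]             # frame -> object labels
    build_prompt: Callable[[str, List[str]], str]  # question + labels -> prompt text
    ask: Callable[[str], str]                      # prompt -> chatbot reply
    speak: Callable[[str], None]                   # reply -> audio playback


def run_turn(stages: Stages) -> str:
    """Run one question/answer turn in the order described in the workflow."""
    question = stages.listen()                     # step 1: voice input and recognition
    frame = stages.grab_frame()                    # step 2: capture once voice input ends
    labels = stages.detect(frame)
    prompt = stages.build_prompt(question, labels) # step 3: merge both sources into text
    reply = stages.ask(prompt)
    stages.speak(reply)                            # step 4: read the reply aloud
    return reply
```

Passing the stages in as plain callables keeps the loop independent of any one library, so the detector or the chatbot backend can be swapped without touching the turn logic.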

Solution Block Diagram and Schematic Introduction

The project block diagram is described in detail in the phase one project report. Interested readers can find the phase one project here:

https://www.eetree.cn/project/detail/1969

Project Schematic

The hardware design consists of six parts: the ESP32-S2 main controller, the camera interface, the flash LED driver that provides fill light, the display, the serial connection, and the ESP32-S2's native USB interface.

The circuit design is not covered in more detail here. If you want to know more, click “Read Original” at the end to view the complete project report.

PCB Drawing and Board Production

As a portable development module, the circuit board is designed to be very compact to minimize its size, which makes routing considerably harder. The onboard camera is the hardest part: it has many signal lines, and to maintain signal quality the traces cannot repeatedly pass through vias to change layers. The final result is shown below. No trace changes layers more than once, and the USB differential pair has been length-matched.

The camera position design supports dual-direction use of the camera. By default, the camera is folded over and attached to the board, so the camera and LCD are on the same side, equivalent to a front-facing camera; if you want to use the rear camera mode, just insert the camera directly as shown in the figure above.

Board Photos

The prototype circuit board is shown in the figure:

Because there are components on both sides of the circuit board, soldering is somewhat challenging. The ESP32-S2-MINI module cannot be soldered with an iron and has to be soldered on a heating station. The board is therefore soldered in two regions: in the region with the ESP32-S2-MINI module, the front is soldered on the heating station and the back with a soldering iron; in the remaining region, the back is soldered on the heating station and the front with a soldering iron.

The board after component soldering:

Back of the Circuit Board

Front of the Circuit Board

Code Design

For the ESP32-S2, I first used CircuitPython for a simple test to quickly check that each part of the hardware works.

The actual firmware uses the Arduino framework. Following the official tutorial, I installed the ESP32 board package in Arduino, then opened the official CameraWebServer example and modified it.

PC-side code. Image acquisition is done with OpenCV, and object recognition with ImageAI. ImageAI's object detection supports three models: RetinaNet (with a ResNet50 backbone), YOLOv3, and Tiny-YOLOv3. I wrapped image acquisition and recognition in a standalone class that can be imported like an external library.
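
A minimal sketch of such a class is shown here; it is not the project's actual code. It assumes the board still serves the CameraWebServer example's default MJPEG stream (port 81, path `/stream`) and that a YOLOv3 weights file has been downloaded. The names `CameraVision`, `summarize`, and the file names are hypothetical. ImageAI's interface has changed between releases, so the calls may need adjusting for the installed version.

```python
from collections import Counter


def summarize(detections):
    """Turn ImageAI-style detection dicts into a short phrase, e.g. '2 person, 1 cup'."""
    counts = Counter(d["name"] for d in detections)
    return ", ".join(f"{n} {name}" for name, n in counts.most_common())


class CameraVision:
    """Grabs one frame from the board's video stream and runs object detection on it."""

    def __init__(self, stream_url, model_path):
        # Third-party imports are deferred so summarize() can be used without them.
        from imageai.Detection import ObjectDetection

        self.stream_url = stream_url
        self.detector = ObjectDetection()
        self.detector.setModelTypeAsYOLOv3()
        self.detector.setModelPath(model_path)
        self.detector.loadModel()

    def grab_frame(self, path="frame.jpg"):
        """Open the stream, read a single frame, and save it as a JPEG file."""
        import cv2

        cap = cv2.VideoCapture(self.stream_url)
        try:
            ok, frame = cap.read()
        finally:
            cap.release()
        if not ok:
            raise RuntimeError(f"could not read a frame from {self.stream_url}")
        cv2.imwrite(path, frame)
        return path

    def describe(self, min_probability=40):
        """Return a text summary of the objects detected in a fresh frame."""
        path = self.grab_frame()
        detections = self.detector.detectObjectsFromImage(
            input_image=path,
            output_image_path="frame_detected.jpg",
            minimum_percentage_probability=min_probability,
        )
        return summarize(detections)
```

Usage would look like `CameraVision("http://192.168.1.50:81/stream", "yolov3.pt").describe()`, where the IP address and model file name are placeholders. Reopening the stream for every question is slow but avoids reading a stale buffered frame.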

Voice input. Recording uses the speech_recognition library, and recognition uses Whisper. Once recording finishes and speech recognition starts, we know a question has been asked, so image recognition starts at the same moment and runs in parallel.
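
One way to run the two recognizers side by side is a small thread pool, sketched below. The function names are hypothetical and the Whisper model size `"base"` is an assumption; `recognize_whisper` is provided by recent releases of the speech_recognition package and needs the Whisper package installed.

```python
from concurrent.futures import ThreadPoolExecutor


def record_question():
    """Record one utterance from the default microphone."""
    import speech_recognition as sr  # deferred: needs a microphone backend

    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source, duration=0.5)
        audio = recognizer.listen(source)
    return recognizer, audio


def transcribe(recognizer, audio, model="base"):
    """Run the recorded audio through a local Whisper model."""
    return recognizer.recognize_whisper(audio, model=model).strip()


def recognize_in_parallel(transcribe_fn, detect_fn):
    """Run speech recognition and image recognition at the same time.

    Both arguments are zero-argument callables; returns (text, detections).
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        speech = pool.submit(transcribe_fn)
        vision = pool.submit(detect_fn)
        return speech.result(), vision.result()
```

In use, the recording is made first and both jobs are then started together, for example `recognize_in_parallel(lambda: transcribe(recognizer, audio), detect_objects)`, where `detect_objects` stands for whatever function performs the image recognition.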

Main loop. The speech and object recognition results obtained above are both plain text by this point. The next step is to merge them with a suitable prompt and send the result to the chatbot. Rather than the official ChatGPT API, I used poe_api_wrapper, an unofficial wrapper around Poe, which requires a token taken from your browser cookie for the Poe website. The bot used is Acouchy.
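
The merging step might look like the sketch below. The wording of the instructions and the names `SYSTEM_HINT` and `build_prompt` are invented for illustration; the project's real prompt is in the full report.

```python
# Illustrative instruction text only; not the prompt used in the project.
SYSTEM_HINT = (
    "You are a voice assistant with a camera. "
    "Answer briefly, in a form suitable for being read aloud."
)


def build_prompt(question, objects):
    """Merge the recognized question and the detected object names into one prompt."""
    if objects:
        scene = "The camera currently sees: " + ", ".join(objects) + "."
    else:
        scene = "The camera did not recognize any objects."
    return f"{SYSTEM_HINT}\n{scene}\nUser question: {question.strip()}"
```

Stating the empty case explicitly tells the model that nothing was detected, so it does not have to guess at a missing scene description.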

Once the reply comes back, it is converted to speech with TTS and played. Here, I used Edge-TTS.
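
A minimal sketch of the synthesis step with the edge-tts package follows. The voice name, the output file name, and the `clean_for_speech` helper are assumptions added for illustration. edge-tts only writes an audio file, so playback through a media player is left out here.

```python
import asyncio
import re

DEFAULT_VOICE = "en-US-AriaNeural"  # assumption: any voice offered by the service works


def clean_for_speech(text):
    """Strip markdown symbols that a TTS voice would otherwise read aloud."""
    text = re.sub(r"[*_`#]+", "", text)
    return " ".join(text.split())


async def synthesize(text, out_path="reply.mp3", voice=DEFAULT_VOICE):
    """Ask the Edge TTS service to render the text and save it as an MP3 file."""
    import edge_tts  # deferred: third-party package, needs network access

    communicate = edge_tts.Communicate(clean_for_speech(text), voice)
    await communicate.save(out_path)
    return out_path


def speak_to_file(text, out_path="reply.mp3"):
    """Synchronous wrapper so the main loop does not have to be async."""
    return asyncio.run(synthesize(text, out_path))
```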

The project has been open-sourced on the Electronic Forest project website, and you can click on “Read Original” to view the complete project report for specific code details.

Function Demonstration

Flash fill-light demonstration

Object recognition

The complete demonstration can be seen in the project video.

Reflections

This is one of the more complex projects I have done. The hardware is more involved than a conventional development board design: with a camera on board, keeping the layout compact makes routing quite difficult. The software has to make speech recognition, ChatGPT, OpenCV, a CNN, and TTS work together, which brings challenges of its own. After completing this project, I feel that my abilities have improved greatly, and I understand much better how many of these features are implemented.

To really appreciate the details, you have to read the project report. The project has been open-sourced and published on the Electronic Forest, so what are you waiting for? Click “Read Original” at the end to explore!

If you also have great design ideas, don’t hesitate!

The FastBond event is in full swing. The phase 1 tasks are easy: a block diagram introduction video and a project report will earn you a 100 yuan JD card. You can complete them in one night!

Click the link below to participate:

https://www.eetree.cn/page/digikey-fastbond

Completing the task can also make you an event ambassador. Hurry up and invite like-minded friends to participate in the event and earn promotional rewards!

Image source: Internet, if there is any infringement, please contact for deletion


Hardhe Academy

The Hardhe team is committed to providing standardized core skill courses for electronic engineers and related students, helping everyone effectively improve their professional abilities at all stages of learning and work.
