Empowering Hardware with AI: Creating Personalized Smart Toys for Children

Developer Huang Dinghua shares his journey participating in the “Qianfan AppBuilder – Smart Hardware AIOT Creative Competition Phase 1”.


Introduction

Participants are asked to build an application that delivers practical value on a smart hardware terminal product, based on one of the following five themes, using AppBuilder's built-in tool components, custom components, knowledge bases, databases, and other capabilities.
  • Children’s Companion: AI agents that provide entertaining companionship for children; common scenarios include story reading, a knowledge encyclopedia, and character companionship.

  • Learning and Education: AI agents for learning and education, mainly aimed at preschool and K-12 students; common scenarios include professional spoken-English practice, an AI sports teacher, math problem solving, vocabulary memorization, and Chinese character learning.

  • Entertainment and Interaction: entertainment and interaction AI agents for users of all ages; common scenarios include comic avatar generation, travel planning, role-playing, game strategy assistants, and pet emotion recognition.

  • Health and Wellness for the Elderly: health and wellness AI agents for the elderly; common scenarios include health Q&A, an elderly health diet assistant, and healthy-lifestyle reminders.

  • Health Monitoring: health monitoring AI agents; specific scenarios include AI tongue diagnosis, a family AI doctor, and AI companionship for weight loss.


Background

As AI and IoT technologies continue to advance and become more deeply integrated into real applications, both the smart home and education fields are moving toward a more intelligent and efficient future, bringing unprecedented convenience and possibilities to people’s lives.
Against this competition background, I chose the Children’s Companion category as my theme: I mainly want to tell my three-year-old some beautiful, universally meaningful fairy tales, drawn primarily from classics such as Grimm’s Fairy Tales, Andersen’s Fairy Tales, and One Thousand and One Nights. The focus is on companionship.


Development Board Introduction

In terms of hardware, there are many options: Raspberry Pi, Jetson, Arduino, and so on. I have also played with OpenWRT firmware for a while, so I wasn’t starting completely from scratch. With the recent AI boom, I felt the x86-based Nezha board was the most timely choice.
1. Development Board Information:
Nezha, as the name suggests, is very small yet powerful. Detailed information is as follows, with pictures as proof:
(Image: Nezha board specification details)
2. Unboxing Photos:
(Image: unboxing photo of the Nezha board)
The red one is the star of the show, Nezha, with three USB 3.2 ports, an HDMI port, a network port (on the left), and a 40-pin GPIO header with a serial port. The vendor also thoughtfully included a Ugreen dual-band wireless network card in the box.


Project Introduction

1. Overall Concept:
Product Introduction:
Micro Bunny is an early education robot for infants and toddlers, with a cute design and a read-aloud function. It supports voice interaction and simple image recognition. Children can enjoy fun story time by conversing with the cute character in the application. The product can also recognize images placed at specific locations and use their content for scene-based teaching, interacting with and encouraging children through its built-in child-friendly voice, which makes learning more vivid and interesting.
Micro Bunny is not just a simple toy; it is an innovative early education tool designed to help children develop key cognitive skills and social interaction abilities in a fun way. It is an indispensable companion for parents eager to accompany their children’s growth and for toddlers eager to explore the world.
2. Solution Architecture:
2.1 Software Part
(Image: software solution architecture diagram)
  1. For the ASR part, consider deploying the local Whisper model or using an API model.

  2. For the large model part, local model deployment can fully utilize the inference platform capabilities of OpenVINO; it can also use an API model.

  3. For the TTS part, consider deploying local TTS synthesis or using an API model. (A rough sketch of how these three stages chain together follows this list.)
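
Putting the three stages together, the conversation loop on the board boils down to something like the sketch below. The function names are only placeholders for whichever local or API implementation ends up being used; this illustrates the flow, not the final code.

def recognize_speech() -> str:
    """Capture one utterance and return its transcript (local Whisper or a cloud ASR API)."""
    raise NotImplementedError

def query_llm(text: str) -> str:
    """Send the transcript to the large model (local OpenVINO model or an API such as AppBuilder)."""
    raise NotImplementedError

def speak(text: str) -> None:
    """Synthesize the reply with local or cloud TTS and play it back."""
    raise NotImplementedError

if __name__ == "__main__":
    # ASR -> LLM -> TTS, repeated until interrupted
    try:
        while True:
            heard = recognize_speech()
            if heard:
                speak(query_llm(heard))
    except KeyboardInterrupt:
        print("Stopping.")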

2.2 Hardware Part
Previously, I had a spare USB microphone for gaming, plus a wired USB speaker bought online (because the Nezha board has no audio interface):
(Image: USB microphone and USB speaker)
Later, I found that there are all-in-one USB speaker-and-microphone devices, which would save a USB port and made my two-piece setup feel a bit awkward.


MVP (Minimum Viable Product)

1. Software Development Environment

Nezha naturally needs its big red base, and once you boot in it’s the familiar recipe and familiar taste: Windows 11. I was forced to download a KMS activator to activate it so I could debug the microphone and speaker settings.
To be honest, IoT work is better suited to Linux. Personally, I really like WSL (lol), which makes it easy to run Docker under Windows; I feel like many things simply wouldn’t work without Docker. Honestly, this system is quite good, configured with 8 GB of memory and 64 GB of storage, much better than I expected.
Of course, there is VSCode here, which is quite friendly for Python users. Although there’s no PyCharm, it’s manageable.

2. ASR Part

2.1 Local Library for Whisper
import pyaudio
import numpy as np
import whisper

# Initialize Whisper model
model = whisper.load_model("base")  # Options: "tiny", "base", "small", "medium", "large"

# PyAudio settings
chunk = 1024  # Size of audio block
format = pyaudio.paInt16  # Audio format
channels = 1  # Mono channel
rate = 16000  # Sample rate

p = pyaudio.PyAudio()

# Open audio stream
stream = p.open(format=format,
                channels=channels,
                rate=rate,
                input=True,
                frames_per_buffer=chunk)

print("Starting real-time speech recognition...")

try:
    while True:
        # Read audio data
        data = stream.read(chunk)
        audio_data = np.frombuffer(data, dtype=np.int16)

        # Convert audio data to the format required by Whisper
        audio_float = audio_data.astype(np.float32) / 32768.0  # Normalize to [-1, 1]
        audio_float = np.pad(audio_float, (0, 16000 - len(audio_float)), 'constant')  # Pad to 1 second (16000 samples)

        # Perform speech recognition
        result = model.transcribe(audio_float, fp16=False)  # fp16=False to avoid issues on some hardware
        print("Recognition result:", result['text'])

except KeyboardInterrupt:
    print("Stopping speech recognition.")

finally:
    # Close audio stream
    stream.stop_stream()
    stream.close()
    p.terminate()
Whisper can be deployed locally, and here I used the base model, which is very small, but the results feel a bit lacking. For demonstration purposes, I still went with the API approach.
2.2 ASR Part – Azure API
Of course, the subscription key and region are up to you; according to the documentation, there are only 5 free hours per month, and I’m not sure whether heavy testing will rack up charges. 😭
As for Baidu’s speech API, I’ll consider it next time; this competition did not require using Baidu TTS~

import time
import azure.cognitiveservices.speech as speechsdk

# This example requires environment variables named "SPEECH_KEY" and "SPEECH_REGION"
speech_config = speechsdk.SpeechConfig(subscription='xxx', region='eastasia')
speech_config.speech_recognition_language = "zh-CN"

audio_config = speechsdk.audio.AudioConfig(use_default_microphone=True)
speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

print("Speak into your microphone.")

done = False

def stop_cb(evt):
    print('CLOSING on {}'.format(evt))
    speech_recognizer.stop_continuous_recognition()
    global done
    done = True

def sent_to_model(text):
    # Forward the recognized text to the large model, then speak the reply
    response = get_response(text)
    if response:
        print("model response:", response)
        text_to_speech(response)

# speech_recognizer.recognizing.connect(lambda evt: print('RECOGNIZING: {}'.format(evt)))
speech_recognizer.recognized.connect(lambda evt: print('RECOGNIZED: {}'.format(evt)) or sent_to_model(evt.result.text))
speech_recognizer.session_started.connect(lambda evt: print('SESSION STARTED: {}'.format(evt)))
speech_recognizer.session_stopped.connect(lambda evt: print('SESSION STOPPED {}'.format(evt)))
speech_recognizer.canceled.connect(lambda evt: print('CANCELED {}'.format(evt)))

speech_recognizer.session_stopped.connect(stop_cb)
speech_recognizer.canceled.connect(stop_cb)

speech_recognizer.start_continuous_recognition()
while not done:
    time.sleep(.5)
There are actually many API options out there; it’s worth trying a few.

2.3 LLM Part

The competition requires uploading to AppBuilder, so let’s go!
import appbuilder
import os

os.environ['APPBUILDER_TOKEN'] = "xxxxxxxxxxxxxxxxxxxxxxxxxxx"
app_id = "acf19b27-1019-45fb-b163-a454d31ef014"

def agent_query(query: str):
    # Initialize Agent instance
    agent = appbuilder.AppBuilderClient(app_id)
    # Create conversation ID
    conversation_id = agent.create_conversation()
    print("Your AppBuilder App ID is: {}".format(app_id))
    print("processing")
    response_message = agent.run(conversation_id=conversation_id, query=query)
    description = response_message.content.answer
    return description

if __name__ == '__main__':
    prompt = 'Tell a story about Cinderella'
    print(agent_query(prompt))
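
The get_response used in the ASR callback earlier is not shown above; a minimal version that simply forwards the recognized text to agent_query would look like this:

def get_response(text):
    # Minimal glue sketch: pass the recognized text to the AppBuilder agent
    # defined above and return its answer for TTS playback.
    if not text:
        return ""
    return agent_query(text)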
For those who like to tinker, you can refer to this guide on deploying Qwen2.5-7B locally with OpenVINO:
https://community.modelscope.cn/66f10c6b2db35d1195eed3c8.html
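
As a rough idea of what that involves, below is a minimal sketch using the optimum-intel OpenVINO integration; the model ID and generation settings are just example values, and the linked article has the full walkthrough.

# Minimal sketch: run a Qwen2.5 chat model locally through OpenVINO via optimum-intel.
# Requires: pip install "optimum[openvino]" transformers
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "Qwen/Qwen2.5-7B-Instruct"  # example Hugging Face model ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
# export=True converts the weights to OpenVINO IR on first load
model = OVModelForCausalLM.from_pretrained(model_id, export=True)

messages = [{"role": "user", "content": "Tell a short bedtime story about a little rabbit."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))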

2.4 TTS Part

For the API approach, I am again using Azure here:
# Defined as a plain function so it can be called directly from the recognition callback
def text_to_speech(text):
    speech_config2 = speechsdk.SpeechConfig(subscription='xxx', region='eastasia')
    audio_config2 = speechsdk.audio.AudioOutputConfig(use_default_speaker=True)

    # The neural multilingual voice can speak different languages based on the input text.
    speech_config2.speech_synthesis_voice_name = 'zh-CN-XiaoyiNeural'
    speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config2, audio_config=audio_config2)

    # Synthesize the reply text to the default speaker.
    print("tts=>>>", text)
    speech_synthesis_result = speech_synthesizer.speak_text_async(text).get()

    if speech_synthesis_result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
        print("Speech synthesized for text [{}]".format(text))
    elif speech_synthesis_result.reason == speechsdk.ResultReason.Canceled:
        cancellation_details = speech_synthesis_result.cancellation_details
        print("Speech synthesis canceled: {}".format(cancellation_details.reason))
        if cancellation_details.reason == speechsdk.CancellationReason.Error:
            if cancellation_details.error_details:
                print("Error details: {}".format(cancellation_details.error_details))
                print("Did you set the speech resource key and region values?")
There is actually a small outstanding issue: sometimes the synthesized speech gets picked up directly by the ASR. I tried many ways of toggling ASR and TTS “active” flags so that nothing is captured by ASR while TTS is playing, but without much success. Later I introduced a threading mechanism; early tests suggest it helps quite a bit, but the crosstalk does not seem to be completely resolved.
I even raised an issue with the Microsoft ASR team.
https://github.com/Azure-Samples/cognitive-services-speech-sdk/issues/2608
Properly resolving this probably requires studying echo cancellation, or how the crosstalk arises in the first place; I hope experts can offer some guidance.
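
In the meantime, a minimal sketch of the gating idea with a threading.Event looks something like this: pause the Azure recognizer while synthesis plays and drop any transcript that still slips through while the flag is set. It reuses the speech_recognizer, get_response, and text_to_speech pieces from the snippets above, is only an illustration of the approach rather than my exact code, and does not by itself solve acoustic echo.

import threading

# Gate shared between TTS playback and the ASR callback
tts_playing = threading.Event()

def speak_gated(text):
    """Pause recognition, play TTS, then resume recognition."""
    tts_playing.set()
    speech_recognizer.stop_continuous_recognition()       # stop listening
    try:
        text_to_speech(text)                              # blocking synthesis + playback
    finally:
        speech_recognizer.start_continuous_recognition()  # resume listening
        tts_playing.clear()

def on_recognized(evt):
    """Recognition callback: ignore anything heard while TTS is playing."""
    if tts_playing.is_set() or not evt.result.text:
        return
    reply = get_response(evt.result.text)
    # Speak on a worker thread so the recognition callback returns quickly
    threading.Thread(target=speak_gated, args=(reply,), daemon=True).start()

speech_recognizer.recognized.connect(on_recognized)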
Local TTS deployment:
https://www.cnblogs.com/obullxl/p/18239327/NTopic2024060901


Demonstration

I recorded a demonstration video; the recording quality isn’t great, my son isn’t yet very familiar with the toy, and the program’s settings could also be better, so I hope everyone will bear with it.


Afterword

First of all, I would like to thank Baidu for this first smart hardware competition, and the Qianfan platform for its various resources and APIs. I also thank my family, especially my wife and son, as I have been a bit irritable while debugging lately, perhaps influenced by the Third Prince (Nezha himself), haha.
This can be considered a birthday gift for my son. I hope he can truly grow up.
In the future, this board can still do many things, such as connecting a camera and using multimodal large models for image or video recognition, or even video analysis with YOLO. If I add wheels and servo motors, it could even drive around.
So, thinking about it, there is plenty of room to grow. This was the first time I have been able to combine hardware and large models to build something fun, and I feel the field of AI has limitless potential. Once again, I thank my family for their companionship and support, and thank everyone for reading to the end.
