Breakthrough! Elevating AI Telephony Bot Performance

Telephony bots have evolved from early systems based on text matching (including regular expressions) to today’s intelligent dialogue systems built on large models. Along the way, responses have shifted from preset fixed replies to replies generated dynamically by the model, and the key issues affecting dialogue quality have changed as well.

In the previous generation of telephony bots, the main bottleneck was the accuracy of text matching. When a match was wrong or the caller said something unanticipated, the bot often returned an irrelevant answer, which badly hurt the coherence of the dialogue. Large model telephony bots have largely overcome this barrier and can generate more natural and reasonable responses. One problem remains unresolved, however: in a noisy environment, speech recognition easily picks up irrelevant content, which leads to misunderstandings, interruptions in the flow, or answers that deviate from expectations, and so degrades the overall smoothness of the interaction.

This issue needs to be addressed in two scenarios:

The first scenario involves environmental noise on the other end, such as wind, traffic, car horns, or keyboard typing. This can be addressed with a VAD (Voice Activity Detection) model, whose main job is to determine whether a human voice is present. It starts timing a speech segment when a human voice is detected and continues until no voice has been detected for a certain period. In previous articles, we introduced our own dynamic rollback overlapping sliding window method to improve the accuracy of human voice detection. On a set of challenging negative examples collected from our daily production environment, common market solutions (such as webrtc-vad, speex vad, silero vad, etc.) reached an accuracy of 67%, while our solution improved the accuracy to 91%.
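The start-and-stop behavior described above can be illustrated with a minimal Python sketch. This is not our dynamic rollback overlapping sliding window method; it only shows the generic idea of opening a segment on the first voiced frame and closing it after a period without voice. The function name `segment_speech`, the parameters `frame_ms` and `hangover_ms`, and the per-frame boolean input (which would come from whatever VAD model is in use) are all assumptions made for illustration.

```python
from typing import Iterable, List, Tuple


def segment_speech(
    frame_is_speech: Iterable[bool],
    frame_ms: int = 20,       # assumed frame length, illustrative only
    hangover_ms: int = 300,   # assumed silence needed to close a segment
) -> List[Tuple[int, int]]:
    """Turn per-frame voice flags into (start_ms, end_ms) speech segments.

    A segment opens at the first voiced frame and closes once no voice
    has been seen for at least `hangover_ms`. Shorter pauses are kept
    inside the segment.
    """
    segments: List[Tuple[int, int]] = []
    start = None      # start time of the open segment, if any
    silence_ms = 0    # consecutive silence inside the open segment
    t = 0             # start time of the current frame

    for is_speech in frame_is_speech:
        if is_speech:
            if start is None:
                start = t
            silence_ms = 0
        elif start is not None:
            silence_ms += frame_ms
            if silence_ms >= hangover_ms:
                # Close the segment at the end of the last voiced frame.
                segments.append((start, t + frame_ms - silence_ms))
                start = None
                silence_ms = 0
        t += frame_ms

    # Flush a segment that is still open when the audio ends.
    if start is not None:
        segments.append((start, t - silence_ms))
    return segments


if __name__ == "__main__":
    flags = [False, True, True, False, False, False, True]
    print(segment_speech(flags, frame_ms=20, hangover_ms=60))
```

A longer `hangover_ms` tolerates natural pauses in a sentence at the cost of a slower end-of-speech decision, so in a real telephony pipeline the value would need tuning against response latency.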

The second scenario involves other people speaking, such as voices from a television or conversations happening nearby. This is the most challenging problem, and so far there has been no good solution for it. It is also very common and significantly affects dialogue quality. After continuous research and effort, we have finally made progress. Please listen to the results:

Before filtering:

After filtering:

Before filtering:

After filtering:

Before filtering:

After filtering:

Of course, since this was only completed today, I have tested it only on recordings I simulated myself. We still need colleagues to help collect more examples of similar background voices and various mixed voices for further training.

Introducing a new model increases server resource consumption and audio processing time, so the added time has to be recovered in other parts of the dialogue pipeline (NLP, ASR, LLM, TTS). There’s no need to worry about this; we already have a solution, hehe~
