As a flagship international conference in speech research, ASRU 2025 (IEEE Workshop on Automatic Speech Recognition and Understanding) will be held December 6-10 in Honolulu, Hawaii. The theme of IEEE ASRU 2025 is “Towards the New Era of Speech Understanding.” The workshop brings together academia and industry to discuss new developments in the speech field, including but not limited to speech recognition systems, spoken dialogue systems, speech analysis, paralinguistic phenomena in speech, applications of automatic speech recognition and speech analysis, speech large language models, and speech foundation models.
The official conference website is: https://2025.ieeeasru.org/.
The Audio Speech and Language Processing Research Group at Northwestern Polytechnical University (ASLP@NPU) will present six papers at the conference, spanning several research directions in intelligent speech processing. Details of each paper are listed below, along with links to the original papers.

NO.1 EchoFree: Towards Ultra Lightweight and Efficient Neural Acoustic Echo Cancellation
Authors: Li Xingchen, Kang Boyi, Wang Ziqian, Zhang Zihan, Liu Mingshuai, Fu Zhonghua, Xie Lei
Abstract: In recent years, neural networks (NN) have been widely applied to acoustic echo cancellation (AEC). However, existing methods struggle to meet real-world demands for low latency and low computational cost while maintaining performance. To address this challenge, we propose EchoFree, an ultra-lightweight neural AEC framework that combines linear filtering with a neural post-filter. Specifically, we design a neural post-filter based on Bark-scale spectral features. Additionally, we introduce a two-stage optimization strategy that uses a self-supervised learning (SSL) model to enhance model performance. We evaluate our method on the blind test set of the ICASSP 2023 AEC challenge. The results show that our model requires only 278K parameters and 30 MMAC of computational complexity, outperforming existing low-complexity AEC models and achieving performance comparable to the state-of-the-art lightweight model DeepVQE-S.
Paper Arxiv link: http://arxiv.org/abs/2508.06271
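
For readers unfamiliar with Bark-scale features, the snippet below is a minimal sketch of how a power spectrum can be compressed into a small number of Bark bands. It is not the EchoFree implementation: the Zwicker-Terhardt approximation, the 16 kHz sample rate, the 20 equal-width bands, and the function names are all assumptions chosen for illustration.

```python
import math


def hz_to_bark(freq_hz: float) -> float:
    """Zwicker-Terhardt approximation of the Bark scale."""
    return 13.0 * math.atan(0.00076 * freq_hz) + 3.5 * math.atan((freq_hz / 7500.0) ** 2)


def bark_band_energies(power_spectrum, sample_rate=16000, num_bands=20):
    """Sum FFT-bin powers into bands of equal width on the Bark axis.

    The band count and the equal-width layout are illustrative assumptions.
    """
    num_bins = len(power_spectrum)
    nyquist = sample_rate / 2.0
    max_bark = hz_to_bark(nyquist)
    energies = [0.0] * num_bands
    for k, power in enumerate(power_spectrum):
        freq = k * nyquist / (num_bins - 1)  # centre frequency of bin k
        band = min(int(hz_to_bark(freq) / max_bark * num_bands), num_bands - 1)
        energies[band] += power
    return energies


# A flat 257-bin spectrum (512-point FFT) collapses to 20 band energies.
flat_spectrum = [1.0] * 257
bands = bark_band_energies(flat_spectrum)
print(len(bands), bands[0], bands[-1])
```

Because Bark bands are narrow at low frequencies and wide at high frequencies, the last band collects many more FFT bins than the first. A post-filter that works on such band energies has far fewer inputs than one that works on every FFT bin, which is one plausible reason this kind of feature suits lightweight models.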

NO.2 XEmoRAG: Cross-Lingual Emotion Transfer with Controllable Intensity Using Retrieval-Augmented Generation
Authors: Zuo Tianlun, Hu Jingbin, Li Yuke, Zhu Xinfang, Li Hai, Yan Ying, Liu Junhui, Xie Danming, Xie Lei
Abstract: This paper proposes a cross-lingual zero-shot emotion transfer speech synthesis framework named XEmoRAG, achieving emotion transfer from Chinese to Thai without parallel emotional data. The method first extracts language-independent emotional embeddings from Chinese reference speech and retrieves emotion-matched speech from a carefully constructed Thai emotional database, enabling controllable emotion transfer. Through a flow matching alignment module, it reduces pitch and duration mismatches while integrating Chinese timbre into Thai synthesis, enhancing rhythm accuracy and emotional expression, and maintaining speaker characteristics and emotional consistency. Experimental results show that XEmoRAG can synthesize natural and expressive Thai speech relying solely on Chinese reference audio, demonstrating its ability to flexibly perform cross-lingual emotion transfer under low-resource conditions.
Paper Arxiv link: https://arxiv.org/pdf/2508.07302

NO.3 Llasa+: Free Lunch for Accelerated and Streaming Llama-Based Speech Synthesis
Authors: Tian Wenjie, Zhu Xinfang, Xie Hanke, Ye Zhen, Xue Wei, Xie Lei
Abstract: Recent advances in text-to-speech (TTS) have achieved remarkable naturalness and flexibility, especially following the development of methods based on large language models (LLMs). However, existing autoregressive (AR) structures and large-scale models such as Llasa still face significant challenges in inference latency and streaming synthesis. To address these limitations, we propose Llasa+, an accelerated streaming TTS model built on Llasa. Specifically, to speed up generation, we introduce two plug-and-play multi-token prediction (MTP) modules on top of the frozen backbone network. These modules enable the model to predict multiple tokens in a single autoregressive step. Furthermore, to mitigate error propagation that may arise from inaccurate MTP predictions, we design a novel validation algorithm that uses the frozen backbone network to validate the generated tokens, allowing Llasa+ to achieve acceleration without sacrificing generation quality. Finally, we design a causal decoder capable of streaming speech reconstruction from tokens. Extensive experiments show that, despite being trained only on the LibriTTS dataset, Llasa+ achieves a 1.48x speedup without loss of generation quality. Moreover, this framework combining MTP and validation can be applied to accelerate any LLM-based model.
Code and model: https://github.com/ASLP-lab/LLaSA_Plus
Paper Arxiv link: https://arxiv.org/pdf/2508.06262
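
The abstract does not spell out the validation algorithm, so the snippet below is only a toy sketch of the general draft-then-validate idea, in the style of greedy speculative decoding. The functions `backbone_next` and `mtp_draft` are hypothetical stand-ins for the frozen backbone and the MTP modules, not part of the released code.

```python
from typing import List, Tuple

VOCAB_SIZE = 97  # arbitrary toy vocabulary size


def backbone_next(prefix: List[int]) -> int:
    """Hypothetical stand-in for the frozen backbone: a deterministic next-token rule."""
    return (sum(prefix) * 31 + len(prefix)) % VOCAB_SIZE


def mtp_draft(prefix: List[int], num_extra: int) -> List[int]:
    """Hypothetical stand-in for MTP modules: cheap guesses that are sometimes wrong."""
    context = list(prefix)
    drafts = []
    for _ in range(num_extra):
        guess = backbone_next(context)
        if len(context) % 5 == 0:  # inject an occasional wrong guess
            guess = (guess + 1) % VOCAB_SIZE
        drafts.append(guess)
        context.append(guess)
    return drafts


def greedy_decode(prompt: List[int], max_new: int) -> List[int]:
    """Baseline: one backbone token per step."""
    tokens = list(prompt)
    for _ in range(max_new):
        tokens.append(backbone_next(tokens))
    return tokens[len(prompt):]


def generate_with_validation(prompt: List[int], max_new: int, num_extra: int = 2) -> Tuple[List[int], int]:
    """Keep the backbone token, then accept drafted tokens only while the backbone agrees."""
    tokens = list(prompt)
    steps = 0
    while len(tokens) - len(prompt) < max_new:
        steps += 1
        tokens.append(backbone_next(tokens))  # the backbone's own token is always kept
        for draft in mtp_draft(tokens, num_extra):
            if len(tokens) - len(prompt) >= max_new:
                break
            if draft != backbone_next(tokens):  # validation failed: drop this and later drafts
                break
            tokens.append(draft)
    return tokens[len(prompt):], steps


baseline = greedy_decode([1, 2, 3], 20)
accelerated, num_steps = generate_with_validation([1, 2, 3], 20)
print(accelerated == baseline, num_steps)
```

In this toy, every accepted token matches what the backbone would have produced on its own, so the output is identical to plain greedy decoding while taking fewer outer steps. A real system would validate all drafted tokens in a single batched forward pass rather than one call per token, and may use a different acceptance rule.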

NO.4 Efficient Scaling for LLM-based ASR
Authors: Mu Bingsheng, Shao Yiwen, Wei Kun, Yu Dong, Xie Lei
Abstract: Large language model (LLM)-based automatic speech recognition (ASR) has achieved outstanding performance but often incurs high computational costs. This work investigates how to efficiently achieve optimal LLM-ASR performance. Through comprehensive, controlled experiments, we find that pre-training the speech encoder before integrating it with the LLM significantly improves scaling efficiency compared to the standard joint post-training approach for LLM-ASR. Based on this insight, we propose a new multi-stage LLM-ASR training strategy called Encoder-First Integration (EFIN). Among all evaluated training strategies, EFIN consistently achieves better performance (a relative character error rate reduction of 21.1%) with a significantly lower computational budget (49.9% fewer floating-point operations). Additionally, we derive a scaling law that approximates ASR error rates as a function of computation, providing practical guidance for scaling LLM-ASR.
Paper Arxiv link: https://arxiv.org/pdf/2508.04096
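
To illustrate what a scaling law of this kind looks like in practice, the snippet below fits a power law, error = a * compute^(-b), by least squares in log-log space. The power-law form is an assumption made for this sketch (the abstract does not state the functional form), and the data points are synthetic, not results from the paper.

```python
import math


def fit_power_law(compute, error):
    """Least-squares fit of log(error) = log(a) - b * log(compute); returns (a, b)."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(e) for e in error]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    a = math.exp(mean_y - slope * mean_x)
    return a, -slope


# Synthetic, made-up points generated from error = 50 * compute^-0.1.
compute_flops = [1e18, 1e19, 1e20, 1e21]
cer_percent = [50.0 * c ** -0.1 for c in compute_flops]

a, b = fit_power_law(compute_flops, cer_percent)
predicted_cer = a * 1e22 ** -b  # extrapolate to a larger compute budget
print(a, b, predicted_cer)
```

Once fitted on a handful of small training runs, a curve like this can be used to estimate the error rate expected at a larger compute budget before committing to the full run.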

NO.5 DiffRhythm+: Controllable and Flexible Full-Length Song Generation with Preference Optimization
Authors: Chen Huakang, Jiang Yuepeng, Ma Guobin, Hao Chunbo, Wang Shuai, Yao Jixun, Ning Ziqian, Meng Meng, Luan Jian, Xie Lei
Abstract: As a core form of musical art, songs reflect the richness of human wisdom and creativity. Despite significant advancements in generative modeling for long song generation in recent years, current full-length song synthesis systems still face numerous challenges, including data imbalance, insufficient controllability, and inconsistent music quality. DiffRhythm, as a pioneering diffusion-based model, has advanced the field by generating expressive full-length songs with vocals and accompaniment. However, its performance is constrained by the imbalance of the model training dataset and limited control over musical styles, leading to noticeable quality discrepancies in generated works and restricting creative flexibility. To address these limitations, we propose DiffRhythm+, an enhanced, controllable, and flexible full-length song generation diffusion framework. DiffRhythm+ utilizes a significantly expanded and balanced training dataset to alleviate issues such as lyric repetition and omission while promoting a richer emergence of musical techniques and expressiveness. The framework introduces a multi-modal style conditioning strategy, allowing users to precisely specify musical styles through descriptive text and reference audio, significantly enhancing creative control and diversity. We further introduce a performance optimization method directly aligned with user preferences, guiding the model to continuously generate outputs that better align with human preferences across various evaluation metrics. Extensive experiments show that DiffRhythm+ achieves significant improvements in naturalness, arrangement complexity, and listener satisfaction compared to existing systems.
Paper Arxiv link: https://arxiv.org/abs/2507.12890

NO.6 REF-VC: Robust, Expressive and Fast Zero-Shot Voice Conversion with Diffusion Transformers
Authors: Jiang Yuepeng, Ning Ziqian, Wang Shuai, Wang Chengjia, Bi Mengxiao, Zhu Pengcheng, Fu Zhonghua, Xie Lei
Abstract: In practical voice conversion applications, environmental noise in the source speech and users' demand for expressive output pose severe challenges. Traditional automatic speech recognition (ASR)-based methods ensure noise robustness and conversion stability but limit prosodic diversity, while models based on self-supervised learning (SSL) representations make the converted speech significantly more expressive but suffer from timbre leakage and sensitivity to noise. This paper proposes a robust and expressive voice conversion system, REF-VC. Key innovations include: (1) employing a random erasure strategy to reduce the model’s focus on “ineffective information” in SSL representations, such as background noise and speaker timbre, thereby enhancing both noise robustness and expressiveness; (2) utilizing an implicit alignment mechanism inspired by E2TTS to suppress the reconstruction of “ineffective information” in SSL representations; (3) integrating a fast model to accelerate flow matching inference, reducing the number of inference steps to 4. Experimental results show that REF-VC outperforms baseline models such as Seed-VC in zero-shot scenarios on noisy datasets while performing comparably to Seed-VC on clean test sets. Additionally, REF-VC is compatible with singing voice conversion without additional training.
Paper Arxiv link: https://arxiv.org/abs/2508.04996


Follow the ASLP Laboratory’s WeChat official account for more information on speech research!
“Building the most open, cutting-edge, and practical artificial intelligence laboratory”
