Mamun Nursadul, Hansen John H L
CRSS: Center for Robust Speech Systems; Cochlear Implant Processing Laboratory (CILab), Department of Electrical and Computer Engineering, University of Texas at Dallas, USA.
IEEE/ACM Trans Audio Speech Lang Process. 2024;32:2616-2629. doi: 10.1109/taslp.2024.3366760. Epub 2024 Feb 22.
The presence of background noise or competing talkers is one of the main communication challenges for cochlear implant (CI) users when understanding speech in naturalistic environments. These external factors distort the time-frequency (T-F) content of speech signals, including both the magnitude spectrum and the phase. While most existing speech enhancement (SE) solutions focus solely on enhancing the magnitude response, recent research highlights the importance of phase in perceptual speech quality. Motivated by multi-task machine learning, this study proposes a deep complex convolution transformer network (DCCTN) for complex spectral mapping, which simultaneously enhances the magnitude and phase responses of speech. The proposed network leverages a complex-valued U-Net structure with a transformer in the bottleneck layer to capture both low-level detail and contextual information in the T-F domain. To capture the harmonic correlation in speech, DCCTN incorporates a frequency transformation block in the encoder of the U-Net architecture. The DCCTN learns a complex transformation matrix to accurately recover speech in the T-F domain from a noisy input spectrogram. Experimental results demonstrate that the proposed DCCTN outperforms existing solutions such as the convolutional recurrent network (CRN), deep complex convolutional recurrent network (DCCRN), and gated convolutional recurrent network (GCRN) in terms of objective speech intelligibility and quality, for both seen and unseen noise conditions. To evaluate the effectiveness of the proposed SE solution, a formal listener evaluation involving four CI recipients was conducted. Results indicate a significant improvement in speech intelligibility for CI recipients in noisy environments. Additionally, DCCTN demonstrates the capability to suppress highly non-stationary noise without introducing the musical artifacts commonly observed in conventional SE methods.
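The complex spectral mapping described above processes real and imaginary parts jointly, so phase is enhanced alongside magnitude. As a minimal sketch (assuming NumPy; the function `complex_conv1d` is illustrative and not from the paper), the core operation behind complex-valued networks such as DCCRN and the proposed DCCTN is a complex convolution built from four real-valued convolutions:

```python
import numpy as np

def complex_conv1d(x_re, x_im, w_re, w_im):
    """Complex 1-D convolution from four real convolutions:
    (x_re + i*x_im) * (w_re + i*w_im)
    = (x_re*w_re - x_im*w_im) + i*(x_re*w_im + x_im*w_re)."""
    y_re = np.convolve(x_re, w_re, mode="same") - np.convolve(x_im, w_im, mode="same")
    y_im = np.convolve(x_re, w_im, mode="same") + np.convolve(x_im, w_re, mode="same")
    return y_re, y_im

rng = np.random.default_rng(0)
x = rng.standard_normal(16) + 1j * rng.standard_normal(16)  # toy noisy spectrum slice
w = rng.standard_normal(5) + 1j * rng.standard_normal(5)    # toy complex filter
y_re, y_im = complex_conv1d(x.real, x.imag, w.real, w.imag)

# The result matches a direct complex convolution, confirming that magnitude
# and phase are transformed jointly rather than magnitude alone.
assert np.allclose(y_re + 1j * y_im, np.convolve(x, w, mode="same"))
```

In a full network, learnable filter banks replace the fixed filter `w`, and this operation is stacked inside the complex-valued U-Net encoder and decoder.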