Zhang Yixuan, Wang Heming, Wang DeLiang
Department of Computer Science and Engineering, Ohio State University, Columbus, OH 43210 USA.
Department of Computer Science and Engineering and the Center for Cognitive and Brain Sciences, Ohio State University, Columbus, OH 43210 USA.
IEEE/ACM Trans Audio Speech Lang Process. 2023;31:3760-3770. doi: 10.1109/TASLP.2023.3313427. Epub 2023 Sep 13.
As a fundamental problem in speech processing, pitch tracking has been studied for decades. While strong performance has been achieved on clean speech, pitch tracking in noisy speech is still challenging. Severe non-stationary noises not only corrupt the harmonic structure in voiced intervals but also make it difficult to determine the existence of voiced speech. Given the importance of voicing detection for pitch tracking, this study proposes a neural cascade architecture that jointly performs pitch estimation and voicing detection. The cascade architecture optimizes a speech enhancement module and a pitch tracking module, and is trained in a speaker-independent and noise-independent way. It is observed that incorporating the enhancement module improves both pitch estimation and voicing detection accuracy, especially in low signal-to-noise ratio (SNR) conditions. In addition, compared with frameworks that combine corresponding single-task models, the proposed multi-task framework achieves better performance and is more efficient. Experimental results show that the proposed method is robust to different noise conditions and substantially outperforms other competitive pitch tracking methods.
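The abstract does not give implementation details, so the following is only a minimal sketch of what "jointly performs pitch estimation and voicing detection" behind an enhancement front end could look like. Every name here (`cascade`, `combine_heads`, `joint_loss`, the 0.5 threshold, the octave-scale pitch error, and the loss weighting) is a hypothetical choice made for illustration and is not taken from the paper.

```python
import math

def cascade(noisy_frames, enhancer, tracker):
    """Run a (hypothetical) enhancement stage, then a pitch tracker on its output.

    `enhancer` and `tracker` are arbitrary callables standing in for the
    two trained modules; `tracker` returns (pitch_hz, voicing_prob) lists.
    """
    return tracker(enhancer(noisy_frames))

def combine_heads(pitch_hz, voicing_prob, threshold=0.5):
    """Merge the two per-frame outputs into one pitch track.

    Frames whose voicing probability falls below `threshold` are marked
    unvoiced with 0.0 (an assumed convention).
    """
    return [f0 if p >= threshold else 0.0
            for f0, p in zip(pitch_hz, voicing_prob)]

def joint_loss(pitch_hz, voicing_prob, ref_pitch_hz, weight=1.0):
    """Assumed multi-task loss: pitch error on voiced frames plus voicing BCE.

    A reference pitch of 0.0 marks an unvoiced frame. Pitch error is the
    mean absolute difference in octaves, computed on voiced frames only.
    """
    eps = 1e-7
    bce, pitch_err, n_voiced = 0.0, 0.0, 0
    for f0, p, ref in zip(pitch_hz, voicing_prob, ref_pitch_hz):
        voiced = 1.0 if ref > 0 else 0.0
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        bce -= voiced * math.log(p) + (1.0 - voiced) * math.log(1.0 - p)
        if voiced:
            pitch_err += abs(math.log2(f0 / ref))
            n_voiced += 1
    bce /= len(ref_pitch_hz)
    pitch_err = pitch_err / n_voiced if n_voiced else 0.0
    return pitch_err + weight * bce

# Toy usage: an identity "enhancer" and a tracker with fixed outputs.
pitch, voicing = cascade(
    [0.1, 0.2, 0.3],
    enhancer=lambda frames: frames,
    tracker=lambda frames: ([120.0, 200.0, 90.0], [0.9, 0.2, 0.7]),
)
track = combine_heads(pitch, voicing)  # [120.0, 0.0, 90.0]
```

Sharing one loss across both outputs is what distinguishes a multi-task setup from chaining two single-task models; the relative weight between the two terms is a tuning choice this sketch leaves at 1.0.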