School of Life Sciences, Tiangong University, Tianjin, China.
Tianjin Key Laboratory of Optoelectronic Detection Technology and System, Tianjin, China.
Biomed Tech (Berl). 2021 Nov 29;66(6):613-625. doi: 10.1515/bmt-2021-0112. Print 2021 Dec 20.
Automatic voice pathology detection and classification play an important role in the diagnosis and prevention of voice disorders. To describe the pronunciation characteristics of patients with dysarthria more accurately and to improve pathological voice detection, this study proposes a detection method based on a multi-modal network structure. First, speech signals and electroglottography (EGG) signals are converted from time-domain waveforms into spectrograms via a short-time Fourier transform (STFT), and a Mel filter bank is applied to the spectrograms to enhance the signals' harmonics and suppress noise. Second, a pre-trained convolutional neural network (CNN) is used as the backbone network to extract sound state features and vocal cord vibration features from the two signals. To obtain better classification performance, the extracted features are fused and fed into a long short-term memory (LSTM) network for voice feature selection and enhancement. On the Saarbrucken Voice Database (SVD), the proposed system achieves an accuracy of 95.73%, an F1-score of 96.10%, and a recall of 96.73%, offering a new method for pathological speech detection.
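The preprocessing step described in the abstract (STFT followed by a Mel filter bank, applied to both the speech and the EGG signal) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the frame length, hop size, number of Mel bands, sampling rate, HTK-style Mel formula, and the synthetic test signals are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import numpy as np


def hz_to_mel(f):
    """Convert Hz to Mel (HTK-style formula; an assumed choice)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)


def stft_power(signal, n_fft=512, hop=128):
    """Hann-windowed STFT power spectrogram, shape (frames, n_fft // 2 + 1)."""
    signal = np.asarray(signal, dtype=float)
    if len(signal) < n_fft:  # zero-pad very short inputs to one full frame
        signal = np.pad(signal, (0, n_fft - len(signal)))
    n_frames = 1 + (len(signal) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack(
        [signal[i * hop:i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2


def mel_filter_bank(sr, n_fft=512, n_mels=40):
    """Triangular Mel filters, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    mel_max = float(hz_to_mel(sr / 2.0))
    edges = mel_to_hz(np.linspace(0.0, mel_max, n_mels + 2))
    bank = np.zeros((n_mels, fft_freqs.size))
    for m in range(n_mels):
        lo, center, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (fft_freqs - lo) / (center - lo)
        falling = (hi - fft_freqs) / (hi - center)
        bank[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank


def log_mel_spectrogram(signal, sr, n_fft=512, hop=128, n_mels=40):
    """Log-Mel spectrogram in dB, shape (frames, n_mels)."""
    power = stft_power(signal, n_fft, hop)
    mel = power @ mel_filter_bank(sr, n_fft, n_mels).T
    return 10.0 * np.log10(mel + 1e-10)  # small offset avoids log(0)


# Synthetic stand-ins for one second of speech and EGG (assumed data).
sr = 16000
t = np.arange(sr) / sr
rng = np.random.default_rng(0)
speech = sum(np.sin(2 * np.pi * 120 * k * t) / k for k in range(1, 6))
speech = speech + 0.05 * rng.standard_normal(sr)
egg = np.sin(2 * np.pi * 120 * t)

# One log-Mel "image" per modality, stacked as a two-channel input.
features = np.stack([log_mel_spectrogram(speech, sr), log_mel_spectrogram(egg, sr)])
print(features.shape)  # (2, 122, 40)
```

Each channel of `features` is a time-frequency image that a CNN backbone could consume. Whether the two modalities share one backbone or use separate branches before fusion is not stated in the abstract, so the stacking shown here is only one possible arrangement.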