Nam Youngja, Lee Chankyu
Humanities Research Institute, Chung-Ang University, Seoul 06974, Korea.
Department of Korean Language and Literature, Chung-Ang University, Seoul 06974, Korea.
Sensors (Basel). 2021 Jun 27;21(13):4399. doi: 10.3390/s21134399.
Convolutional neural networks (CNNs) are a state-of-the-art technique for speech emotion recognition. However, CNNs have mostly been applied to noise-free emotional speech data, and limited evidence is available for their applicability in emotional speech denoising. In this study, a cascaded denoising CNN (DnCNN)-CNN architecture is proposed to classify emotions from Korean and German speech in noisy conditions. The proposed architecture consists of two stages. In the first stage, the DnCNN exploits the concept of residual learning to perform denoising; in the second stage, the CNN performs the classification. The classification results for real datasets show that the DnCNN-CNN outperforms the baseline CNN in overall accuracy for both languages. For Korean speech, the DnCNN-CNN achieves an accuracy of 95.8%, whereas the accuracy of the CNN is marginally lower (93.6%). For German speech, the DnCNN-CNN has an overall accuracy of 59.3-76.6%, whereas the CNN has an overall accuracy of 39.4-58.1%. These results demonstrate the feasibility of applying the DnCNN with residual learning to speech denoising and the effectiveness of the CNN-based approach in speech emotion recognition. Our findings provide new insights into speech emotion recognition in adverse conditions and have implications for language-universal speech emotion recognition.
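The two-stage cascade described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function bodies are hypothetical placeholders, and the spectrogram shape and number of emotion classes are assumptions. The key idea it shows is residual learning — the DnCNN is trained to predict the noise itself, so the clean estimate is the noisy input minus the predicted residual, and that estimate is then passed to the CNN classifier.

```python
import numpy as np

def dncnn_predict_residual(noisy_spec: np.ndarray) -> np.ndarray:
    """Stage 1 (hypothetical stand-in): a trained DnCNN maps the noisy
    spectrogram to an estimate of the noise component (residual learning)."""
    # Placeholder: a real DnCNN is a stack of Conv + BatchNorm + ReLU layers.
    return np.zeros_like(noisy_spec)

def cnn_classify(denoised_spec: np.ndarray, n_classes: int = 4) -> int:
    """Stage 2 (hypothetical stand-in): a CNN classifier over the denoised
    spectrogram, returning an emotion label index."""
    # Placeholder: a real CNN would emit softmax scores, one per emotion.
    scores = np.mean(denoised_spec) * np.ones(n_classes)
    return int(np.argmax(scores))

def dncnn_cnn_pipeline(noisy_spec: np.ndarray) -> int:
    # Residual learning: subtract the predicted noise from the input
    # to obtain the denoised spectrogram.
    residual = dncnn_predict_residual(noisy_spec)
    denoised = noisy_spec - residual
    # Cascade: feed the denoised spectrogram to the emotion classifier.
    return cnn_classify(denoised)

spec = np.random.rand(64, 64)  # toy log-mel spectrogram (shape assumed)
label = dncnn_cnn_pipeline(spec)
```

Training the denoiser on the residual rather than the clean signal is what the abstract refers to as "the concept of residual learning": the noise mapping is typically easier to learn than the full clean-speech mapping.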