Department of Computer Engineering, University of Engineering and Technology, Taxila 47050, Pakistan.
Department of Software Engineering, University of Engineering and Technology, Taxila 47050, Pakistan.
Sensors (Basel). 2020 Oct 23;20(21):6008. doi: 10.3390/s20216008.
Speech emotion recognition (SER) plays a significant role in human-machine interaction. Recognizing emotion from speech and classifying it precisely is a challenging task because a machine cannot understand the context of an utterance. For accurate emotion classification, emotionally relevant features must be extracted from the speech data. Traditionally, handcrafted features were used for emotion classification from speech signals; however, they do not adequately capture the emotional state of the speaker. In this study, the benefits of a deep convolutional neural network (DCNN) for SER are explored. For this purpose, a pretrained network is used to extract features from state-of-the-art speech emotion datasets. Subsequently, a correlation-based feature selection technique is applied to the extracted features to select the most appropriate and discriminative features for SER. For the classification of emotions, we utilize support vector machines, random forests, the k-nearest neighbors algorithm, and neural network classifiers. Experiments are performed for speaker-dependent and speaker-independent SER using four publicly available datasets: the Berlin Dataset of Emotional Speech (Emo-DB), Surrey Audio Visual Expressed Emotion (SAVEE), Interactive Emotional Dyadic Motion Capture (IEMOCAP), and the Ryerson Audio Visual Dataset of Emotional Speech and Song (RAVDESS). In speaker-dependent SER experiments, our proposed method achieves an accuracy of 95.10% for Emo-DB, 82.10% for SAVEE, 83.80% for IEMOCAP, and 81.30% for RAVDESS. Moreover, our method outperforms existing handcrafted-feature-based SER approaches in speaker-independent experiments.
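The pipeline described in the abstract (pretrained-DCNN features → correlation-based feature selection → a classical classifier) can be illustrated with a minimal sketch. The paper does not specify its exact selection or classifier settings here, so everything below is an assumption: synthetic vectors stand in for the DCNN features, features are ranked by absolute Pearson correlation with the emotion label, and a nearest-centroid classifier stands in for the SVM/RF/kNN/NN classifiers used in the study.

```python
import numpy as np

# Synthetic stand-in for DCNN features: 400 utterances, 50-dim feature
# vectors, 4 emotion classes. Dimensions and class count are illustrative.
rng = np.random.default_rng(0)
n, d, k = 400, 50, 4
y = rng.integers(0, k, n)
X = rng.normal(size=(n, d))
X[:, :8] += 1.5 * y[:, None]  # only the first 8 dims carry class information

# Correlation-based feature selection: rank every feature by the absolute
# Pearson correlation of its values with the class label, keep the top m.
corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(d)])
selected = np.argsort(corr)[::-1][:8]

# Nearest-centroid classifier as a simple placeholder for the paper's
# SVM / random forest / kNN / neural network classifiers.
train, test = np.arange(n) < 300, np.arange(n) >= 300
centroids = np.stack(
    [X[train & (y == c)][:, selected].mean(axis=0) for c in range(k)]
)
dists = ((X[test][:, selected][:, None, :] - centroids) ** 2).sum(axis=-1)
pred = np.argmin(dists, axis=1)
acc = (pred == y[test]).mean()
```

In a real SER setting, `X` would come from a pretrained DCNN applied to spectrograms of the Emo-DB, SAVEE, IEMOCAP, or RAVDESS recordings, and the train/test split would be speaker-dependent or speaker-independent as in the experiments above.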