Amjad Ammar, Khan Lal, Chang Hsien-Tsung
Department of Computer Science and Information Engineering, Chang Gung University, Taoyuan, Taiwan.
Department of Physical Medicine and Rehabilitation, Chang Gung Memorial Hospital, Taoyuan, Taiwan.
PeerJ Comput Sci. 2021 Nov 3;7:e766. doi: 10.7717/peerj-cs.766. eCollection 2021.
Speech emotion recognition (SER) is challenging because it is not clear which features are effective for classification. Emotion-related features are typically extracted from speech signals for emotion classification, and handcrafted features are the ones mainly used to identify emotion from audio signals. However, these features are not sufficient to correctly identify the emotional state of the speaker. This work investigates the advantages of a deep convolutional neural network (DCNN): a pretrained framework is used to extract features from speech emotion databases. We then adopt a feature selection (FS) approach to find the most discriminative and important features for SER. For classification, we use random forest (RF), decision tree (DT), support vector machine (SVM), multilayer perceptron (MLP), and k-nearest neighbors (KNN) classifiers to classify seven emotions. All experiments are performed on four publicly accessible databases. With feature selection, our method obtains accuracies of 92.02%, 88.77%, 93.61%, and 77.23% on Emo-DB, SAVEE, RAVDESS, and IEMOCAP, respectively, for speaker-dependent (SD) recognition. Furthermore, compared to current handcrafted feature-based SER methods, the proposed method shows the best results for speaker-independent SER. For Emo-DB, all classifiers attain an accuracy of more than 80% with or without feature selection.
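The pipeline outlined in the abstract (deep features, then feature selection, then five classifiers) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the feature matrix is synthetic and merely stands in for features from a pretrained DCNN, the univariate `SelectKBest` selector is an assumed substitute because the abstract does not name the FS algorithm, and every hyperparameter (`k=64`, the feature dimension of 256, the train/test ratio) is chosen for illustration only. The scikit-learn library is assumed to be installed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic data replaces real DCNN features. Each row plays the
# role of one utterance's feature vector; labels 0..6 play the seven emotions.
X, y = make_classification(
    n_samples=700,
    n_features=256,
    n_informative=40,
    n_redundant=20,
    n_classes=7,
    n_clusters_per_class=1,
    random_state=0,
)

# A random stratified split mixes samples regardless of speaker, which
# loosely mirrors a speaker-dependent (SD) evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# The five classifier families named in the abstract, with illustrative settings.
classifiers = {
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "DT": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(kernel="rbf"),
    "MLP": MLPClassifier(hidden_layer_sizes=(128,), max_iter=500, random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}

accuracies = {}
for name, clf in classifiers.items():
    # Scale, keep the 64 features with the highest ANOVA F-score
    # (assumed stand-in for the paper's FS step), then classify.
    model = make_pipeline(StandardScaler(), SelectKBest(f_classif, k=64), clf)
    model.fit(X_train, y_train)
    accuracies[name] = model.score(X_test, y_test)

for name, acc in accuracies.items():
    print(f"{name}: test accuracy = {acc:.3f}")
```

Because the selector sits inside the pipeline, it is fitted on the training split only, which avoids leaking test labels into the feature ranking. For a speaker-independent evaluation, the random split would need to be replaced by one that keeps each speaker's utterances entirely in either the training or the test set, for example with scikit-learn's `LeaveOneGroupOut` and speaker IDs as groups. The accuracies printed here come from synthetic data and say nothing about the results reported in the abstract.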