Singkul Sattaya, Woraratpanya Kuntpong
Faculty of Information Technology, King Mongkut's Institute of Technology Ladkrabang, 1 Chalong Krung, Lat Krabang, 10520, Bangkok, Thailand.
Heliyon. 2022 Mar 28;8(3):e09196. doi: 10.1016/j.heliyon.2022.e09196. eCollection 2022 Mar.
Speech emotion recognition (SER) plays an important role in global business today by improving service efficiency. In the SER literature, many techniques use deep learning to extract and learn features. Recently, we proposed end-to-end learning for a deep residual local feature learning block (DeepResLFLB). The advantages of end-to-end learning are low engineering effort and less hyperparameter tuning. Nevertheless, this learning method easily falls into an overfitting problem. Therefore, this paper describes a "verify-to-classify" framework applied to learning vectors extracted from feature spaces of emotional information. The framework consists of two important portions: speech emotion learning and speech emotion recognition. Speech emotion learning consists of two steps, speech emotion verification enrolled training and prediction; residual learning (ResNet) with a squeeze-excitation (SE) block is used as the core component of both steps to extract emotional state vectors and to build an emotion model through the verification enrolled training. The in-domain pre-trained weights of the trained emotion model are then transferred to the prediction step. As a result of speech emotion learning, the accepted model, validated by equal error rate (EER), is transferred to speech emotion recognition as out-of-domain pre-trained weights, which are ready for classification with a classical ML method. In this manner, a suitable loss function is important for working with emotional vectors. Here, two loss functions were proposed: angular prototypical loss and softmax with angular prototypical loss. Experiments were conducted on two publicly available datasets, Emo-DB and RAVDESS, covering both high- and low-quality recording environments. The experimental results show that our proposed method can significantly improve generalized performance and produce explainable emotion results when evaluated by standard metrics: EER, accuracy, precision, recall, and F1-score.
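The squeeze-excitation (SE) block mentioned above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation; the channel count, reduction ratio, and weight shapes below are illustrative assumptions, showing only the squeeze (global pooling), excitation (bottleneck gating), and scale steps that define an SE block.

```python
import numpy as np

def squeeze_excitation(feature_map, w1, w2):
    """Squeeze-and-Excitation: recalibrate channels by global context.

    feature_map: (C, H, W) convolutional feature map.
    w1: (C//r, C) reduction weights; w2: (C, C//r) expansion weights,
    where r is the reduction ratio (assumed hyperparameter).
    """
    # Squeeze: global average pooling over spatial dimensions -> (C,)
    z = feature_map.mean(axis=(1, 2))
    # Excitation: bottleneck MLP, ReLU then sigmoid gate per channel
    s = np.maximum(w1 @ z, 0.0)
    gate = 1.0 / (1.0 + np.exp(-(w2 @ s)))   # values in (0, 1)
    # Scale: reweight each input channel by its gate
    return feature_map * gate[:, None, None]

# Toy usage: 8 channels, reduction ratio r = 4 (all values illustrative)
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 5, 5))
w1 = rng.standard_normal((2, 8)) * 0.1
w2 = rng.standard_normal((8, 2)) * 0.1
y = squeeze_excitation(x, w1, w2)
print(y.shape)
```

Because the gate lies in (0, 1), each output channel is a damped copy of the input channel; inside a ResNet, this block sits after a residual branch's convolutions and lets the network emphasize emotion-relevant channels.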
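The angular prototypical loss used for the emotional vectors can likewise be sketched, under assumptions: class prototypes are support-set means, similarity is cosine, and the scale `w = 10.0` and offset `b = -5.0` are illustrative initial values rather than the paper's settings.

```python
import numpy as np

def angular_prototypical_loss(queries, supports, w=10.0, b=-5.0):
    """Angular prototypical loss (sketch).

    queries:  (N, D) one query embedding per class (class i -> row i).
    supports: (N, K, D) K support embeddings per class; prototypes are
              the per-class support means.
    w, b: learnable affine scaling of the cosine similarity (assumed values).
    """
    protos = supports.mean(axis=1)                       # (N, D) prototypes
    qn = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    pn = protos / np.linalg.norm(protos, axis=1, keepdims=True)
    logits = w * (qn @ pn.T) + b                         # scaled cosine similarities
    # Softmax cross-entropy: query i should match prototype i (diagonal)
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Toy usage: 4 classes, 16-dim embeddings, supports clustered near queries
rng = np.random.default_rng(1)
q = rng.standard_normal((4, 16))
s = q[:, None, :] + 0.1 * rng.standard_normal((4, 3, 16))
loss = angular_prototypical_loss(q, s)
print(float(loss))
```

Because the loss operates on angles between embeddings and prototypes, it pairs naturally with verification-style training: the same cosine scores used for the loss can score whether an utterance matches an enrolled emotion model.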