College of Engineering, Al Faisal University, P.O. Box 50927, Riyadh 11533, Saudi Arabia.
Sensors (Basel). 2023 Jan 26;23(3):1386. doi: 10.3390/s23031386.
Emotions play a crucial role in human mental life and are vital for identifying a person's behaviour and mental state. Speech Emotion Recognition (SER) is the task of inferring a speaker's emotional state from the speech signal. SER is a growing discipline within human-computer interaction and has recently attracted significant interest. Because the set of universal emotions is small, any intelligent system with sufficient computational capacity can be trained to recognise them. However, human speech is immensely diverse, which makes it difficult to devise a single, standardised method for detecting the emotions hidden in it. This work addressed that difficulty by combining multilingual emotional datasets and building a more generalised and effective model for recognising human emotions. The model was developed in two stages: feature extraction, followed by classification of the extracted features. Zero-Crossing Rate (ZCR), RMSE, and the well-known Mel-Frequency Cepstral Coefficients (MFCCs) were extracted as features. Two proposed models were used for classification: a 1D CNN combined with LSTM and attention, and a custom 2D CNN architecture. The results showed that the proposed 1D CNN with LSTM and attention outperformed the 2D CNN, achieving accuracies of 96.72%, 97.13%, 96.72%, and 88.39% on the EMO-DB, SAVEE, ANAD, and BAVED datasets, respectively. The model surpassed several earlier efforts on the same datasets, demonstrating its generality and efficacy in recognising multiple emotions across different languages.
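To make the first stage concrete, the two simpler frame-level features named in the abstract (ZCR and RMSE) can be sketched in a few lines of NumPy; this is an illustrative sketch, not the paper's actual implementation, and the frame length and hop size shown are assumed defaults. In practice the MFCCs would typically come from an audio library such as librosa rather than being computed by hand.

```python
import numpy as np

def frame_signal(x, frame_len=2048, hop=512):
    # Split a 1-D audio signal into overlapping frames of length frame_len.
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def zcr(frames):
    # Zero-Crossing Rate: fraction of adjacent-sample sign changes per frame.
    signs = np.sign(frames)
    signs[signs == 0] = 1          # treat exact zeros as positive
    return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)

def rmse(frames):
    # Root-Mean-Square Energy per frame.
    return np.sqrt(np.mean(frames ** 2, axis=1))
```

Stacking these per-frame values (together with MFCCs) over time yields the feature sequence that is fed to the classifiers in the second stage.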
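The attention step in the winning 1D CNN + LSTM + attention model can be illustrated with a minimal attention-pooling sketch: the recurrent hidden states are scored against a learned query vector, the scores are softmax-normalised, and the sequence is collapsed into a single weighted context vector for classification. The function and variable names here are hypothetical and the paper's exact attention formulation may differ.

```python
import numpy as np

def attention_pool(H, w):
    # H: (T, d) hidden states from the LSTM; w: (d,) learned query vector.
    scores = H @ w                 # (T,) alignment score per time step
    scores = scores - scores.max() # subtract max for numerical stability
    alpha = np.exp(scores)
    alpha = alpha / alpha.sum()    # softmax attention weights over time
    context = alpha @ H            # (d,) weighted sum of hidden states
    return context, alpha
```

The resulting context vector is what a final dense softmax layer would consume to predict the emotion class, letting the model focus on the most emotionally salient frames rather than averaging the whole utterance uniformly.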