Department of Computer Science, College of Computers and Information Technology, Taif University, Taif 21944, Saudi Arabia.
Comput Biol Med. 2024 Sep;179:108841. doi: 10.1016/j.compbiomed.2024.108841. Epub 2024 Jul 12.
Speech emotion recognition (SER) stands as a prominent and dynamic research field in data science due to its extensive application in domains such as psychological assessment, mobile services, and computer games. In previous research, numerous studies utilized manually engineered features for emotion classification, achieving commendable accuracy. However, these features tend to underperform in complex scenarios, leading to reduced classification accuracy. Such scenarios include: (1) datasets that contain diverse speech patterns, dialects, accents, or variations in emotional expression; (2) data with background noise; (3) settings where the distribution of emotions varies significantly across datasets; and (4) datasets combined from different sources, which introduce complexities due to variations in recording conditions, data quality, and emotional expression. Consequently, there is a need to improve the classification performance of SER techniques. To address this, a novel SER framework was introduced in this study. Prior to feature extraction, signal preprocessing and data augmentation methods were applied to enlarge the available data, and 18 informative features were derived from each signal. A discriminative feature set was then obtained using feature selection techniques and used as input for emotion recognition on the SAVEE, RAVDESS, and EMO-DB datasets. Furthermore, this research also implemented a cross-corpus model that incorporated all speech files for the emotions common to the three datasets. The experimental outcomes demonstrated the superior performance of the proposed SER framework compared to existing frameworks in the field. Notably, the framework achieved remarkable accuracy across the datasets: specifically, 95%, 94%, 97%, and 97% on SAVEE, RAVDESS, EMO-DB, and the cross-corpus dataset, respectively. These results underscore the significant contribution of the proposed framework to the field of SER.
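The abstract does not specify the augmentation operations or enumerate the 18 features, so the following is a minimal Python sketch of how such an augmentation and feature-extraction stage is commonly assembled with librosa. The choices here (noise injection, pitch shifting, and time stretching for augmentation; 13 MFCC means plus five spectral/prosodic statistics for an 18-dimensional vector) are illustrative assumptions, as are the file name and parameter values.

```python
# Hedged sketch of the augmentation + feature-extraction stage.
# The exact operations and the 18-feature set are assumptions,
# not the paper's documented configuration.
import numpy as np
import librosa

def augment(y, sr):
    """Return the original signal plus three augmented variants."""
    noisy = y + 0.005 * np.random.randn(len(y))              # additive noise
    shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)
    stretched = librosa.effects.time_stretch(y, rate=0.9)
    return [y, noisy, shifted, stretched]

def extract_features(y, sr):
    """18-dim vector: 13 MFCC means + ZCR, RMS, spectral centroid,
    rolloff, and chroma means (an assumed feature set)."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)
    zcr = librosa.feature.zero_crossing_rate(y).mean()
    rms = librosa.feature.rms(y=y).mean()
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr).mean()
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr).mean()
    chroma = librosa.feature.chroma_stft(y=y, sr=sr).mean()
    return np.concatenate([mfcc, [zcr, rms, centroid, rolloff, chroma]])

y, sr = librosa.load("speech.wav", sr=16000)   # hypothetical input file
X = np.stack([extract_features(v, sr) for v in augment(y, sr)])
print(X.shape)  # (4, 18): one row per signal variant
```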
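Likewise, the abstract names neither the feature-selection technique nor the classifier; the sketch below stands in ANOVA F-scoring (scikit-learn's SelectKBest) and a random forest, with the cross-corpus setup approximated by pooling the shared-emotion files from the three datasets. `X` and `y` are assumed to be the feature matrix and emotion labels produced by the previous stage.

```python
# Hedged sketch of the selection + classification stage. The selector,
# classifier, and k value are stand-ins, not the paper's documented choices.
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

clf = make_pipeline(
    SelectKBest(f_classif, k=12),   # keep the most discriminative dims (k assumed)
    RandomForestClassifier(n_estimators=200, random_state=0),
)
scores = cross_val_score(clf, X, y, cv=5)
print(f"mean CV accuracy: {scores.mean():.2%}")

# Cross-corpus variant (per the abstract): pool the files for emotions
# shared by SAVEE, RAVDESS, and EMO-DB before training, e.g.:
# X_cross = np.vstack([X_savee, X_ravdess, X_emodb])
# y_cross = np.concatenate([y_savee, y_ravdess, y_emodb])
```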