Abbaschian Babak, Elmaghraby Adel
Computer Science and Engineering Department, University of Louisville, Louisville, KY 40292, USA.
Sensors (Basel). 2025 Mar 22;25(7):1991. doi: 10.3390/s25071991.
The focus on Speech Emotion Recognition (SER) has dramatically increased in recent years, driven by the need for automatic speech-recognition-based systems and intelligent assistants to enhance user experience by incorporating emotional content. While deep learning techniques have significantly advanced SER systems, their robustness with respect to speaker gender and out-of-distribution data has not been thoroughly examined. Furthermore, standards for SER remain rooted in landmark papers from the 2000s, even though modern deep learning architectures can achieve results comparable or superior to the state of the art of that era. In this research, we address these challenges by creating a new super corpus from existing databases, providing a larger pool of samples. We benchmark this dataset using various deep learning architectures, setting a new baseline for the task. Additionally, our experiments reveal that models trained on this super corpus demonstrate superior generalization and accuracy and exhibit lower gender bias compared to models trained on individual databases. We further show that traditional preprocessing techniques, such as denoising and normalization, are insufficient to address inherent biases in the data. However, our data augmentation approach effectively shifts these biases, improving model fairness across gender groups and emotions and, in some cases, fully debiasing the models.
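The abstract describes augmentation-based debiasing only at a high level; the sketch below illustrates one common form such an approach can take: oversampling the under-represented speaker group with lightly perturbed copies of its recordings until groups are balanced. All function names, the noise-plus-time-stretch perturbation, and the balancing strategy are illustrative assumptions, not the authors' published method.

```python
import numpy as np

def augment(signal, rng, noise_level=0.005, stretch=1.1):
    """Return a perturbed copy of a 1-D audio signal: light additive
    Gaussian noise plus a simple time-stretch via linear resampling.
    Illustrative only -- not production DSP and not the paper's pipeline."""
    n = len(signal)
    # Resample onto a shorter grid to mimic a ~10% time-stretch.
    m = max(2, int(n / stretch))
    stretched = np.interp(np.linspace(0, n - 1, m), np.arange(n), signal)
    return stretched + rng.normal(0.0, noise_level, size=m)

def balance_by_group(samples, labels, rng):
    """Oversample each minority group (e.g., a speaker gender) with
    augmented copies until all groups are equally represented."""
    groups = {g: [s for s, l in zip(samples, labels) if l == g]
              for g in set(labels)}
    target = max(len(v) for v in groups.values())
    out_x, out_y = [], []
    for g, items in groups.items():
        out_x.extend(items)
        out_y.extend([g] * len(items))
        # Pad this group up to the size of the largest group.
        while out_y.count(g) < target:
            out_x.append(augment(items[rng.integers(len(items))], rng))
            out_y.append(g)
    return out_x, out_y
```

For example, a corpus with 3 female and 7 male utterances would come back with 7 of each, the extra female samples being augmented variants of the originals.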