Department of Signal Theory and Communications, University of Alcalá, 28805 Alcalá de Henares, Madrid, Spain.
Sensors (Basel). 2022 Mar 18;22(6):2343. doi: 10.3390/s22062343.
Every human being experiences emotions daily, e.g., joy, sadness, fear, and anger. These may be revealed through speech: words are often accompanied by our emotional states when we talk. Several acoustic emotional databases are freely available for the Emotional Speech Recognition (ESR) task. Unfortunately, many of them were generated under non-real-world conditions: actors played the emotions, which were recorded under fictitious, noise-free circumstances. Another weakness in the design of emotion recognition systems is the scarcity of patterns in the available databases, which causes generalization problems and leads to overfitting. This paper examines how different elements of the recording environment impact system performance, using a simple logistic regression algorithm. Specifically, we conducted experiments simulating different scenarios with varying levels of Gaussian white noise, real-world noise, and reverberation. The results show a performance deterioration in all scenarios, with the error probability increasing from 25.57% to 79.13% in the worst case. Additionally, a virtual enlargement method and a robust multi-scenario speech-based emotion recognition system are proposed. Our system's average error probability of 34.57% is comparable to that of the best-case scenario, 31.55%. The findings support the prediction that simulated emotional speech databases do not offer sufficient closeness to real scenarios.
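The abstract mentions corrupting speech with different levels of Gaussian white noise. The paper's exact procedure is not given here; the following is a minimal sketch, assuming the common approach of scaling the noise so that a chosen signal-to-noise ratio (in dB) is met exactly. The function name `add_white_noise` and the synthetic sine "utterance" are illustrative, not from the paper.

```python
import numpy as np

def add_white_noise(signal: np.ndarray, snr_db: float, seed=None) -> np.ndarray:
    """Add Gaussian white noise to a signal at a target SNR in dB.

    The noise is scaled so that 10*log10(P_signal / P_noise) == snr_db.
    """
    rng = np.random.default_rng(seed)
    signal_power = np.mean(signal ** 2)
    noise = rng.standard_normal(signal.shape)
    noise_power = np.mean(noise ** 2)
    # Scale factor that makes the signal-to-noise power ratio hit snr_db.
    scale = np.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))
    return signal + scale * noise

# Example: corrupt a synthetic 1-second tone (a stand-in for speech) at 10 dB SNR.
fs = 16000
t = np.arange(fs) / fs
clean = np.sin(2 * np.pi * 220 * t)
noisy = add_white_noise(clean, snr_db=10.0, seed=0)
```

Repeating this at several SNR levels yields the progressively degraded copies of a database that an experiment like the one described would train and test on.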