

Speech emotion recognition with light weight deep neural ensemble model using hand crafted features.

Authors

Chowdhury Jaher Hassan, Ramanna Sheela, Kotecha Ketan

Affiliations

The University of Winnipeg, 515 Portage Avenue, Winnipeg, Manitoba, Canada.

Symbiosis International (Deemed University), Pune, Maharashtra, 412115, India.

Publication

Sci Rep. 2025 Apr 7;15(1):11824. doi: 10.1038/s41598-025-95734-z.

DOI: 10.1038/s41598-025-95734-z
PMID: 40195486
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11977261/
Abstract

Automatic emotion detection has become crucial in various domains, such as healthcare, neuroscience, smart home technologies, and human-computer interaction (HCI). Speech Emotion Recognition (SER) has attracted considerable attention because of its potential to improve conversational robotics and human-computer interaction (HCI) systems. Despite its promise, SER research faces challenges such as data scarcity, the subjective nature of emotions, and complex feature extraction methods. In this paper, we seek to investigate whether a lightweight deep neural ensemble model (CNN and CNN_Bi-LSTM) using well-known hand-crafted features such as ZCR, RMSE, Chroma STFT, and MFCC would outperform models that use automatic feature extraction techniques (e.g., spectrogram-based methods) on benchmarked datasets. The focus of this paper is on the effectiveness of careful fine-tuning of the neural models with learning rate (LR) schedulers and applying regularization techniques. Our proposed ensemble model is validated using five publicly available datasets: RAVDESS, TESS, SAVEE, CREMA-D, and EmoDB. Accuracy, AUC-ROC, AUC-PRC, and F1-score metrics were used for performance testing, and the LIME (Local Interpretable Model-agnostic Explanations) technique was used for interpreting the results of our proposed ensemble model. Results indicate that our ensemble model consistently outperforms individual models, as well as several compared models which include spectrogram-based models for the above datasets in terms of the evaluation metrics.
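The hand-crafted features the abstract names (ZCR, RMSE, Chroma STFT, MFCC) are standard frame-level descriptors of an audio signal; in SER work "RMSE" denotes the root-mean-square energy of each frame, not a regression error. As an illustrative sketch only — not the authors' implementation, which in practice would use a library such as librosa — the two simplest of these, ZCR and RMSE, can be computed per frame like this:

```python
import math

def frame_features(signal, frame_len=2048, hop=512):
    """Per-frame zero-crossing rate (ZCR) and root-mean-square energy
    (RMSE) for a mono audio signal given as a list of samples."""
    feats = []
    for start in range(0, max(len(signal) - frame_len, 0) + 1, hop):
        frame = signal[start:start + frame_len]
        # ZCR: fraction of adjacent sample pairs whose signs differ
        crossings = sum(
            1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
        )
        zcr = crossings / (len(frame) - 1)
        # RMSE: square root of the mean squared amplitude in the frame
        rmse = math.sqrt(sum(x * x for x in frame) / len(frame))
        feats.append((zcr, rmse))
    return feats
```

Each frame then contributes one feature vector; concatenated with Chroma STFT and MFCC coefficients, such vectors form the input to the CNN and CNN_Bi-LSTM ensemble members described above.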


Figures:
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/11a2/11977261/4dab54bb0c85/41598_2025_95734_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/11a2/11977261/904ceb0f3667/41598_2025_95734_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/11a2/11977261/8d91342beaf5/41598_2025_95734_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/11a2/11977261/f9db28a7aecb/41598_2025_95734_Fig4_HTML.jpg

Similar Articles

1. Speech emotion recognition with light weight deep neural ensemble model using hand crafted features.
   Sci Rep. 2025 Apr 7;15(1):11824. doi: 10.1038/s41598-025-95734-z.
2. Fusing Visual Attention CNN and Bag of Visual Words for Cross-Corpus Speech Emotion Recognition.
   Sensors (Basel). 2020 Sep 28;20(19):5559. doi: 10.3390/s20195559.
3. A CNN-Assisted Enhanced Audio Signal Processing for Speech Emotion Recognition.
   Sensors (Basel). 2019 Dec 28;20(1):183. doi: 10.3390/s20010183.
4. A multi-dilated convolution network for speech emotion recognition.
   Sci Rep. 2025 Mar 10;15(1):8254. doi: 10.1038/s41598-025-92640-2.
5. IoT-Enabled WBAN and Machine Learning for Speech Emotion Recognition in Patients.
   Sensors (Basel). 2023 Mar 8;23(6):2948. doi: 10.3390/s23062948.
6. An enhanced speech emotion recognition using vision transformer.
   Sci Rep. 2024 Jun 7;14(1):13126. doi: 10.1038/s41598-024-63776-4.
7. Human-Computer Interaction with a Real-Time Speech Emotion Recognition with Ensembling Techniques 1D Convolution Neural Network and Attention.
   Sensors (Basel). 2023 Jan 26;23(3):1386. doi: 10.3390/s23031386.
8. Impact of Feature Selection Algorithm on Speech Emotion Recognition Using Deep Convolutional Neural Network.
   Sensors (Basel). 2020 Oct 23;20(21):6008. doi: 10.3390/s20216008.
9. A Combined CNN Architecture for Speech Emotion Recognition.
   Sensors (Basel). 2024 Sep 6;24(17):5797. doi: 10.3390/s24175797.
10. Speech Emotion Recognition Using Attention Model.
    Int J Environ Res Public Health. 2023 Mar 14;20(6):5140. doi: 10.3390/ijerph20065140.

Cited By

1. HGLER: A hierarchical heterogeneous graph networks for enhanced multimodal emotion recognition in conversations.
   PLoS One. 2025 Sep 5;20(9):e0330632. doi: 10.1371/journal.pone.0330632. eCollection 2025.
2. A Comprehensive Review of Multimodal Emotion Recognition: Techniques, Challenges, and Future Directions.
   Biomimetics (Basel). 2025 Jun 27;10(7):418. doi: 10.3390/biomimetics10070418.

References

1. Validation of a Biomechanical Injury and Disease Assessment Platform Applying an Inertial-Based Biosensor and Axis Vector Computation.
   Electronics (Basel). 2023 Sep 1;12(17). doi: 10.3390/electronics12173694. Epub 2023 Aug 31.
2. Implementation of Lightweight Convolutional Neural Networks via Layer-Wise Differentiable Compression.
   Sensors (Basel). 2021 May 16;21(10):3464. doi: 10.3390/s21103464.
3. The Effect of Co-Verbal Remote Touch on Electrodermal Activity and Emotional Response in Dyadic Discourse.
   Sensors (Basel). 2020 Dec 29;21(1):168. doi: 10.3390/s21010168.
4. EEG-Based Emotion Recognition: A State-of-the-Art Review of Current Trends and Opportunities.
   Comput Intell Neurosci. 2020 Sep 16;2020:8875426. doi: 10.1155/2020/8875426. eCollection 2020.
5. Fusing Visual Attention CNN and Bag of Visual Words for Cross-Corpus Speech Emotion Recognition.
   Sensors (Basel). 2020 Sep 28;20(19):5559. doi: 10.3390/s20195559.
6. Predictions for COVID-19 with deep learning models of LSTM, GRU and Bi-LSTM.
   Chaos Solitons Fractals. 2020 Nov;140:110212. doi: 10.1016/j.chaos.2020.110212. Epub 2020 Aug 19.
7. A CNN-Assisted Enhanced Audio Signal Processing for Speech Emotion Recognition.
   Sensors (Basel). 2019 Dec 28;20(1):183. doi: 10.3390/s20010183.
8. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English.
   PLoS One. 2018 May 16;13(5):e0196391. doi: 10.1371/journal.pone.0196391. eCollection 2018.
9. Recognizing Patients' Emotions: Teaching Health Care Providers to Interpret Facial Expressions.
   Acad Med. 2016 Sep;91(9):1270-5. doi: 10.1097/ACM.0000000000001163.
10. CREMA-D: Crowd-sourced Emotional Multimodal Actors Dataset.
    IEEE Trans Affect Comput. 2014 Oct-Dec;5(4):377-390. doi: 10.1109/TAFFC.2014.2336244.