

Speech emotion recognition with light weight deep neural ensemble model using hand crafted features.

Authors

Chowdhury Jaher Hassan, Ramanna Sheela, Kotecha Ketan

Affiliations

The University of Winnipeg, 515 Portage Avenue, Winnipeg, Manitoba, Canada.

Symbiosis International (Deemed University), Pune, Maharashtra, 412115, India.

Publication

Sci Rep. 2025 Apr 7;15(1):11824. doi: 10.1038/s41598-025-95734-z.

DOI: 10.1038/s41598-025-95734-z
PMID: 40195486
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11977261/
Abstract

Automatic emotion detection has become crucial in various domains, such as healthcare, neuroscience, smart home technologies, and human-computer interaction (HCI). Speech Emotion Recognition (SER) has attracted considerable attention because of its potential to improve conversational robotics and human-computer interaction (HCI) systems. Despite its promise, SER research faces challenges such as data scarcity, the subjective nature of emotions, and complex feature extraction methods. In this paper, we seek to investigate whether a lightweight deep neural ensemble model (CNN and CNN_Bi-LSTM) using well-known hand-crafted features such as ZCR, RMSE, Chroma STFT, and MFCC would outperform models that use automatic feature extraction techniques (e.g., spectrogram-based methods) on benchmarked datasets. The focus of this paper is on the effectiveness of careful fine-tuning of the neural models with learning rate (LR) schedulers and applying regularization techniques. Our proposed ensemble model is validated using five publicly available datasets: RAVDESS, TESS, SAVEE, CREMA-D, and EmoDB. Accuracy, AUC-ROC, AUC-PRC, and F1-score metrics were used for performance testing, and the LIME (Local Interpretable Model-agnostic Explanations) technique was used for interpreting the results of our proposed ensemble model. Results indicate that our ensemble model consistently outperforms individual models, as well as several compared models which include spectrogram-based models for the above datasets in terms of the evaluation metrics.
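The hand-crafted features the abstract names (ZCR, RMSE, Chroma STFT, MFCC) are standard frame-level descriptors of an audio signal; in SER work "RMSE" denotes the root-mean-square energy of each frame, not a regression error. As an illustrative sketch only — not the authors' implementation, which in practice would use a library such as librosa — the two simplest of these, ZCR and RMSE, can be computed per frame like this:

```python
import math

def frame_features(signal, frame_len=2048, hop=512):
    """Per-frame zero-crossing rate (ZCR) and root-mean-square energy
    (RMSE) for a mono audio signal given as a list of samples."""
    feats = []
    for start in range(0, max(len(signal) - frame_len, 0) + 1, hop):
        frame = signal[start:start + frame_len]
        # ZCR: fraction of adjacent sample pairs whose signs differ
        crossings = sum(
            1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
        )
        zcr = crossings / (len(frame) - 1)
        # RMSE: square root of the mean squared amplitude in the frame
        rmse = math.sqrt(sum(x * x for x in frame) / len(frame))
        feats.append((zcr, rmse))
    return feats
```

Each frame then contributes one feature vector; concatenated with Chroma STFT and MFCC coefficients, such vectors form the input to the CNN and CNN_Bi-LSTM ensemble members described above.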


Figures:
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/11a2/11977261/4dab54bb0c85/41598_2025_95734_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/11a2/11977261/904ceb0f3667/41598_2025_95734_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/11a2/11977261/8d91342beaf5/41598_2025_95734_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/11a2/11977261/f9db28a7aecb/41598_2025_95734_Fig4_HTML.jpg

Similar Articles

1. Speech emotion recognition with light weight deep neural ensemble model using hand crafted features.
   Sci Rep. 2025 Apr 7;15(1):11824. doi: 10.1038/s41598-025-95734-z.
2. Fusing Visual Attention CNN and Bag of Visual Words for Cross-Corpus Speech Emotion Recognition.
   Sensors (Basel). 2020 Sep 28;20(19):5559. doi: 10.3390/s20195559.
3. A CNN-Assisted Enhanced Audio Signal Processing for Speech Emotion Recognition.
   Sensors (Basel). 2019 Dec 28;20(1):183. doi: 10.3390/s20010183.
4. A multi-dilated convolution network for speech emotion recognition.
   Sci Rep. 2025 Mar 10;15(1):8254. doi: 10.1038/s41598-025-92640-2.
5. IoT-Enabled WBAN and Machine Learning for Speech Emotion Recognition in Patients.
   Sensors (Basel). 2023 Mar 8;23(6):2948. doi: 10.3390/s23062948.
6. An enhanced speech emotion recognition using vision transformer.
   Sci Rep. 2024 Jun 7;14(1):13126. doi: 10.1038/s41598-024-63776-4.
7. Human-Computer Interaction with a Real-Time Speech Emotion Recognition with Ensembling Techniques 1D Convolution Neural Network and Attention.
   Sensors (Basel). 2023 Jan 26;23(3):1386. doi: 10.3390/s23031386.
8. Impact of Feature Selection Algorithm on Speech Emotion Recognition Using Deep Convolutional Neural Network.
   Sensors (Basel). 2020 Oct 23;20(21):6008. doi: 10.3390/s20216008.
9. A Combined CNN Architecture for Speech Emotion Recognition.
   Sensors (Basel). 2024 Sep 6;24(17):5797. doi: 10.3390/s24175797.
10. Speech Emotion Recognition Using Attention Model.
    Int J Environ Res Public Health. 2023 Mar 14;20(6):5140. doi: 10.3390/ijerph20065140.

Cited By

1. HGLER: A hierarchical heterogeneous graph networks for enhanced multimodal emotion recognition in conversations.
   PLoS One. 2025 Sep 5;20(9):e0330632. doi: 10.1371/journal.pone.0330632. eCollection 2025.
2. A Comprehensive Review of Multimodal Emotion Recognition: Techniques, Challenges, and Future Directions.
   Biomimetics (Basel). 2025 Jun 27;10(7):418. doi: 10.3390/biomimetics10070418.

References

1. Validation of a Biomechanical Injury and Disease Assessment Platform Applying an Inertial-Based Biosensor and Axis Vector Computation.
   Electronics (Basel). 2023 Sep 1;12(17). doi: 10.3390/electronics12173694. Epub 2023 Aug 31.
2. Implementation of Lightweight Convolutional Neural Networks via Layer-Wise Differentiable Compression.
   Sensors (Basel). 2021 May 16;21(10):3464. doi: 10.3390/s21103464.
3. The Effect of Co-Verbal Remote Touch on Electrodermal Activity and Emotional Response in Dyadic Discourse.
   Sensors (Basel). 2020 Dec 29;21(1):168. doi: 10.3390/s21010168.
4. EEG-Based Emotion Recognition: A State-of-the-Art Review of Current Trends and Opportunities.
   Comput Intell Neurosci. 2020 Sep 16;2020:8875426. doi: 10.1155/2020/8875426. eCollection 2020.
5. Fusing Visual Attention CNN and Bag of Visual Words for Cross-Corpus Speech Emotion Recognition.
   Sensors (Basel). 2020 Sep 28;20(19):5559. doi: 10.3390/s20195559.
6. Predictions for COVID-19 with deep learning models of LSTM, GRU and Bi-LSTM.
   Chaos Solitons Fractals. 2020 Nov;140:110212. doi: 10.1016/j.chaos.2020.110212. Epub 2020 Aug 19.
7. A CNN-Assisted Enhanced Audio Signal Processing for Speech Emotion Recognition.
   Sensors (Basel). 2019 Dec 28;20(1):183. doi: 10.3390/s20010183.
8. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English.
   PLoS One. 2018 May 16;13(5):e0196391. doi: 10.1371/journal.pone.0196391. eCollection 2018.
9. Recognizing Patients' Emotions: Teaching Health Care Providers to Interpret Facial Expressions.
   Acad Med. 2016 Sep;91(9):1270-5. doi: 10.1097/ACM.0000000000001163.
10. CREMA-D: Crowd-sourced Emotional Multimodal Actors Dataset.
    IEEE Trans Affect Comput. 2014 Oct-Dec;5(4):377-390. doi: 10.1109/TAFFC.2014.2336244.