
Speech Emotion Recognition Using Convolution Neural Networks and Multi-Head Convolutional Transformer.

Affiliations

Wireless Communication Ecosystem Research Unit, Department of Electrical Engineering, Chulalongkorn University, Bangkok 10330, Thailand.

Department of Electrical Engineering, Main Campus, University of Science & Technology, Bannu 28100, Pakistan.

Publication

Sensors (Basel). 2023 Jul 7;23(13):6212. doi: 10.3390/s23136212.

DOI: 10.3390/s23136212
PMID: 37448062
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10346498/
Abstract

Speech emotion recognition (SER) is a challenging task in human-computer interaction (HCI) systems. One of the key challenges in speech emotion recognition is to extract the emotional features effectively from a speech utterance. Despite the promising results of recent studies, they generally do not leverage advanced fusion algorithms for the generation of effective representations of emotional features in speech utterances. To address this problem, we describe the fusion of spatial and temporal feature representations of speech emotion by parallelizing convolutional neural networks (CNNs) and a Transformer encoder for SER. We stack two parallel CNNs for spatial feature representation in parallel to a Transformer encoder for temporal feature representation, thereby simultaneously expanding the filter depth and reducing the feature map with an expressive hierarchical feature representation at a lower computational cost. We use the RAVDESS dataset to recognize eight different speech emotions. We augment and intensify the variations in the dataset to minimize model overfitting. Additive White Gaussian Noise (AWGN) is used to augment the RAVDESS dataset. With the spatial and sequential feature representations of CNNs and the Transformer, the SER model achieves 82.31% accuracy for eight emotions on a hold-out dataset. In addition, the SER system is evaluated with the IEMOCAP dataset and achieves 79.42% recognition accuracy for five emotions. Experimental results on the RAVDESS and IEMOCAP datasets show the success of the presented SER system and demonstrate an absolute performance improvement over the state-of-the-art (SOTA) models.
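The AWGN augmentation step mentioned in the abstract can be sketched as follows. This is a generic, minimal illustration of adding white Gaussian noise at a target signal-to-noise ratio; the function name, the 15 dB SNR, and the synthetic test tone are illustrative assumptions, not values from the paper.

```python
import numpy as np

def add_awgn(signal: np.ndarray, snr_db: float, rng=None) -> np.ndarray:
    """Add white Gaussian noise scaled to a target SNR (in dB)."""
    if rng is None:
        rng = np.random.default_rng(0)
    signal_power = np.mean(signal ** 2)
    # SNR(dB) = 10 * log10(P_signal / P_noise)  =>  P_noise = P_signal / 10^(SNR/10)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

# Example: corrupt a 1-second, 440 Hz tone sampled at 16 kHz with 15 dB SNR noise.
t = np.linspace(0, 1, 16000, endpoint=False)
clean = np.sin(2 * np.pi * 440 * t)
noisy = add_awgn(clean, snr_db=15)
```

In an augmentation pipeline, each training utterance would be duplicated at one or more SNR levels so the model sees noise-perturbed variants of every emotion class.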

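The temporal branch of the described architecture relies on a Transformer encoder, whose core operation is multi-head scaled dot-product self-attention over the frame sequence. The sketch below is a generic NumPy illustration of that operation only; the random projection weights, toy dimensions, and function name are assumptions for demonstration, not the paper's trained model.

```python
import numpy as np

def multi_head_self_attention(x: np.ndarray, num_heads: int, rng=None) -> np.ndarray:
    """Scaled dot-product self-attention over time frames, split across heads."""
    T, d = x.shape
    assert d % num_heads == 0, "model dim must divide evenly across heads"
    dh = d // num_heads
    if rng is None:
        rng = np.random.default_rng(0)
    # Toy query/key/value projections (a real model learns these).
    Wq = rng.normal(scale=d ** -0.5, size=(d, d))
    Wk = rng.normal(scale=d ** -0.5, size=(d, d))
    Wv = rng.normal(scale=d ** -0.5, size=(d, d))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    out = np.empty_like(x)
    for h in range(num_heads):
        s = slice(h * dh, (h + 1) * dh)
        scores = q[:, s] @ k[:, s].T / np.sqrt(dh)      # (T, T) frame-to-frame scores
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)              # row-wise softmax
        out[:, s] = w @ v[:, s]                         # weighted mix of value frames
    return out

# Example: 50 spectrogram frames with 32-dimensional features, 4 heads.
frames = np.random.default_rng(1).normal(size=(50, 32))
attended = multi_head_self_attention(frames, num_heads=4)
```

In the paper's fusion scheme, a representation like this (from the Transformer branch) would be combined with the spatial feature maps produced by the two parallel CNNs before classification.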

Figures 1-10 (PMC image files):
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1f70/10346498/71f6d5a73339/sensors-23-06212-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1f70/10346498/e4c5be5562a8/sensors-23-06212-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1f70/10346498/0f14c0cc0b2b/sensors-23-06212-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1f70/10346498/a7f935565b90/sensors-23-06212-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1f70/10346498/ce88977aa70f/sensors-23-06212-g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1f70/10346498/412cc016a9a6/sensors-23-06212-g006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1f70/10346498/df3bcd612ddc/sensors-23-06212-g007.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1f70/10346498/7baac8204bcc/sensors-23-06212-g008.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1f70/10346498/ae813657a5fe/sensors-23-06212-g009.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1f70/10346498/8e175f001e97/sensors-23-06212-g010.jpg

Similar Articles

1. Speech Emotion Recognition Using Convolution Neural Networks and Multi-Head Convolutional Transformer.
Sensors (Basel). 2023 Jul 7;23(13):6212. doi: 10.3390/s23136212.
2. Impact of Feature Selection Algorithm on Speech Emotion Recognition Using Deep Convolutional Neural Network.
Sensors (Basel). 2020 Oct 23;20(21):6008. doi: 10.3390/s20216008.
3. Fusion-ConvBERT: Parallel Convolution and BERT Fusion for Speech Emotion Recognition.
Sensors (Basel). 2020 Nov 23;20(22):6688. doi: 10.3390/s20226688.
4. Fusing Visual Attention CNN and Bag of Visual Words for Cross-Corpus Speech Emotion Recognition.
Sensors (Basel). 2020 Sep 28;20(19):5559. doi: 10.3390/s20195559.
5. Speech emotion recognition based on improved masking EMD and convolutional recurrent neural network.
Front Psychol. 2023 Jan 9;13:1075624. doi: 10.3389/fpsyg.2022.1075624. eCollection 2022.
6. Combining a parallel 2D CNN with a self-attention Dilated Residual Network for CTC-based discrete speech emotion recognition.
Neural Netw. 2021 Sep;141:52-60. doi: 10.1016/j.neunet.2021.03.013. Epub 2021 Mar 23.
7. Deep-Net: A Lightweight CNN-Based Speech Emotion Recognition System Using Deep Frequency Features.
Sensors (Basel). 2020 Sep 12;20(18):5212. doi: 10.3390/s20185212.
8. A CNN-Assisted Enhanced Audio Signal Processing for Speech Emotion Recognition.
Sensors (Basel). 2019 Dec 28;20(1):183. doi: 10.3390/s20010183.
9. Age and Gender Recognition Using a Convolutional Neural Network with a Specially Designed Multi-Attention Module through Speech Spectrograms.
Sensors (Basel). 2021 Sep 1;21(17):5892. doi: 10.3390/s21175892.
10. A comprehensive study on bilingual and multilingual speech emotion recognition using a two-pass classification scheme.
PLoS One. 2019 Aug 15;14(8):e0220386. doi: 10.1371/journal.pone.0220386. eCollection 2019.

Cited By

1. Analysis and Research on Spectrogram-Based Emotional Speech Signal Augmentation Algorithm.
Entropy (Basel). 2025 Jun 15;27(6):640. doi: 10.3390/e27060640.
2. An Improved BM3D Algorithm Based on Image Depth Feature Map and Structural Similarity Block-Matching.
Sensors (Basel). 2023 Aug 18;23(16):7265. doi: 10.3390/s23167265.

References

1. Multi-Stream Convolution-Recurrent Neural Networks Based on Attention Mechanism Fusion for Speech Emotion Recognition.
Entropy (Basel). 2022 Jul 26;24(8):1025. doi: 10.3390/e24081025.
2. Detection of fake news and hate speech for Ethiopian languages: a systematic review of the approaches.
J Big Data. 2022;9(1):66. doi: 10.1186/s40537-022-00619-x. Epub 2022 May 19.
3. The Impact of Attention Mechanisms on Speech Emotion Recognition.
Sensors (Basel). 2021 Nov 12;21(22):7530. doi: 10.3390/s21227530.
4. Robust Multimodal Emotion Recognition from Conversation with Transformer-Based Crossmodality Fusion.
Sensors (Basel). 2021 Jul 19;21(14):4913. doi: 10.3390/s21144913.
5. Combining a parallel 2D CNN with a self-attention Dilated Residual Network for CTC-based discrete speech emotion recognition.
Neural Netw. 2021 Sep;141:52-60. doi: 10.1016/j.neunet.2021.03.013. Epub 2021 Mar 23.
6. Improving Speech Emotion Recognition With Adversarial Data Augmentation Network.
IEEE Trans Neural Netw Learn Syst. 2022 Jan;33(1):172-184. doi: 10.1109/TNNLS.2020.3027600. Epub 2022 Jan 5.
7. Deep-Net: A Lightweight CNN-Based Speech Emotion Recognition System Using Deep Frequency Features.
Sensors (Basel). 2020 Sep 12;20(18):5212. doi: 10.3390/s20185212.
8. Multi-Modality Emotion Recognition Model with GAT-Based Multi-Head Inter-Modality Attention.
Sensors (Basel). 2020 Aug 29;20(17):4894. doi: 10.3390/s20174894.
9. A CNN-Assisted Enhanced Audio Signal Processing for Speech Emotion Recognition.
Sensors (Basel). 2019 Dec 28;20(1):183. doi: 10.3390/s20010183.
10. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English.
PLoS One. 2018 May 16;13(5):e0196391. doi: 10.1371/journal.pone.0196391. eCollection 2018.