Pre-trained Deep Convolution Neural Network Model With Attention for Speech Emotion Recognition.

Author information

Zhang Hua, Gou Ruoyun, Shang Jili, Shen Fangyao, Wu Yifan, Dai Guojun

Affiliations

School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China.

Key Laboratory of Network Multimedia Technology of Zhejiang Province, Zhejiang University, Hangzhou, China.

Publication information

Front Physiol. 2021 Mar 2;12:643202. doi: 10.3389/fphys.2021.643202. eCollection 2021.

DOI:10.3389/fphys.2021.643202

PMID:33737889

Original link: https://pmc.ncbi.nlm.nih.gov/articles/PMC7962985/

Abstract

Speech emotion recognition (SER) is a challenging task because of the affective variation between speakers. SER performance depends heavily on the features extracted from speech signals, and building an effective feature extraction and classification model remains difficult. In this paper, we propose a new SER method based on a Deep Convolution Neural Network (DCNN) and a Bidirectional Long Short-Term Memory with Attention (BLSTMwA) model (DCNN-BLSTMwA). First, we preprocess the speech samples with data enhancement and dataset balancing. Second, we extract three channels of log Mel-spectrograms (static, delta, and delta-delta) as the DCNN input. A DCNN pre-trained on the ImageNet dataset then generates segment-level features, and we stack the segment-level features of each sentence into utterance-level features. Next, a BLSTM learns high-level emotional features for temporal summarization, followed by an attention layer that focuses on emotionally relevant features. Finally, the learned high-level emotional features are fed into a Deep Neural Network (DNN) to predict the final emotion. Experiments on the EMO-DB and IEMOCAP databases obtain an unweighted average recall (UAR) of 87.86% and 68.50%, respectively, which is better than most popular SER methods and demonstrates the effectiveness of the proposed method.
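
Three ideas named in the abstract can be illustrated with a small sketch: the three-channel input (static, delta, delta-delta), attention-weighted pooling over time steps, and the UAR metric. This is only an illustration under stated assumptions, not the authors' implementation. The function names are hypothetical, the delta uses a simple two-point difference with repeated edges (the paper's exact delta window is not given here), and the attention scores are passed in directly rather than learned from BLSTM outputs.

```python
import math
from collections import defaultdict


def delta(frames):
    """First-order difference over time: (x[t+1] - x[t-1]) / 2, edges repeated.

    Assumption: a two-point window; the paper's actual window is not stated here.
    """
    last = len(frames) - 1
    return [
        [(frames[min(t + 1, last)][b] - frames[max(t - 1, 0)][b]) / 2.0
         for b in range(len(frames[t]))]
        for t in range(len(frames))
    ]


def three_channel(static):
    """Stack static, delta, and delta-delta features as three channels."""
    d1 = delta(static)
    d2 = delta(d1)
    return [static, d1, d2]


def attention_pool(frames, scores):
    """Softmax the per-step scores, then return the weighted sum of the frames.

    Assumption: scores are given; in a real model they would be learned.
    """
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(frames[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
    return pooled, weights


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally regardless of size."""
    total = defaultdict(int)    # number of samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


if __name__ == "__main__":
    # Toy one-band "spectrogram" with three frames.
    print(three_channel([[0.0], [1.0], [4.0]]))
    # Equal scores give uniform attention weights.
    print(attention_pool([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]))
    # Imbalanced toy labels: three samples of class "a", one of class "b".
    print(unweighted_average_recall(["a", "a", "a", "b"], ["a", "a", "b", "b"]))
```

In the toy label example, plain accuracy is 0.75 (three of four correct), while UAR averages the per-class recalls 2/3 and 1, giving about 0.833. That difference is why UAR is often preferred on imbalanced emotion datasets.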
