Tao Huawei, Geng Lei, Shan Shuai, Mai Jingchao, Fu Hongliang
College of Information Science and Engineering, Henan University of Technology, Zhengzhou 450001, China.
Entropy (Basel). 2022 Jul 26;24(8):1025. doi: 10.3390/e24081025.
The quality of feature extraction plays a significant role in the performance of speech emotion recognition (SER). To extract discriminative, affect-salient features from speech signals and thereby improve SER performance, this paper proposes a multi-stream convolution-recurrent neural network based on an attention mechanism (MSCRNN-A). First, a multi-stream sub-branch fully convolutional network (MSFCN) based on AlexNet is presented to limit the loss of emotional information: sub-branches are added after each pooling layer to retain features at different resolutions, and the features from the different branches are fused by addition. Second, the MSFCN is combined with a Bi-LSTM network to form a hybrid network that extracts speech emotion features while supplying the temporal structure of those features. Finally, a feature fusion model based on a multi-head attention mechanism is developed to obtain the best fused features. The attention mechanism computes the contribution degree of each network's features and then realizes adaptive fusion by weighting the features accordingly. To restrain gradient divergence in the network, the individual network features and the fused features are connected through shortcut connections to produce the final fusion features used for recognition. Experimental results on three conventional SER corpora, CASIA, EMODB, and SAVEE, show that the proposed method significantly improves recognition performance, with recognition rates superior to most existing state-of-the-art methods.
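The fusion step described above — scoring each network's features, normalizing the scores into contribution weights, taking an adaptive weighted sum, and adding a shortcut connection — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the scoring vector `w_score` stands in for the learned multi-head attention parameters, and `fuse_streams` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_streams(streams, w_score):
    """Attention-weighted fusion of feature streams with a shortcut.

    streams: list of (d,) feature vectors from different sub-networks
             (e.g. the MSFCN output and the Bi-LSTM output).
    w_score: (d,) scoring vector, a stand-in for learned attention weights.
    """
    feats = np.stack(streams)                        # (n_streams, d)
    scores = feats @ w_score                         # one score per stream
    weights = softmax(scores)                        # contribution degrees
    fused = (weights[:, None] * feats).sum(axis=0)   # adaptive weighted sum
    fused = fused + feats.sum(axis=0)                # shortcut: re-add raw streams
    return fused, weights

d = 8
f_cnn = rng.normal(size=d)   # stand-in for MSFCN features
f_rnn = rng.normal(size=d)   # stand-in for Bi-LSTM features
w_score = rng.normal(size=d)
fused, weights = fuse_streams([f_cnn, f_rnn], w_score)
print(weights)  # contribution degrees sum to 1
```

In the paper, this single scoring vector is replaced by a multi-head attention module, and the fused vector is passed to the recognition classifier.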