Wang Yadi, Guo Xiaoding, Hou Xianhong, Miao Zhijun, Yang Xiaojin, Guo Jinkai
School of Computer and Information Engineering, Henan University, Kaifeng, 475004, China; Henan Key Laboratory of Big Data Analysis and Processing, Kaifeng, 475004, China.
School of Computer and Information Engineering, Henan University, Kaifeng, 475004, China.
Neural Netw. 2025 Aug;188:107483. doi: 10.1016/j.neunet.2025.107483. Epub 2025 Apr 25.
Multimodal emotion recognition predicts emotions from text, visual, and acoustic modalities, an area in which some progress has been made. Previous approaches fall short in two respects: handling complementary information across modalities, and avoiding long-term dependencies while selecting the most important joint modal features. In this paper, we propose MSRG, a new multimodal emotion recognition framework consisting of feature extraction (FE), emotional intensity attention (EIA), time-step level fusion (TLF), utterance level fusion (ULF), and a sentiment inference module (SIM). EIA comprises adaptive multimodal linear pooling (AMLP) and joint cross-attention fusion (JCAF). AMLP uses an adaptive multimodal fusion strategy to dynamically compute adaptive coefficients for the three modalities, then applies a pooling operation to obtain joint modal features. JCAF computes the attention weights and attention features of each modality from the cross-correlation between the individual and joint features. TLF aligns and fuses features at the time-step level, then processes the fused sequences with a residual gating network (RGN). The resulting time-step level fused features pass through two fully connected layers and an activation layer to yield the time-step level emotion intensity. ULF concatenates the utterance level representations of the three modalities and feeds the fused features into a fully connected layer to yield the utterance level emotion intensity. Finally, both the time-step level and the utterance level emotion intensities are input into SIM to produce the final emotion prediction. Experiments demonstrate that MSRG achieves better prediction performance on the CMU-MOSI and CMU-MOSEI datasets.
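The abstract names the fusion steps but gives no equations, so the following is only a minimal sketch of three of them at toy scale: adaptive coefficients followed by pooling (in the spirit of AMLP), a residual gate (in the spirit of RGN), and concatenation followed by a linear layer (in the spirit of ULF). Every function name, the scoring vector `score_vec`, the softmax normalisation, and the elementwise gate weights are assumptions made for illustration; they are not taken from the paper and may differ from the actual MSRG formulation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def adaptive_pool(modal_feats, score_vec):
    """Weight each modality by an input-dependent coefficient, then sum.

    modal_feats: three feature vectors of equal length (text, visual, acoustic).
    score_vec: hypothetical learned scoring vector (an assumption of this sketch).
    """
    coeffs = softmax([dot(f, score_vec) for f in modal_feats])
    dim = len(score_vec)
    joint = [sum(c * f[i] for c, f in zip(coeffs, modal_feats)) for i in range(dim)]
    return coeffs, joint


def residual_gate(x, gate_w, cand_w):
    """Add a gated, transformed copy of x back onto x (skip connection).

    Elementwise weights are used here only to keep the sketch short.
    """
    out = []
    for xi, gw, cw in zip(x, gate_w, cand_w):
        gate = 1.0 / (1.0 + math.exp(-gw * xi))  # sigmoid gate in (0, 1)
        out.append(xi + gate * math.tanh(cw * xi))
    return out


def utterance_fusion(utt_feats, weights, bias):
    """Concatenate utterance-level vectors, then apply one linear layer."""
    fused = [v for feat in utt_feats for v in feat]
    return dot(fused, weights) + bias


# Toy features for a single time step (made-up values).
text, visual, acoustic = [0.2, -0.1, 0.4], [0.0, 0.3, -0.2], [0.1, 0.1, 0.1]
coeffs, joint = adaptive_pool([text, visual, acoustic], [0.5, -0.3, 0.8])
gated = residual_gate(joint, [0.7, 0.7, 0.7], [1.2, 1.2, 1.2])
intensity = utterance_fusion([text, visual, acoustic], [0.1] * 9, 0.0)
print(coeffs, gated, intensity)
```

In this sketch the coefficients always sum to one across the three modalities, and the skip connection in `residual_gate` lets the input pass through unchanged when the transformed term is zero, which is the usual motivation for residual gating on long sequences.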