Xu Zhijing, Gao Yang
College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China.
Math Biosci Eng. 2024 Jan 17;21(2):2488-2514. doi: 10.3934/mbe.2024110.
Multimodal emotion analysis integrates information from multiple modalities to better understand human emotions. In this paper, we propose the Cross-modal Emotion Recognition based on Multi-layer Semantic Fusion (CM-MSF) model, which leverages the complementarity of important information between modalities and extracts high-level features adaptively. To extract comprehensive and rich features from multimodal sources across different dimensions and depth levels, we design a parallel deep learning module that extracts features from each modality separately and aligns the extracted features cost-effectively. Furthermore, a cascaded cross-modal encoder module based on Bidirectional Long Short-Term Memory (BiLSTM) layers and 1D convolution (Conv1D) is introduced to promote inter-modal information complementation. This module enables seamless integration of information across modalities and effectively addresses the challenges posed by signal heterogeneity. To enable flexible and adaptive information selection and delivery, we design the Mask-gated Fusion network (MGF-module), which combines masking with gating structures. Gating vectors precisely control the information flow of each modality, mitigating the low recognition accuracy and emotional misjudgment caused by complex features and noisy, redundant information. The CM-MSF model was evaluated on the widely used multimodal emotion recognition datasets CMU-MOSI and CMU-MOSEI. It achieves binary classification accuracies of 89.1% and 88.6%, with F1 scores of 87.9% and 88.1%, on CMU-MOSI and CMU-MOSEI, respectively. These results validate the effectiveness of our approach in accurately recognizing and classifying emotions.
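The abstract does not include implementation details, but the mask-gated fusion idea can be illustrated with a minimal PyTorch sketch: a gating vector for each modality is computed from the joint representation of all modalities, optionally combined with a binary mask, and used to scale that modality's contribution to the fused feature. The class name, dimensions, and exact gating formulation below are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class MaskGatedFusion(nn.Module):
    """Illustrative mask-gated fusion (hypothetical sketch, not the authors' code):
    a per-modality gate, conditioned on all modalities jointly, scales each
    modality's features before they are summed into a fused representation."""

    def __init__(self, dim: int, num_modalities: int = 3):
        super().__init__()
        self.num_modalities = num_modalities
        # One gate per modality, conditioned on the concatenation of all modalities.
        self.gates = nn.ModuleList(
            [nn.Linear(dim * num_modalities, dim) for _ in range(num_modalities)]
        )

    def forward(self, feats, masks=None):
        # feats: list of (batch, dim) aligned modality features (e.g., text, audio, video)
        # masks: optional list of (batch, dim) binary masks flagging valid/important features
        joint = torch.cat(feats, dim=-1)
        fused = torch.zeros_like(feats[0])
        for i, x in enumerate(feats):
            g = torch.sigmoid(self.gates[i](joint))   # gating vector in (0, 1)
            if masks is not None:
                g = g * masks[i]                      # masking suppresses noisy/redundant features
            fused = fused + g * x                     # gate controls each modality's information flow
        return fused

# Usage sketch: three modalities with 128-dimensional aligned features.
if __name__ == "__main__":
    fusion = MaskGatedFusion(dim=128, num_modalities=3)
    t, a, v = (torch.randn(4, 128) for _ in range(3))
    out = fusion([t, a, v])
    print(out.shape)  # torch.Size([4, 128])
```

In this sketch the sigmoid gate plays the role of the gating vector described above, while the optional mask zeroes out feature positions deemed unreliable; the paper's actual module may differ in how the gate and mask are parameterized.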