Wen Yuhua, Li Qifei, Zhou Yingying, Gao Yingming, Wen Zhengqi, Tao Jianhua, Li Ya
IEEE Trans Neural Netw Learn Syst. 2025 Jun 18;PP. doi: 10.1109/TNNLS.2025.3578618.
Multimodal sentiment analysis (MSA) integrates various modalities, such as text, image, and audio, to provide a more comprehensive understanding of sentiment. However, effective MSA is challenged by alignment and fusion issues. Alignment requires synchronizing both temporal and semantic information across modalities, while fusion involves integrating these aligned features into a unified representation. Existing methods often address alignment or fusion in isolation, leading to limitations in performance and efficiency. To tackle these issues, we propose a novel framework called dual-stream alignment with hierarchical bottleneck fusion (DashFusion). First, the dual-stream alignment module synchronizes multimodal features through temporal and semantic alignment. Temporal alignment employs cross-modal attention (CA) to establish frame-level correspondences among multimodal sequences. Semantic alignment ensures consistency across the feature space through contrastive learning. Second, supervised contrastive learning (SCL) leverages label information to refine the modality features. Finally, hierarchical bottleneck fusion (HBF) progressively integrates multimodal information through compressed bottleneck tokens, which achieves a balance between performance and computational efficiency. We evaluate DashFusion on three datasets: CMU-MOSI, CMU-MOSEI, and CH-SIMS. Experimental results demonstrate that DashFusion achieves state-of-the-art (SOTA) performance across various metrics, and ablation studies confirm the effectiveness of our alignment and fusion techniques. The code for our experiments is available at https://github.com/ultramarineX/DashFusion.
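To make the bottleneck-fusion idea concrete, the following is a minimal NumPy sketch of one fusion layer in which modalities exchange information only through a small set of compressed bottleneck tokens. This is an illustration under simplifying assumptions (single-head attention, no learned projections, simple averaging across modalities), not the authors' implementation; all function names (`cross_attend`, `bottleneck_fusion_layer`) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, keys_values):
    """Scaled dot-product cross-modal attention: `queries` attend over `keys_values`."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)  # (Tq, Tk) frame-level correspondences
    return softmax(scores, axis=-1) @ keys_values  # (Tq, d)

def bottleneck_fusion_layer(modalities, bottleneck):
    """One fusion layer: modalities communicate only via the bottleneck tokens.

    Because the bottleneck has far fewer tokens than the modality sequences,
    the pairwise attention cost is reduced compared with full cross-attention.
    """
    # Step 1: bottleneck tokens gather a compressed summary from every modality.
    gathered = np.mean([cross_attend(bottleneck, m) for m in modalities], axis=0)
    # Step 2: each modality reads the shared summary back.
    fused = [cross_attend(m, gathered) for m in modalities]
    return fused, gathered

# Toy usage: three modality sequences of different lengths, 4 bottleneck tokens.
rng = np.random.default_rng(0)
text, audio, video = (rng.normal(size=(t, 8)) for t in (12, 20, 16))
bottleneck = rng.normal(size=(4, 8))
fused, bottleneck = bottleneck_fusion_layer([text, audio, video], bottleneck)
```

Stacking several such layers would yield the progressive, hierarchical integration the abstract describes, with the bottleneck width controlling the trade-off between fusion capacity and computational cost.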