Centre for Data Science, School of Computer Science, Queensland University of Technology, 4000, Brisbane, Australia.
Neural Netw. 2024 Nov;179:106553. doi: 10.1016/j.neunet.2024.106553. Epub 2024 Jul 17.
Multi-modal representation learning has received significant attention across diverse research domains because it can model a scenario comprehensively. Learning cross-modal interactions is essential for combining multi-modal data into a joint representation. However, when input features lack useful cross-modal interactions, conventional cross-attention mechanisms can produce noisy, meaningless values, introducing uncertainty into the feature representation. Both factors can degrade the performance of downstream tasks. This paper introduces a novel Pre-gating and Contextual Attention Gate (PCAG) module for multi-modal learning, comprising two gating mechanisms that operate at distinct information-processing levels within the deep learning model. The first gate filters out interactions that are uninformative for the downstream task, while the second gate reduces the uncertainty introduced by the cross-attention module. Experimental results on eight multi-modal classification tasks spanning various domains show that a multi-modal fusion model equipped with PCAG outperforms state-of-the-art multi-modal fusion models. Additionally, we elucidate how PCAG effectively processes cross-modality interactions.
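The abstract does not include code, so the following PyTorch sketch is only an illustration of the two-level gating idea it describes: a pre-gate that suppresses uninformative features before cross-attention, and a contextual gate that decides how much of the (possibly noisy) attention output to keep. The class name `CrossAttentionWithPCAG`, the sigmoid gating layers, and the use of `nn.MultiheadAttention` are all assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CrossAttentionWithPCAG(nn.Module):
    """Hypothetical sketch of cross-attention wrapped by two gates, loosely
    following the abstract's description of PCAG. The paper's actual
    architecture, layer sizes, and gating functions may differ."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gate 1 ("pre-gate", assumed form): scores each query token's
        # usefulness for cross-modal interaction before attention is computed.
        self.pre_gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        # Gate 2 ("contextual attention gate", assumed form): weighs the
        # attention output against the original features, conditioned on both.
        self.ctx_gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor) -> torch.Tensor:
        # x_a: queries from modality A, shape (batch, len_a, dim)
        # x_b: keys/values from modality B, shape (batch, len_b, dim)
        x_a_gated = self.pre_gate(x_a) * x_a  # suppress uninformative tokens
        attn_out, _ = self.cross_attn(x_a_gated, x_b, x_b)
        g = self.ctx_gate(torch.cat([x_a, attn_out], dim=-1))
        # Interpolate between the original features and the (possibly noisy)
        # cross-attention output; g near 0 falls back to the unimodal features.
        return g * attn_out + (1 - g) * x_a
```

A minimal usage example under the same assumptions, fusing text-token and image-patch features of matching width:

```python
pcag = CrossAttentionWithPCAG(dim=256)
text_feats = torch.randn(8, 32, 256)   # e.g. 32 token embeddings per sample
image_feats = torch.randn(8, 49, 256)  # e.g. 49 patch embeddings per sample
fused = pcag(text_feats, image_feats)  # shape (8, 32, 256)
```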