State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, 210023, China; Department of Computer Science and Technology, Nanjing University, Nanjing, 210023, China.
Neural Netw. 2023 May;162:443-455. doi: 10.1016/j.neunet.2023.03.003. Epub 2023 Mar 5.
Most multimodal learning methods assume that all modalities are always available in the data. However, in real-world applications this assumption is often violated due to privacy protection, sensor failure, and other causes. Previous work on incomplete multimodal learning often suffers from one of the following drawbacks: introducing noise, lacking flexibility toward different missing patterns, or failing to capture interactions between modalities. To overcome these challenges, we propose a COntrastive Masked-attention model (COM). The framework performs cross-modal contrastive learning with GAN-based augmentation to reduce the modality gap, and employs a masked-attention model to capture interactions between modalities. The augmentation adapts cross-modal contrastive learning to the incomplete case through a two-player game, improving the effectiveness of the multimodal representations. Interactions between modalities are modeled by stacking self-attention blocks, and attention masks restrict them to the observed modalities to avoid introducing extra noise. All modality combinations share a unified architecture, so the model is flexible with respect to different missing patterns. Extensive experiments on six datasets demonstrate the effectiveness and robustness of the proposed method for incomplete multimodal learning.
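A minimal PyTorch sketch of the masked-attention idea described in the abstract, assuming one embedding token per modality and a boolean availability mask per sample; the module name MaskedModalityAttention, the dimensions, and the exact mask handling are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

# Sketch (not the authors' code): self-attention over modality tokens,
# where an availability mask prevents attention to missing modalities.
class MaskedModalityAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor, observed: torch.Tensor) -> torch.Tensor:
        # tokens:   (batch, num_modalities, dim) modality embeddings
        # observed: (batch, num_modalities) bool, True where the modality exists
        # key_padding_mask expects True at positions to be *ignored*,
        # so missing modalities contribute nothing as keys/values.
        out, _ = self.attn(tokens, tokens, tokens, key_padding_mask=~observed)
        # Zero the updates for missing modalities so they inject no noise.
        out = out * observed.unsqueeze(-1)
        return self.norm(tokens + out)

# Usage: 3 modalities, the second one missing for sample 0. Because every
# modality combination is expressed through the mask, one architecture
# serves all missing patterns; stacking such blocks models interactions.
x = torch.randn(2, 3, 64)
obs = torch.tensor([[True, False, True], [True, True, True]])
block = MaskedModalityAttention(dim=64)
y = block(x, obs)  # (2, 3, 64)
```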