Zhou Dongming, Deng Jinsheng, Pang Zhengbin, Li Wei
School of Computer Science, National University of Defense Technology, Deya Road, Changsha, 410003, Hunan, China.
College of Advanced Interdisciplinary Studies, National University of Defense Technology, Changsha 410073, Hunan, China.
Neural Netw. 2025 Apr;184:107078. doi: 10.1016/j.neunet.2024.107078. Epub 2024 Dec 30.
Vision-language navigation (VLN) is a challenging task that requires agents to capture correlations between different modalities from redundant information according to instructions, and then make sequential decisions over visual scenes and text instructions in the action space. Recent research has focused on extracting visual features and enhancing textual knowledge while ignoring the potential bias in multi-modal data and the problem of spurious correlations between vision and text. This paper therefore studies the relational structure of multi-modal data from a causal perspective and weakens potential spurious correlations between different modalities through cross-modal causal reasoning. We propose a novel vision-language navigation method based on double-gated intervention and confounder causality reasoning (DICCR). First, we decouple the dataset's visual-text factors to construct a confounder causality graph for cross-modal reasoning navigation. On this basis, we learn the causality between vision and text with a posterior probability and use confounder factors to block the interference of spurious association paths with agent decision-making. Then, we propose front-door and back-door causal intervention modules guided by semantic relations to reduce spurious biases in vision and semantics. Building on these, we design a joint local-global causal attention module that aggregates global feature representations through two different gated interventions. Finally, we design a multi-modal feature fusion matching (FFM) algorithm that combines the agent's motion trajectory with multi-modal features to provide additional auxiliary feedback for sequential decision-making. We verified the effectiveness of the model on three benchmark datasets: R2R, REVERIE, and RxR. Experimental results show that DICCR improves the SPL and SR metrics on the R2R dataset by 3.25% and 4.13%, respectively, and achieves state-of-the-art performance compared with the baseline model.
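For context, the back-door and front-door intervention modules referred to above build on the standard adjustment formulas from causal inference; the following is a minimal sketch of those formulas, using generic placeholder variables rather than the paper's exact formulation (X a cause such as the visual or textual input, Y the navigation decision, Z an observed confounder, and M a mediator).

Back-door adjustment, which blocks the confounding path X <- Z -> Y by stratifying over the observed confounder Z:
P(Y \mid do(X)) = \sum_{z} P(Y \mid X, Z = z)\, P(Z = z)

Front-door adjustment, which handles an unobserved confounder by routing the effect of X on Y through a mediator M:
P(Y \mid do(X)) = \sum_{m} P(M = m \mid X) \sum_{x'} P(Y \mid X = x', M = m)\, P(X = x')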