DICCR: Double-gated intervention and confounder causal reasoning for vision-language navigation.

Authors

Zhou Dongming, Deng Jinsheng, Pang Zhengbin, Li Wei

Affiliations

School of Computer Science, National University of Defense Technology, Deya Road, Changsha, 410003, Hunan, China.

College of Advanced Interdisciplinary Studies, National University of Defense Technology, Changsha 410073, Hunan, China.

Publication

Neural Netw. 2025 Apr;184:107078. doi: 10.1016/j.neunet.2024.107078. Epub 2024 Dec 30.

DOI: 10.1016/j.neunet.2024.107078
PMID: 39778296
Abstract

Vision-language navigation (VLN) is a challenging task that requires agents to capture the correlations between different modalities from redundant information according to instructions, and then make sequential decisions over visual scenes and text instructions in the action space. Recent research has focused on extracting visual features and enhancing textual knowledge, ignoring the potential bias in multi-modal data and the problem of spurious correlations between vision and text. Therefore, this paper studies the relational structure among multi-modal data from the perspective of causality and weakens the potential correlations between different modalities through cross-modal causal reasoning. We propose a novel vision-language navigation method based on double-gated intervention and confounder causal reasoning (DICCR). First, we decouple the dataset's visual-text factors to construct a confounder causal graph for cross-modal reasoning navigation. On this basis, we learn the causality between vision and text with a posterior probability and use confounder factors to block the interference of spurious association paths with agent decision-making. Then, we propose front-door and back-door causal intervention modules guided by semantic relations to reduce spurious biases in vision and semantics. On this basis, we design a joint local-global causal attention module that aggregates global feature representations through two different gated interventions. Finally, we design a multi-modal feature fusion matching (FFM) algorithm that combines the agent's motion trajectory with multi-modal features to provide additional auxiliary feedback for continuous decision-making. We verify the model's effectiveness on three benchmark datasets: R2R, REVERIE, and RxR. Experimental results show that DICCR achieves gains of 3.25% and 4.13% in the SPL and SR metrics on the R2R dataset. Compared with baseline models, DICCR achieves state-of-the-art performance.
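The front-door and back-door intervention modules mentioned in the abstract build on the two standard adjustment formulas of causal inference. The abstract does not spell out the paper's exact instantiation, but for reference, the classical forms, with X the cause (e.g., a visual or textual input), Y the outcome (an action decision), Z an observed confounder, and M a mediator, are:

P(Y \mid \mathrm{do}(X)) = \sum_{z} P(Y \mid X, Z=z)\, P(Z=z)   % back-door adjustment
P(Y \mid \mathrm{do}(X)) = \sum_{m} P(M=m \mid X) \sum_{x'} P(Y \mid M=m, X=x')\, P(X=x')   % front-door adjustment

As a rough sketch of how "two different gated interventions" might aggregate local and global features, the following minimal PyTorch module gates each branch with a learned sigmoid gate and sums the results; all names and shapes here are illustrative assumptions, not the paper's implementation:

import torch
import torch.nn as nn

class DoubleGatedAggregation(nn.Module):
    """Hypothetical double-gated fusion of local and global causal features."""

    def __init__(self, dim: int):
        super().__init__()
        # One learned sigmoid gate per intervention branch.
        self.gate_local = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.gate_global = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.proj = nn.Linear(dim, dim)

    def forward(self, local_feat: torch.Tensor, global_feat: torch.Tensor) -> torch.Tensor:
        # local_feat, global_feat: (batch, dim) outputs of the two
        # (assumed) back-door / front-door intervention branches.
        joint = torch.cat([local_feat, global_feat], dim=-1)
        fused = self.gate_local(joint) * local_feat + self.gate_global(joint) * global_feat
        return self.proj(fused)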

Similar Articles

1. DICCR: Double-gated intervention and confounder causal reasoning for vision-language navigation.
Neural Netw. 2025 Apr;184:107078. doi: 10.1016/j.neunet.2024.107078. Epub 2024 Dec 30.

2. Cross-Modal Causal Relational Reasoning for Event-Level Visual Question Answering.
IEEE Trans Pattern Anal Mach Intell. 2023 Oct;45(10):11624-11641. doi: 10.1109/TPAMI.2023.3284038. Epub 2023 Sep 5.

3. Vision-Language Navigation Policy Learning and Adaptation.
IEEE Trans Pattern Anal Mach Intell. 2021 Dec;43(12):4205-4216. doi: 10.1109/TPAMI.2020.2972281. Epub 2021 Nov 3.

4. Multi-grained visual pivot-guided multi-modal neural machine translation with text-aware cross-modal contrastive disentangling.
Neural Netw. 2024 Oct;178:106403. doi: 10.1016/j.neunet.2024.106403. Epub 2024 May 23.

5. Supporting vision-language model few-shot inference with confounder-pruned knowledge prompt.
Neural Netw. 2025 May;185:107173. doi: 10.1016/j.neunet.2025.107173. Epub 2025 Jan 18.

6. Vital information matching in vision-and-language navigation.
Front Neurorobot. 2022 Nov 17;16:1035921. doi: 10.3389/fnbot.2022.1035921. eCollection 2022.

7. An effective spatial relational reasoning networks for visual question answering.
PLoS One. 2022 Nov 28;17(11):e0277693. doi: 10.1371/journal.pone.0277693. eCollection 2022.

8. Correctable Landmark Discovery via Large Models for Vision-Language Navigation.
IEEE Trans Pattern Anal Mach Intell. 2024 Dec;46(12):8534-8548. doi: 10.1109/TPAMI.2024.3407759. Epub 2024 Nov 6.

9. Room-Object Entity Prompting and Reasoning for Embodied Referring Expression.
IEEE Trans Pattern Anal Mach Intell. 2024 Feb;46(2):994-1010. doi: 10.1109/TPAMI.2023.3326851. Epub 2024 Jan 8.

10. What Does a Language-And-Vision Transformer See: The Impact of Semantic Information on Visual Representations.
Front Artif Intell. 2021 Dec 3;4:767971. doi: 10.3389/frai.2021.767971. eCollection 2021.