

Correctable Landmark Discovery via Large Models for Vision-Language Navigation.

Author Information

Lin Bingqian, Nie Yunshuang, Wei Ziming, Zhu Yi, Xu Hang, Ma Shikui, Liu Jianzhuang, Liang Xiaodan

Publication Information

IEEE Trans Pattern Anal Mach Intell. 2024 Dec;46(12):8534-8548. doi: 10.1109/TPAMI.2024.3407759. Epub 2024 Nov 6.

Abstract

Vision-Language Navigation (VLN) requires an agent to follow language instructions to reach a target position. A key factor for successful navigation is aligning the landmarks implied in the instruction with diverse visual observations. However, previous VLN agents fail to perform accurate modality alignment, especially in unexplored scenes, since they learn from limited navigation data and lack sufficient open-world alignment knowledge. In this work, we propose a new VLN paradigm called COrrectable LaNdmark DiScOvery via Large ModEls (CONSOLE). In CONSOLE, we cast VLN as an open-world sequential landmark discovery problem by introducing a novel correctable landmark discovery scheme based on two large models, ChatGPT and CLIP. Specifically, we use ChatGPT to provide rich open-world landmark co-occurrence commonsense, and conduct CLIP-driven landmark discovery based on these commonsense priors. To mitigate the noise in the priors caused by the lack of visual constraints, we introduce a learnable co-occurrence scoring module, which corrects the importance of each co-occurrence according to actual observations for accurate landmark discovery. We further design an observation enhancement strategy that allows our framework to be combined elegantly with different VLN agents, utilizing the corrected landmark features to obtain enhanced observation features for action decision. Extensive experimental results on multiple popular VLN benchmarks (R2R, REVERIE, R4R, RxR) show the significant superiority of CONSOLE over strong baselines. In particular, CONSOLE establishes new state-of-the-art results on R2R and R4R in unseen scenarios.
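To make the described pipeline concrete, below is a minimal sketch (not the authors' released code) of the correctable landmark discovery step: CLIP-style features score panoramic view candidates against ChatGPT-provided co-occurrences, a small learnable scoring module corrects each co-occurrence's importance from the actual observation, and the corrected landmark feature is fused back into the observation features before action decision. The module names, feature dimensions, additive fusion rule, and the random tensors standing in for real CLIP/ChatGPT outputs are all illustrative assumptions.

```python
# A minimal sketch of correctable landmark discovery, under assumed shapes
# and a simple additive fusion; real text/image features would come from
# CLIP and the co-occurrence lists from ChatGPT prompts.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CooccurrenceScorer(nn.Module):
    """Learnable module that corrects the importance of each co-occurrence
    according to the current visual observation (hypothetical design)."""
    def __init__(self, dim: int):
        super().__init__()
        self.score_head = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1)
        )

    def forward(self, cooc_feats: torch.Tensor, obs_feat: torch.Tensor) -> torch.Tensor:
        # cooc_feats: (num_cooc, dim) text features of co-occurrences
        # obs_feat:   (dim,)          image feature summarizing the observation
        obs = obs_feat.expand_as(cooc_feats)
        logits = self.score_head(torch.cat([cooc_feats, obs], dim=-1)).squeeze(-1)
        return logits.softmax(dim=-1)  # corrected importance per co-occurrence

def discover_landmark(cooc_feats, view_feats, scorer):
    """Score candidate views against co-occurrences, weighted by the
    observation-corrected importance, and build a corrected landmark feature."""
    view_feats = F.normalize(view_feats, dim=-1)   # (num_views, dim)
    cooc_feats = F.normalize(cooc_feats, dim=-1)   # (num_cooc, dim)
    pano_feat = view_feats.mean(dim=0)             # coarse scene summary
    weights = scorer(cooc_feats, pano_feat)        # (num_cooc,)
    sim = view_feats @ cooc_feats.t()              # CLIP-style matching scores
    view_scores = sim @ weights                    # (num_views,) weighted scores
    landmark_feat = weights @ cooc_feats           # corrected landmark feature
    return view_scores, landmark_feat

# Toy run with random stand-ins for CLIP features.
dim, num_cooc, num_views = 512, 5, 36
scorer = CooccurrenceScorer(dim)
view_scores, landmark_feat = discover_landmark(
    torch.randn(num_cooc, dim), torch.randn(num_views, dim), scorer
)
# Observation enhancement: fuse the corrected landmark feature into each
# view feature before action decision (additive fusion assumed here).
obs_feats = torch.randn(num_views, dim)
enhanced_obs = obs_feats + view_scores.unsqueeze(-1) * landmark_feat
```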

