


Correctable Landmark Discovery via Large Models for Vision-Language Navigation

Authors

Lin Bingqian, Nie Yunshuang, Wei Ziming, Zhu Yi, Xu Hang, Ma Shikui, Liu Jianzhuang, Liang Xiaodan

Publication

IEEE Trans Pattern Anal Mach Intell. 2024 Dec;46(12):8534-8548. doi: 10.1109/TPAMI.2024.3407759. Epub 2024 Nov 6.

DOI: 10.1109/TPAMI.2024.3407759
PMID: 38819971
Abstract

Vision-Language Navigation (VLN) requires the agent to follow language instructions to reach a target position. A key factor for successful navigation is to align the landmarks implied in the instruction with diverse visual observations. However, previous VLN agents fail to perform accurate modality alignment especially in unexplored scenes, since they learn from limited navigation data and lack sufficient open-world alignment knowledge. In this work, we propose a new VLN paradigm, called COrrectable LaNdmark DiScOvery via Large ModEls (CONSOLE). In CONSOLE, we cast VLN as an open-world sequential landmark discovery problem, by introducing a novel correctable landmark discovery scheme based on two large models ChatGPT and CLIP. Specifically, we use ChatGPT to provide rich open-world landmark cooccurrence commonsense, and conduct CLIP-driven landmark discovery based on these commonsense priors. To mitigate the noise in the priors due to the lack of visual constraints, we introduce a learnable cooccurrence scoring module, which corrects the importance of each cooccurrence according to actual observations for accurate landmark discovery. We further design an observation enhancement strategy for an elegant combination of our framework with different VLN agents, where we utilize the corrected landmark features to obtain enhanced observation features for action decision. Extensive experimental results on multiple popular VLN benchmarks (R2R, REVERIE, R4R, RxR) show the significant superiority of CONSOLE over strong baselines. Especially, our CONSOLE establishes the new state-of-the-art results on R2R and R4R in unseen scenarios.
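The correction mechanism described in the abstract — scoring each candidate landmark against the current observation, then adjusting LLM-provided co-occurrence priors by a learned importance weight — can be sketched as follows. This is an illustrative toy, not the paper's implementation: the 3-d vectors stand in for CLIP text/image embeddings, and the fixed `cooccur_weights` stand in for the output of the learnable cooccurrence scoring module.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity, the usual CLIP-style matching score."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def discover_landmark(observation, landmark_feats, cooccur_feats, cooccur_weights):
    """Score candidate landmarks against the current observation.

    Hypothetical sketch of correctable landmark discovery:
    - landmark_feats:   {landmark: text embedding of the landmark itself}
    - cooccur_feats:    {landmark: [embeddings of co-occurring objects
                          suggested by an LLM prior, e.g. ChatGPT]}
    - cooccur_weights:  {landmark: [importance per co-occurrence]} -- in the
      paper this correction is learned from actual observations; here the
      weights are fixed constants for illustration.
    """
    scores = {}
    for name, feat in landmark_feats.items():
        base = cosine(observation, feat)            # direct landmark match
        prior = sum(w * cosine(observation, c)      # corrected co-occurrence evidence
                    for c, w in zip(cooccur_feats[name], cooccur_weights[name]))
        scores[name] = base + prior
    return max(scores, key=scores.get), scores

# Toy 3-d "embeddings" standing in for CLIP features.
obs = np.array([1.0, 0.1, 0.0])                     # current visual observation
landmarks = {"sofa": np.array([0.9, 0.2, 0.0]),
             "oven": np.array([0.0, 0.1, 1.0])}
cooccur = {"sofa": [np.array([0.8, 0.0, 0.1])],     # e.g. "coffee table"
           "oven": [np.array([0.1, 0.0, 0.9])]}     # e.g. "stove"
weights = {"sofa": [0.5], "oven": [0.5]}

best, scores = discover_landmark(obs, landmarks, cooccur, weights)
print(best)  # → sofa
```

The key design point mirrored here is that co-occurrence evidence is not trusted as-is: each prior's contribution is re-weighted before it influences the landmark score, which is what lets the framework suppress LLM priors that conflict with what the agent actually sees.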


Similar Articles

1
Correctable Landmark Discovery via Large Models for Vision-Language Navigation.
IEEE Trans Pattern Anal Mach Intell. 2024 Dec;46(12):8534-8548. doi: 10.1109/TPAMI.2024.3407759. Epub 2024 Nov 6.
2
Towards Deviation-Robust Agent Navigation via Perturbation-Aware Contrastive Learning.
IEEE Trans Pattern Anal Mach Intell. 2023 Oct;45(10):12535-12549. doi: 10.1109/TPAMI.2023.3273594. Epub 2023 Sep 5.
3
Outdoor Vision-and-Language Navigation Needs Object-Level Alignment.
Sensors (Basel). 2023 Jun 29;23(13):6028. doi: 10.3390/s23136028.
4
HOP+: History-Enhanced and Order-Aware Pre-Training for Vision-and-Language Navigation.
IEEE Trans Pattern Anal Mach Intell. 2023 Jul;45(7):8524-8537. doi: 10.1109/TPAMI.2023.3234243. Epub 2023 Jun 5.
5
Self-Supervised 3-D Semantic Representation Learning for Vision-and-Language Navigation.
IEEE Trans Neural Netw Learn Syst. 2025 Apr;36(4):6738-6751. doi: 10.1109/TNNLS.2024.3395633. Epub 2025 Apr 4.
6
Vision-Language Navigation With Beam-Constrained Global Normalization.
IEEE Trans Neural Netw Learn Syst. 2024 Jan;35(1):1352-1363. doi: 10.1109/TNNLS.2022.3183287. Epub 2024 Jan 4.
7
Adversarial Reinforced Instruction Attacker for Robust Vision-Language Navigation.
IEEE Trans Pattern Anal Mach Intell. 2022 Oct;44(10):7175-7189. doi: 10.1109/TPAMI.2021.3097435. Epub 2022 Sep 14.
8
Room-Object Entity Prompting and Reasoning for Embodied Referring Expression.
IEEE Trans Pattern Anal Mach Intell. 2024 Feb;46(2):994-1010. doi: 10.1109/TPAMI.2023.3326851. Epub 2024 Jan 8.
9
Visual Perception Generalization for Vision-and-Language Navigation via Meta-Learning.
IEEE Trans Neural Netw Learn Syst. 2023 Aug;34(8):5193-5199. doi: 10.1109/TNNLS.2021.3122579. Epub 2023 Aug 4.
10
ETPNav: Evolving Topological Planning for Vision-Language Navigation in Continuous Environments.
IEEE Trans Pattern Anal Mach Intell. 2024 Apr 9;PP. doi: 10.1109/TPAMI.2024.3386695.