

Correctable Landmark Discovery via Large Models for Vision-Language Navigation.

Author Information

Lin Bingqian, Nie Yunshuang, Wei Ziming, Zhu Yi, Xu Hang, Ma Shikui, Liu Jianzhuang, Liang Xiaodan

Publication Information

IEEE Trans Pattern Anal Mach Intell. 2024 Dec;46(12):8534-8548. doi: 10.1109/TPAMI.2024.3407759. Epub 2024 Nov 6.

Abstract

Vision-Language Navigation (VLN) requires an agent to follow language instructions to reach a target position. A key factor for successful navigation is aligning the landmarks implied in the instruction with diverse visual observations. However, previous VLN agents fail to perform accurate modality alignment, especially in unexplored scenes, since they learn from limited navigation data and lack sufficient open-world alignment knowledge. In this work, we propose a new VLN paradigm called COrrectable LaNdmark DiScOvery via Large ModEls (CONSOLE). In CONSOLE, we cast VLN as an open-world sequential landmark discovery problem by introducing a novel correctable landmark discovery scheme based on two large models, ChatGPT and CLIP. Specifically, we use ChatGPT to provide rich open-world landmark co-occurrence commonsense, and conduct CLIP-driven landmark discovery based on these commonsense priors. To mitigate the noise in the priors caused by the lack of visual constraints, we introduce a learnable co-occurrence scoring module, which corrects the importance of each co-occurrence according to actual observations for accurate landmark discovery. We further design an observation enhancement strategy that allows our framework to be combined elegantly with different VLN agents, utilizing the corrected landmark features to obtain enhanced observation features for action decision. Extensive experimental results on multiple popular VLN benchmarks (R2R, REVERIE, R4R, RxR) show the significant superiority of CONSOLE over strong baselines. In particular, CONSOLE establishes new state-of-the-art results on R2R and R4R in unseen scenarios.
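To make the described pipeline concrete, below is a minimal sketch (not the authors' released code) of the correctable landmark discovery step: CLIP-style features score panoramic view candidates against ChatGPT-provided co-occurrences, a small learnable scoring module corrects each co-occurrence's importance from the actual observation, and the corrected landmark feature is fused back into the observation features before action decision. The module names, feature dimensions, additive fusion rule, and the random tensors standing in for real CLIP/ChatGPT outputs are all illustrative assumptions.

```python
# A minimal sketch of correctable landmark discovery, under assumed shapes
# and a simple additive fusion; real text/image features would come from
# CLIP and the co-occurrence lists from ChatGPT prompts.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CooccurrenceScorer(nn.Module):
    """Learnable module that corrects the importance of each co-occurrence
    according to the current visual observation (hypothetical design)."""
    def __init__(self, dim: int):
        super().__init__()
        self.score_head = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1)
        )

    def forward(self, cooc_feats: torch.Tensor, obs_feat: torch.Tensor) -> torch.Tensor:
        # cooc_feats: (num_cooc, dim) text features of co-occurrences
        # obs_feat:   (dim,)          image feature summarizing the observation
        obs = obs_feat.expand_as(cooc_feats)
        logits = self.score_head(torch.cat([cooc_feats, obs], dim=-1)).squeeze(-1)
        return logits.softmax(dim=-1)  # corrected importance per co-occurrence

def discover_landmark(cooc_feats, view_feats, scorer):
    """Score candidate views against co-occurrences, weighted by the
    observation-corrected importance, and build a corrected landmark feature."""
    view_feats = F.normalize(view_feats, dim=-1)   # (num_views, dim)
    cooc_feats = F.normalize(cooc_feats, dim=-1)   # (num_cooc, dim)
    pano_feat = view_feats.mean(dim=0)             # coarse scene summary
    weights = scorer(cooc_feats, pano_feat)        # (num_cooc,)
    sim = view_feats @ cooc_feats.t()              # CLIP-style matching scores
    view_scores = sim @ weights                    # (num_views,) weighted scores
    landmark_feat = weights @ cooc_feats           # corrected landmark feature
    return view_scores, landmark_feat

# Toy run with random stand-ins for CLIP features.
dim, num_cooc, num_views = 512, 5, 36
scorer = CooccurrenceScorer(dim)
view_scores, landmark_feat = discover_landmark(
    torch.randn(num_cooc, dim), torch.randn(num_views, dim), scorer
)
# Observation enhancement: fuse the corrected landmark feature into each
# view feature before action decision (additive fusion assumed here).
obs_feats = torch.randn(num_views, dim)
enhanced_obs = obs_feats + view_scores.unsqueeze(-1) * landmark_feat
```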

