Lin Bingqian, Nie Yunshuang, Wei Ziming, Zhu Yi, Xu Hang, Ma Shikui, Liu Jianzhuang, Liang Xiaodan
IEEE Trans Pattern Anal Mach Intell. 2024 Dec;46(12):8534-8548. doi: 10.1109/TPAMI.2024.3407759. Epub 2024 Nov 6.
Vision-Language Navigation (VLN) requires an agent to follow language instructions to reach a target position. A key factor for successful navigation is aligning the landmarks implied in the instruction with diverse visual observations. However, previous VLN agents fail to perform accurate modality alignment, especially in unexplored scenes, since they learn from limited navigation data and lack sufficient open-world alignment knowledge. In this work, we propose a new VLN paradigm called COrrectable LaNdmark DiScOvery via Large ModEls (CONSOLE). In CONSOLE, we cast VLN as an open-world sequential landmark discovery problem by introducing a novel correctable landmark discovery scheme based on two large models, ChatGPT and CLIP. Specifically, we use ChatGPT to provide rich open-world landmark co-occurrence commonsense, and we conduct CLIP-driven landmark discovery based on these commonsense priors. To mitigate the noise in the priors caused by the lack of visual constraints, we introduce a learnable co-occurrence scoring module, which corrects the importance of each co-occurrence according to the actual observations for accurate landmark discovery. We further design an observation enhancement strategy that combines our framework elegantly with different VLN agents, using the corrected landmark features to obtain enhanced observation features for action decision. Extensive experimental results on multiple popular VLN benchmarks (R2R, REVERIE, R4R, RxR) show the significant superiority of CONSOLE over strong baselines. Notably, CONSOLE establishes new state-of-the-art results on R2R and R4R in unseen scenarios.
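The abstract describes three mechanisms: CLIP-driven landmark discovery over ChatGPT co-occurrence priors, a learnable scoring module that corrects each prior's weight against the actual observation, and observation enhancement from the corrected landmark feature. The PyTorch snippet below is a minimal illustrative sketch of that pipeline, not the authors' implementation: it assumes CLIP text features for the co-occurrence phrases and CLIP image features for the panoramic views are precomputed, and the module name, dimensions, and exact enhancement rule are all hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CooccurrenceScorer(nn.Module):
    """Hypothetical learnable co-occurrence scoring module (illustrative
    names/shapes): reweights ChatGPT co-occurrence priors using the current
    observation, then enhances view features with the corrected landmark."""
    def __init__(self, dim=512):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)  # scores each (prior, scene) pair
        self.proj = nn.Linear(dim, dim)     # projects landmark into view space

    def forward(self, cooc_feats, obs_feats):
        # cooc_feats: (K, D) CLIP text features of K co-occurrence phrases
        # obs_feats:  (V, D) CLIP image features of V panoramic views
        obs_ctx = obs_feats.mean(dim=0, keepdim=True)            # (1, D) scene context
        pair = torch.cat([cooc_feats,
                          obs_ctx.expand_as(cooc_feats)], dim=-1)  # (K, 2D)
        w = F.softmax(self.score(pair).squeeze(-1), dim=0)       # corrected weights (K,)
        landmark = (w.unsqueeze(-1) * cooc_feats).sum(dim=0)     # corrected landmark (D,)
        # Observation enhancement: bias each view toward the corrected landmark,
        # scaled by its CLIP-style similarity to the landmark feature.
        sim = obs_feats @ landmark / landmark.norm()             # (V,)
        enhanced = obs_feats + sim.unsqueeze(-1) * self.proj(landmark)
        return enhanced, w

# Usage with random stand-ins for precomputed CLIP features
scorer = CooccurrenceScorer(dim=512)
cooc = F.normalize(torch.randn(5, 512), dim=-1)    # e.g. priors for "living room": sofa, lamp, ...
views = F.normalize(torch.randn(36, 512), dim=-1)  # 36 panoramic view features
enhanced, weights = scorer(cooc, views)
print(enhanced.shape, weights)
```

The corrected weights play the role the abstract assigns to the scoring module: priors that ChatGPT suggests but that conflict with the actual scene receive low weight, so the landmark feature passed to the agent's action decision reflects what is visually present.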