Vision-Language Navigation With Beam-Constrained Global Normalization.

Author information

Xie Liang, Zhang Meishan, Li You, Qin Wei, Yan Ye, Yin Erwei

Publication information

IEEE Trans Neural Netw Learn Syst. 2024 Jan;35(1):1352-1363. doi: 10.1109/TNNLS.2022.3183287. Epub 2024 Jan 4.

DOI:10.1109/TNNLS.2022.3183287

Abstract

Vision-language navigation (VLN) is a challenging task in which an agent must navigate a realistic environment by following natural language instructions. Sequence-to-sequence modeling is one of the most promising architectures for the task: it reaches the navigation goal through a sequence of movement actions, and this line of work has led to state-of-the-art performance. Recently, several studies have shown that beam-search decoding during inference can yield promising performance, as it ranks multiple candidate trajectories by scoring each trajectory as a whole. However, the trajectory-level score can be seriously biased during ranking. The score is a simple average of the individual unit scores of the target-sequence actions, and these unit scores may not be comparable across different trajectories because they are calculated by a local discriminant classifier. To address this problem, we propose a global normalization strategy that rescales the scores at the trajectory level. Concretely, we present two global score functions to rerank all candidates in the output beam, resulting in more comparable trajectory scores. In this way, the bias problem can be greatly alleviated. We conduct experiments on the benchmark room-to-room (R2R) dataset of VLN to verify our method, and the results show that the proposed global method is effective, providing significant performance gains over the corresponding baselines. Our final model achieves competitive performance on the VLN leaderboard.
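
The contrast between locally averaged scores and beam-level normalization can be illustrated with a small sketch. This is not the paper's method: the abstract does not specify its two global score functions, so the scorer below (mean unnormalized logit of the chosen actions, followed by a softmax restricted to the beam) is an assumption chosen only for illustration. The data layout and all function names are likewise hypothetical.

```python
import math

# Hypothetical data layout: a trajectory is a list of steps, and each step is
# (logit of the chosen action, logits of all actions available at that step).

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def local_average_score(trajectory):
    """Baseline: mean per-step log-probability from a locally normalized classifier."""
    log_probs = [chosen - logsumexp(options) for chosen, options in trajectory]
    return sum(log_probs) / len(log_probs)

def raw_trajectory_score(trajectory):
    """Assumed global scorer: mean unnormalized logit of the chosen actions."""
    return sum(chosen for chosen, _ in trajectory) / len(trajectory)

def beam_normalize(raw_scores):
    """Softmax restricted to the candidates currently in the beam."""
    log_z = logsumexp(raw_scores)
    return [math.exp(s - log_z) for s in raw_scores]

def rerank(beam):
    """Return (beam index, probability) pairs sorted by beam-normalized score."""
    probs = beam_normalize([raw_trajectory_score(t) for t in beam])
    return sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)

# Trajectory A: high logits, but several near-tied alternatives at each step.
traj_a = [(2.0, [2.0, 1.9, 1.8]), (2.0, [2.0, 1.9, 1.8])]
# Trajectory B: lower logits, but only one weak alternative at each step.
traj_b = [(0.5, [0.5, -2.0]), (0.5, [0.5, -2.0])]
beam = [traj_a, traj_b]

local_scores = [local_average_score(t) for t in beam]
reranked = rerank(beam)
print(local_scores)
print(reranked)
```

In this toy beam, the local average prefers trajectory B because each of its steps had little competition, while the beam-level normalization compares the two trajectories directly against each other and prefers A. The example only demonstrates that the two rankings can disagree; which ranking is better in practice is an empirical question that the paper addresses on R2R.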

Similar articles

Towards Deviation-Robust Agent Navigation via Perturbation-Aware Contrastive Learning.

IEEE Trans Pattern Anal Mach Intell. 2023 Oct;45(10):12535-12549. doi: 10.1109/TPAMI.2023.3273594. Epub 2023 Sep 5.

Joint Multimodal Embedding and Backtracking Search in Vision-and-Language Navigation.

Sensors (Basel). 2021 Feb 2;21(3):1012. doi: 10.3390/s21031012.

Outdoor Vision-and-Language Navigation Needs Object-Level Alignment.

Sensors (Basel). 2023 Jun 29;23(13):6028. doi: 10.3390/s23136028.

Correctable Landmark Discovery via Large Models for Vision-Language Navigation.

IEEE Trans Pattern Anal Mach Intell. 2024 Dec;46(12):8534-8548. doi: 10.1109/TPAMI.2024.3407759. Epub 2024 Nov 6.

HOP+: History-Enhanced and Order-Aware Pre-Training for Vision-and-Language Navigation.

IEEE Trans Pattern Anal Mach Intell. 2023 Jul;45(7):8524-8537. doi: 10.1109/TPAMI.2023.3234243. Epub 2023 Jun 5.

ETPNav: Evolving Topological Planning for Vision-Language Navigation in Continuous Environments.

IEEE Trans Pattern Anal Mach Intell. 2024 Apr 9;PP. doi: 10.1109/TPAMI.2024.3386695.

Vision-Language Navigation Policy Learning and Adaptation.

IEEE Trans Pattern Anal Mach Intell. 2021 Dec;43(12):4205-4216. doi: 10.1109/TPAMI.2020.2972281. Epub 2021 Nov 3.

Discovering Intrinsic Subgoals for Vision-and-Language Navigation via Hierarchical Reinforcement Learning.

IEEE Trans Neural Netw Learn Syst. 2025 Apr;36(4):6516-6528. doi: 10.1109/TNNLS.2024.3398300. Epub 2025 Apr 4.

Visual Perception Generalization for Vision-and-Language Navigation via Meta-Learning.

IEEE Trans Neural Netw Learn Syst. 2023 Aug;34(8):5193-5199. doi: 10.1109/TNNLS.2021.3122579. Epub 2023 Aug 4.
