Lin Bingqian, Long Yanxin, Zhu Yi, Zhu Fengda, Liang Xiaodan, Ye Qixiang, Lin Liang
IEEE Trans Pattern Anal Mach Intell. 2023 Oct;45(10):12535-12549. doi: 10.1109/TPAMI.2023.3273594. Epub 2023 Sep 5.
Vision-and-language navigation (VLN) asks an agent to follow a given language instruction to navigate through a real 3D environment. Despite significant advances, conventional VLN agents are typically trained in disturbance-free environments and may easily fail in real-world navigation scenarios, because they have not learned to handle disturbances such as sudden obstacles or human interruptions, which are common and often cause unexpected route deviations. In this paper, we present a model-agnostic training paradigm, called Progressive Perturbation-aware Contrastive Learning (PROPER), to enhance the generalization ability of existing VLN agents to the real world by requiring them to learn deviation-robust navigation. Specifically, a simple yet effective path perturbation scheme is introduced to implement the route deviation, under which the agent is still required to navigate successfully following the original instruction. Since directly forcing the agent to learn perturbed trajectories may lead to insufficient and inefficient training, a progressively perturbed trajectory augmentation strategy is designed, in which the agent self-adaptively learns to navigate under perturbation as its navigation performance on each specific trajectory improves. To encourage the agent to capture the difference introduced by perturbation and to adapt to both perturbation-free and perturbation-based environments, a perturbation-aware contrastive learning mechanism is further developed by contrasting perturbation-free trajectory encodings with their perturbation-based counterparts. Extensive experiments on the standard Room-to-Room (R2R) benchmark show that PROPER can benefit multiple state-of-the-art VLN baselines in perturbation-free scenarios. We further collect perturbed path data to construct an introspection subset based on R2R, called Path-Perturbed R2R (PP-R2R). The results on PP-R2R show the unsatisfactory robustness of popular VLN agents and the capability of PROPER to improve navigation robustness under deviation.
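The two core ideas above, path perturbation and contrastive learning over trajectory encodings, can be illustrated with a minimal sketch. It is not the authors' implementation, since the abstract does not specify the perturbation rule, the encoder, or the loss. The sketch assumes a navigation graph stored as an adjacency dictionary, a one-step "leave the route and return" detour as the perturbation, and a generic InfoNCE-style loss over plain vectors. The names `perturb_path`, `cosine`, and `info_nce` are hypothetical, and the choice of which encodings act as positives and negatives is left to the caller.

```python
import math
import random


def perturb_path(graph, path, rng):
    """Insert a one-step detour: leave the route at a random viewpoint, then return.

    Assumed scheme for illustration only.
    graph: dict mapping each viewpoint to a list of adjacent viewpoints (undirected).
    path:  list of viewpoints forming the original, perturbation-free route.
    """
    on_route = set(path)
    # Candidate positions: any viewpoint except the goal that has an off-route neighbor.
    candidates = [
        i for i in range(len(path) - 1)
        if any(n not in on_route for n in graph[path[i]])
    ]
    if not candidates:
        return list(path)  # nothing to perturb; return an unchanged copy
    i = rng.choice(candidates)
    detour = rng.choice([n for n in graph[path[i]] if n not in on_route])
    # Step off the route to `detour`, step back, then continue along the original route.
    return path[: i + 1] + [detour, path[i]] + path[i + 1:]


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss: small when the anchor is closer to the positive
    than to any of the negatives."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# Toy graph: route A -> B -> C, with off-route viewpoints X and Y.
graph = {"A": ["B", "X"], "B": ["A", "C", "Y"], "C": ["B"], "X": ["A"], "Y": ["B"]}
print(perturb_path(graph, ["A", "B", "C"], random.Random(0)))
print(info_nce([1.0, 0.0], [0.9, 0.1], [[0.0, 1.0]]))
```

In this sketch the perturbed route keeps the original start and goal, so the original instruction still describes where the agent must end up. A progressive schedule, as described in the abstract, would apply such perturbations to a trajectory only once the agent handles that trajectory well; how that readiness is measured is not stated in the abstract and is left out here.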