IEEE Trans Pattern Anal Mach Intell. 2021 Dec;43(12):4205-4216. doi: 10.1109/TPAMI.2020.2972281. Epub 2021 Nov 3.
Vision-language navigation (VLN) is the task of navigating an embodied agent to carry out natural language instructions inside real 3D environments. In this paper, we study how to address three critical challenges for this task: cross-modal grounding, ill-posed feedback, and generalization. First, we propose a novel Reinforced Cross-Modal Matching (RCM) approach that enforces cross-modal grounding both locally and globally via reinforcement learning (RL). Specifically, a matching critic provides an intrinsic reward that encourages global matching between instructions and trajectories, and a reasoning navigator performs cross-modal grounding in the local visual scene. Evaluation on a VLN benchmark dataset shows that our RCM model significantly outperforms baseline methods by 10 percent on Success weighted by Path Length (SPL) and achieves state-of-the-art performance. To improve the generalizability of the learned policy, we further introduce a Self-Supervised Imitation Learning (SIL) method in which the agent explores unseen environments and adapts to them by imitating its own past good decisions. We demonstrate that SIL approximates a better and more efficient policy, substantially reducing the success-rate gap between seen and unseen environments (from 30.7 to 11.7 percent).
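The matching critic described above rewards trajectories that globally match the instruction. Below is a minimal sketch, assuming PyTorch, of one way such a critic could be realized: a trajectory encoder conditions an instruction decoder, and the mean log-likelihood of reconstructing the instruction from the executed trajectory serves as the intrinsic reward. All module names (MatchingCritic, intrinsic_reward), dimensions, and the reconstruction formulation are illustrative assumptions, not the authors' released code.

```python
# Sketch of a matching critic: score how well an executed trajectory
# "explains" the instruction, and use that score as an intrinsic reward.
import torch
import torch.nn as nn

class MatchingCritic(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, feat_dim=2048):
        super().__init__()
        # Encode the trajectory (a sequence of visual/action features).
        self.traj_encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        # Decode the instruction tokens conditioned on the trajectory encoding.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def intrinsic_reward(self, traj_feats, instr_tokens):
        """Mean log-likelihood of reconstructing the instruction from the
        trajectory, used as the global-matching intrinsic reward."""
        _, (h, c) = self.traj_encoder(traj_feats)       # summarize trajectory
        emb = self.embed(instr_tokens[:, :-1])          # teacher-forced inputs
        dec_out, _ = self.decoder(emb, (h, c))
        logits = self.out(dec_out)                      # (B, T-1, vocab)
        logp = torch.log_softmax(logits, dim=-1)
        tgt = instr_tokens[:, 1:]                       # next-token targets
        tok_logp = logp.gather(-1, tgt.unsqueeze(-1)).squeeze(-1)
        return tok_logp.mean(dim=1)                     # one reward per episode

# Hypothetical usage: 2 episodes, 8 trajectory steps, 10-token instructions.
critic = MatchingCritic(vocab_size=1000)
traj = torch.randn(2, 8, 2048)
instr = torch.randint(0, 1000, (2, 10))
r_intr = critic.intrinsic_reward(traj, instr)
```

During policy learning, such an intrinsic reward would typically be combined with the extrinsic navigation reward when updating the navigator.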
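SIL, as summarized above, lets the agent adapt to unseen environments without ground-truth paths by imitating its own best past decisions. The following is a hedged sketch of that loop, again in PyTorch: sample several trajectories for an instruction, keep the one the matching critic scores highest, and behavior-clone it. The `rollout` helper, the `Trajectory` container, and the navigator/critic interfaces are hypothetical stand-ins, not the paper's implementation.

```python
from dataclasses import dataclass
import torch
import torch.nn.functional as F

@dataclass
class Trajectory:
    states: list              # per-step inputs the navigator saw (hypothetical)
    actions: list             # per-step action indices it took
    features: torch.Tensor    # (1, T, feat_dim) features for the matching critic

def sil_update(navigator, critic, env, instr_tokens, optimizer, n_samples=4):
    # 1) Explore: sample candidate trajectories for the same instruction.
    #    `rollout` (assumed, not shown) runs the navigator in `env` and
    #    returns a Trajectory.
    with torch.no_grad():
        candidates = [rollout(navigator, env, instr_tokens)
                      for _ in range(n_samples)]
        # 2) Self-evaluate: the matching critic scores each trajectory globally.
        scores = [critic.intrinsic_reward(t.features, instr_tokens)
                  for t in candidates]
    best = candidates[max(range(n_samples), key=lambda i: scores[i].item())]
    # 3) Imitate: behavior-clone the agent's own best trajectory.
    loss = torch.zeros(())
    for state, action in zip(best.states, best.actions):
        logits = navigator(state, instr_tokens)   # (1, n_actions), with grad
        loss = loss + F.cross_entropy(logits, torch.tensor([action]))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the replayed trajectories come from the agent itself, this loop needs no human-annotated paths, which is what allows it to run in unseen environments.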