Gao Chen, Liu Si, Chen Jinyu, Wang Luting, Wu Qi, Li Bo, Tian Qi
IEEE Trans Pattern Anal Mach Intell. 2024 Feb;46(2):994-1010. doi: 10.1109/TPAMI.2023.3326851. Epub 2024 Jan 8.
Given a high-level instruction, the task of Embodied Referring Expression (REVERIE) requires an embodied agent to localise a remote referred object by navigating in an unseen environment. Previous vision-and-language navigation methods utilise the provided fine-grained instruction as step-by-step guidance for strict instruction-following, whereas REVERIE aims at efficient goal-oriented exploration driven by a high-level command. In this work, we propose a Cross-modal Knowledge Reasoning (CKR+) framework, which incorporates prior knowledge as decision guidance to learn the navigation scheme comprehensively. Specifically, we design a Room-Object Aware (ROA) mechanism to explicitly decouple room- and object-related clues from the instruction and visual observations. Moreover, we propose a Knowledge-enabled Entity Relation Reasoning (KERR+) module that leverages structured knowledge from a knowledge graph explicitly and unstructured knowledge from a pre-trained model implicitly, learning the internal-external correlations among room and object entities so that the agent can make proper decisions. We devise an Entity Prompter (EP), embedded in the KERR+ module, which utilises the navigation history and visual entities as prompts to transfer knowledge from the pre-trained CLIP model. In addition, we develop a Reinforced End Decider (RED) to learn the stopping scheme specifically, realised by a customised reinforcement-learning strategy and knowledge-enhanced matching. Two further techniques are introduced to improve navigation efficiency. Extensive experiments on the REVERIE benchmark demonstrate the effectiveness and superiority of the proposed methods, which boost the key metrics, i.e., SPL and REVERIE success rate, to 14.46% and 13.81%, respectively.
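The abstract does not specify how the Entity Prompter queries CLIP, but the general pattern it describes — composing prompts from room and object entities and ranking candidate views by text-image similarity — can be sketched as follows. The prompt template, function names, and the toy stand-in embeddings below are all illustrative assumptions, not the paper's actual implementation (which would use real CLIP text and image encoders).

```python
import numpy as np

def build_entity_prompts(room, objects):
    # Compose natural-language prompts from room/object entities; a
    # hypothetical template, not necessarily the one used by CKR+.
    return [f"a photo of a {obj} in the {room}" for obj in objects]

def cosine_similarity(a, b):
    # Standard cosine similarity between two embedding vectors.
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(np.dot(a, b))

def rank_views(text_emb, view_embs):
    # Rank candidate viewpoints by text-image similarity, mimicking how
    # CLIP-derived knowledge could steer goal-oriented exploration.
    return sorted(range(len(view_embs)),
                  key=lambda i: cosine_similarity(text_emb, view_embs[i]),
                  reverse=True)

# Toy stand-in embeddings; real CLIP encoders would produce these.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=512)
view_embs = [rng.normal(size=512) for _ in range(4)]
ranking = rank_views(text_emb, view_embs)
```

In this sketch, the best-ranked view would be the agent's next navigation candidate; the actual KERR+ module fuses such similarity cues with knowledge-graph reasoning rather than using them alone.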