Gao Chen, Liu Si, Chen Jinyu, Wang Luting, Wu Qi, Li Bo, Tian Qi
IEEE Trans Pattern Anal Mach Intell. 2024 Feb;46(2):994-1010. doi: 10.1109/TPAMI.2023.3326851. Epub 2024 Jan 8.
Given a high-level instruction, the task of Embodied Referring Expression (REVERIE) requires an embodied agent to localise a remote referred object by navigating in an unseen environment. Previous vision-and-language navigation methods utilise the provided fine-grained instruction as step-by-step guidance for strict instruction-following, whereas REVERIE aims at efficient goal-oriented exploration driven by a high-level command. In this work, we propose a Cross-modal Knowledge Reasoning (CKR+) framework, which incorporates prior knowledge as decision guidance to learn the navigation scheme comprehensively. Specifically, we design a Room-Object Aware (ROA) mechanism to explicitly decouple room- and object-related clues from the instruction and visual observations. Moreover, we propose a Knowledge-enabled Entity Relation Reasoning (KERR+) module that leverages structured knowledge from a knowledge graph explicitly and unstructured knowledge from a pre-trained model implicitly, learning the internal-external correlations among room and object entities so that the agent can make proper decisions. We devise an Entity Prompter (EP), embedded in the KERR+ module, which utilises the navigation history and visual entities as prompts to transfer knowledge from the pre-trained CLIP model. In addition, we develop a Reinforced End Decider (RED) to learn the stopping scheme specifically, realised by a customised reinforcement-learning strategy and knowledge-enhanced matching. Two further techniques are introduced to improve navigation efficiency. Extensive experiments on the REVERIE benchmark demonstrate the effectiveness and superiority of the proposed methods, which boost the key metrics, i.e., SPL and REVERIE success rate, to 14.46% and 13.81%, respectively.
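The abstract does not specify how the Entity Prompter queries CLIP, but the general pattern it describes — composing prompts from room and object entities and ranking candidate views by text-image similarity — can be sketched as follows. The prompt template, function names, and the toy stand-in embeddings below are all illustrative assumptions, not the paper's actual implementation (which would use real CLIP text and image encoders).

```python
import numpy as np

def build_entity_prompts(room, objects):
    # Compose natural-language prompts from room/object entities; a
    # hypothetical template, not necessarily the one used by CKR+.
    return [f"a photo of a {obj} in the {room}" for obj in objects]

def cosine_similarity(a, b):
    # Standard cosine similarity between two embedding vectors.
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(np.dot(a, b))

def rank_views(text_emb, view_embs):
    # Rank candidate viewpoints by text-image similarity, mimicking how
    # CLIP-derived knowledge could steer goal-oriented exploration.
    return sorted(range(len(view_embs)),
                  key=lambda i: cosine_similarity(text_emb, view_embs[i]),
                  reverse=True)

# Toy stand-in embeddings; real CLIP encoders would produce these.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=512)
view_embs = [rng.normal(size=512) for _ in range(4)]
ranking = rank_views(text_emb, view_embs)
```

In this sketch, the best-ranked view would be the agent's next navigation candidate; the actual KERR+ module fuses such similarity cues with knowledge-graph reasoning rather than using them alone.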