Mi Jinpeng, Lyu Jianzhi, Tang Song, Li Qingdu, Zhang Jianwei
Institute of Machine Intelligence (IMI), University of Shanghai for Science and Technology, Shanghai, China.
Technical Aspects of Multimodal Systems, Department of Informatics, University of Hamburg, Hamburg, Germany.
Front Neurorobot. 2020 Jun 25;14:43. doi: 10.3389/fnbot.2020.00043. eCollection 2020.
Natural language provides an intuitive and effective interaction interface between humans and robots. Multiple approaches have been proposed to address natural language visual grounding for human-robot interaction. However, most existing approaches handle the ambiguity of natural language queries and ground target objects via dialogue systems, which makes interaction cumbersome and time-consuming. In contrast, we address interactive natural language grounding without auxiliary information. Specifically, we first propose a referring expression comprehension network to ground natural referring expressions. The network excavates visual semantics via a visual semantic-aware network and exploits the rich linguistic context in expressions via a language attention network. Furthermore, we combine the referring expression comprehension network with scene graph parsing to ground unrestricted and complicated natural language queries. Finally, we validate the performance of the referring expression comprehension network on three public datasets, and we evaluate the effectiveness of the interactive natural language grounding architecture by grounding extensive natural language queries in different household scenarios.
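The two-stage pipeline described above can be sketched in miniature: first parse a complicated query into a simple scene graph of (subject, relation, object) triples, then ground each phrase against candidate regions. This is an illustrative sketch only, not the authors' code: the toy rule-based parser and the label-matching scorer are hypothetical stand-ins for the learned scene graph parser and the referring expression comprehension network.

```python
# Hypothetical sketch of the two-stage grounding pipeline (not the paper's
# implementation): a rule-based scene graph parser plus a placeholder
# grounding function standing in for the comprehension network.
from dataclasses import dataclass

@dataclass
class Region:
    name: str            # object label from a hypothetical detector
    box: tuple           # bounding box (x, y, w, h)

def parse_scene_graph(query):
    """Toy parser: split an 'A <relation> B' query into one triple.
    A real system would use a learned scene graph parser."""
    for rel in ("on", "next to", "left of"):
        marker = f" {rel} "
        if marker in query:
            subj, obj = query.split(marker, 1)
            return (subj.strip(), rel, obj.strip())
    return (query.strip(), None, None)

def ground(phrase, regions):
    """Placeholder grounding: pick the region whose label occurs in the
    phrase (stands in for the referring expression comprehension net)."""
    for region in regions:
        if region.name in phrase:
            return region
    return None

# Usage: ground the subject of a relational query in a household scene.
regions = [Region("cup", (40, 20, 30, 30)), Region("table", (0, 50, 200, 80))]
subj, rel, obj = parse_scene_graph("the cup on the table")
target = ground(subj, regions)
```

The design point is the decomposition itself: parsing the query into a graph lets the same comprehension network, trained only on referring expressions, handle longer unrestricted queries one phrase at a time.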