
Indoor Scene Change Captioning Based on Multimodality Data.

Affiliations

Graduate School of Science and Technology, University of Tsukuba, Tsukuba 305-8577, Japan.

National Institute of Advanced Industrial Science and Technology (AIST), Tsukuba 305-8560, Japan.

Publication

Sensors (Basel). 2020 Aug 23;20(17):4761. doi: 10.3390/s20174761.

Abstract

This study proposes a framework for describing a scene change in natural language text based on indoor scene observations conducted before and after the change. The recognition of scene changes plays an essential role in a variety of real-world applications, such as scene anomaly detection. Most scene understanding research has focused on static scenes, and most existing scene change captioning methods detect changes from single-view RGB images, neglecting the underlying three-dimensional structure. Previous three-dimensional scene change captioning methods use simulated scenes consisting of geometric primitives, making them unsuitable for real-world applications. To solve these problems, we automatically generated large-scale indoor scene change caption datasets. We propose an end-to-end framework for describing scene changes from various input modalities, namely, RGB images, depth images, and point cloud data, which are available in most robot applications. We conducted experiments with various input modalities and models and evaluated model performance using datasets with various levels of complexity. Experimental results show that models combining RGB images and point cloud data as input achieve high performance in sentence generation and caption correctness and are robust in change-type understanding on high-complexity datasets. The developed datasets and models contribute to the study of indoor scene change understanding.
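The pipeline described above (encode each modality for the before/after observations, fuse the features, then decode a change caption) can be sketched minimally. This is not the authors' code: the encoders below are hypothetical stand-ins for learned networks (a CNN for RGB, a PointNet-style encoder for point clouds), and the rule-based decoder stands in for a trained language model.

```python
# Hedged sketch of a dual-observation, multimodal change-captioning pipeline.
# All function bodies are toy stand-ins for learned components.

def encode_rgb(image):
    """Stand-in for a CNN: mean intensity of a 2-D list of pixels."""
    flat = [p for row in image for p in row]
    return [sum(flat) / len(flat)]

def encode_points(points):
    """Stand-in for a PointNet-style encoder: per-axis max pooling
    over a list of (x, y, z) tuples."""
    return [max(p[i] for p in points) for i in range(3)]

def fuse(before, after):
    """Concatenate before/after features and their elementwise
    difference, mirroring the before/after observation pairing."""
    return before + after + [a - b for a, b in zip(after, before)]

def caption(feature, threshold=0.5):
    """Stand-in for a caption decoder: a rule on the magnitude of
    the difference component (the last third of the fused vector)."""
    diff = feature[len(feature) * 2 // 3:]
    change = sum(abs(x) for x in diff)
    return "an object was moved" if change > threshold else "no change"
```

Usage follows the paper's multimodal setup: concatenate the per-modality features of each observation, then fuse and decode, e.g. `caption(fuse(encode_rgb(img_b) + encode_points(pc_b), encode_rgb(img_a) + encode_points(pc_a)))`.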

