TAMC: Textual Alignment and Masked Consistency for Open-Vocabulary 3D Scene Understanding.

Authors

Wang Juan, Wang Zhijie, Miyazaki Tomo, Fan Yaohou, Omachi Shinichiro

Affiliations

Department of Communications Engineering, Graduate School of Engineering, Tohoku University, Sendai 9808579, Japan.

RIKEN AIP, Tokyo 1030027, Japan.

Publication

Sensors (Basel). 2024 Sep 24;24(19):6166. doi: 10.3390/s24196166.

DOI: 10.3390/s24196166
PMID: 39409206
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11478597/
Abstract

Three-dimensional (3D) Scene Understanding achieves environmental perception by extracting and analyzing point cloud data with wide applications including virtual reality, robotics, etc. Previous methods align the 2D image feature from a pre-trained CLIP model and the 3D point cloud feature for the open vocabulary scene understanding ability. We believe that existing methods have the following two deficiencies: (1) the 3D feature extraction process ignores the challenges of real scenarios, i.e., point cloud data are very sparse and even incomplete; (2) the training stage lacks direct text supervision, leading to inconsistency with the inference stage. To address the first issue, we employ a Masked Consistency training policy. Specifically, during the alignment of 3D and 2D features, we mask some 3D features to force the model to understand the entire scene using only partial 3D features. For the second issue, we generate pseudo-text labels and align them with the 3D features during the training process. In particular, we first generate a description for each 2D image belonging to the same 3D scene and then use a summarization model to fuse these descriptions into a single description of the scene. Subsequently, we align 2D-3D features and 3D-text features simultaneously during training. Massive experiments demonstrate the effectiveness of our method, outperforming state-of-the-art approaches.
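The two training signals described in the abstract — masking part of the 3D features during 2D–3D alignment, and additionally aligning a scene-level 3D feature with a pseudo-text feature — can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the feature shapes, the cosine-distance loss, the random masking scheme, and the names `masked_alignment_loss` and `cosine_distance` are all hypothetical; the paper's actual losses and masking policy may differ.

```python
import numpy as np

def cosine_distance(a, b):
    # Mean (1 - cosine similarity) over paired rows of a and b.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(a * b, axis=1)))

def masked_alignment_loss(feat_3d, feat_2d, feat_text, mask_ratio=0.3, rng=None):
    """Hypothetical masked-consistency alignment sketch (not the paper's code).

    feat_3d:   (N, D) per-point 3D features
    feat_2d:   (N, D) corresponding projected 2D (CLIP-style) features
    feat_text: (D,)   scene-level pseudo-caption feature

    A random subset of 3D features is masked out; the model must match the
    remaining visible 3D features to their 2D counterparts (2D-3D term),
    while the mean of the visible 3D features is pulled toward the
    scene-level text feature (3D-text term).
    """
    rng = np.random.default_rng(rng)
    n = feat_3d.shape[0]
    keep = rng.random(n) >= mask_ratio          # True = point stays visible
    vis_3d = feat_3d[keep]
    # 2D-3D alignment on visible points only: partial 3D input must still
    # reproduce the full 2D supervision signal.
    loss_2d3d = cosine_distance(vis_3d, feat_2d[keep])
    # 3D-text alignment: pool visible 3D features into one scene feature.
    scene_3d = vis_3d.mean(axis=0, keepdims=True)
    loss_text = cosine_distance(scene_3d, feat_text[None, :])
    return loss_2d3d + loss_text
```

With perfectly aligned features the loss vanishes; mismatched 2D and 3D features drive it up, which is the behavior the combined objective needs.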


Figures (PMC):
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e4e1/11478597/abe649dc287d/sensors-24-06166-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e4e1/11478597/7ae472ddc00e/sensors-24-06166-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e4e1/11478597/656b4b18504f/sensors-24-06166-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e4e1/11478597/f7ed8b19bc32/sensors-24-06166-g004.jpg

Similar Articles

1. TAMC: Textual Alignment and Masked Consistency for Open-Vocabulary 3D Scene Understanding.
Sensors (Basel). 2024 Sep 24;24(19):6166. doi: 10.3390/s24196166.
2. Lowis3D: Language-Driven Open-World Instance-Level 3D Scene Understanding.
IEEE Trans Pattern Anal Mach Intell. 2024 Dec;46(12):8517-8533. doi: 10.1109/TPAMI.2024.3410324. Epub 2024 Nov 6.
3. Text2NeRF: Text-Driven 3D Scene Generation With Neural Radiance Fields.
IEEE Trans Vis Comput Graph. 2024 Dec;30(12):7749-7762. doi: 10.1109/TVCG.2024.3361502. Epub 2024 Oct 28.
4. Learning Virtual View Selection for 3D Scene Semantic Segmentation.
IEEE Trans Image Process. 2024;33:4159-4172. doi: 10.1109/TIP.2024.3421952. Epub 2024 Jul 16.
5. Self-Supervised 3D Scene Flow Estimation and Motion Prediction Using Local Rigidity Prior.
IEEE Trans Pattern Anal Mach Intell. 2024 Dec;46(12):8106-8122. doi: 10.1109/TPAMI.2024.3401029. Epub 2024 Nov 6.
6. SSR-2D: Semantic 3D Scene Reconstruction From 2D Images.
IEEE Trans Pattern Anal Mach Intell. 2024 Dec;46(12):8486-8501. doi: 10.1109/TPAMI.2024.3410032. Epub 2024 Nov 6.
7. Complete 3D Relationships Extraction Modality Alignment Network for 3D Dense Captioning.
IEEE Trans Vis Comput Graph. 2024 Aug;30(8):4867-4880. doi: 10.1109/TVCG.2023.3279204. Epub 2024 Jul 1.
8. Efficient 3D Scene Semantic Segmentation via Active Learning on Rendered 2D Images.
IEEE Trans Image Process. 2023;32:3521-3535. doi: 10.1109/TIP.2023.3286708. Epub 2023 Jun 29.
9. Image Understands Point Cloud: Weakly Supervised 3D Semantic Segmentation via Association Learning.
IEEE Trans Image Process. 2024;33:1838-1852. doi: 10.1109/TIP.2024.3372449. Epub 2024 Mar 12.
10. Transfer Learning Based Semantic Segmentation for 3D Object Detection from Point Cloud.
Sensors (Basel). 2021 Jun 8;21(12):3964. doi: 10.3390/s21123964.
