TAMC: Textual Alignment and Masked Consistency for Open-Vocabulary 3D Scene Understanding.

Authors

Wang Juan, Wang Zhijie, Miyazaki Tomo, Fan Yaohou, Omachi Shinichiro

Affiliations

Department of Communications Engineering, Graduate School of Engineering, Tohoku University, Sendai 9808579, Japan.

RIKEN AIP, Tokyo 1030027, Japan.

Publication

Sensors (Basel). 2024 Sep 24;24(19):6166. doi: 10.3390/s24196166.

DOI: 10.3390/s24196166
PMID: 39409206
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11478597/
Abstract

Three-dimensional (3D) Scene Understanding achieves environmental perception by extracting and analyzing point cloud data with wide applications including virtual reality, robotics, etc. Previous methods align the 2D image feature from a pre-trained CLIP model and the 3D point cloud feature for the open vocabulary scene understanding ability. We believe that existing methods have the following two deficiencies: (1) the 3D feature extraction process ignores the challenges of real scenarios, i.e., point cloud data are very sparse and even incomplete; (2) the training stage lacks direct text supervision, leading to inconsistency with the inference stage. To address the first issue, we employ a Masked Consistency training policy. Specifically, during the alignment of 3D and 2D features, we mask some 3D features to force the model to understand the entire scene using only partial 3D features. For the second issue, we generate pseudo-text labels and align them with the 3D features during the training process. In particular, we first generate a description for each 2D image belonging to the same 3D scene and then use a summarization model to fuse these descriptions into a single description of the scene. Subsequently, we align 2D-3D features and 3D-text features simultaneously during training. Massive experiments demonstrate the effectiveness of our method, outperforming state-of-the-art approaches.
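The two training signals described in the abstract — masking part of the 3D features during 2D–3D alignment, and additionally aligning a scene-level 3D feature with a pseudo-text feature — can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the feature shapes, the cosine-distance loss, the random masking scheme, and the names `masked_alignment_loss` and `cosine_distance` are all hypothetical; the paper's actual losses and masking policy may differ.

```python
import numpy as np

def cosine_distance(a, b):
    # Mean (1 - cosine similarity) over paired rows of a and b.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(a * b, axis=1)))

def masked_alignment_loss(feat_3d, feat_2d, feat_text, mask_ratio=0.3, rng=None):
    """Hypothetical masked-consistency alignment sketch (not the paper's code).

    feat_3d:   (N, D) per-point 3D features
    feat_2d:   (N, D) corresponding projected 2D (CLIP-style) features
    feat_text: (D,)   scene-level pseudo-caption feature

    A random subset of 3D features is masked out; the model must match the
    remaining visible 3D features to their 2D counterparts (2D-3D term),
    while the mean of the visible 3D features is pulled toward the
    scene-level text feature (3D-text term).
    """
    rng = np.random.default_rng(rng)
    n = feat_3d.shape[0]
    keep = rng.random(n) >= mask_ratio          # True = point stays visible
    vis_3d = feat_3d[keep]
    # 2D-3D alignment on visible points only: partial 3D input must still
    # reproduce the full 2D supervision signal.
    loss_2d3d = cosine_distance(vis_3d, feat_2d[keep])
    # 3D-text alignment: pool visible 3D features into one scene feature.
    scene_3d = vis_3d.mean(axis=0, keepdims=True)
    loss_text = cosine_distance(scene_3d, feat_text[None, :])
    return loss_2d3d + loss_text
```

With perfectly aligned features the loss vanishes; mismatched 2D and 3D features drive it up, which is the behavior the combined objective needs.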


Figures (PMC):
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e4e1/11478597/abe649dc287d/sensors-24-06166-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e4e1/11478597/7ae472ddc00e/sensors-24-06166-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e4e1/11478597/656b4b18504f/sensors-24-06166-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e4e1/11478597/f7ed8b19bc32/sensors-24-06166-g004.jpg

Similar Articles

1. TAMC: Textual Alignment and Masked Consistency for Open-Vocabulary 3D Scene Understanding.
Sensors (Basel). 2024 Sep 24;24(19):6166. doi: 10.3390/s24196166.
2. Lowis3D: Language-Driven Open-World Instance-Level 3D Scene Understanding.
IEEE Trans Pattern Anal Mach Intell. 2024 Dec;46(12):8517-8533. doi: 10.1109/TPAMI.2024.3410324. Epub 2024 Nov 6.
3. Text2NeRF: Text-Driven 3D Scene Generation With Neural Radiance Fields.
IEEE Trans Vis Comput Graph. 2024 Dec;30(12):7749-7762. doi: 10.1109/TVCG.2024.3361502. Epub 2024 Oct 28.
4. Learning Virtual View Selection for 3D Scene Semantic Segmentation.
IEEE Trans Image Process. 2024;33:4159-4172. doi: 10.1109/TIP.2024.3421952. Epub 2024 Jul 16.
5. Self-Supervised 3D Scene Flow Estimation and Motion Prediction Using Local Rigidity Prior.
IEEE Trans Pattern Anal Mach Intell. 2024 Dec;46(12):8106-8122. doi: 10.1109/TPAMI.2024.3401029. Epub 2024 Nov 6.
6. SSR-2D: Semantic 3D Scene Reconstruction From 2D Images.
IEEE Trans Pattern Anal Mach Intell. 2024 Dec;46(12):8486-8501. doi: 10.1109/TPAMI.2024.3410032. Epub 2024 Nov 6.
7. Complete 3D Relationships Extraction Modality Alignment Network for 3D Dense Captioning.
IEEE Trans Vis Comput Graph. 2024 Aug;30(8):4867-4880. doi: 10.1109/TVCG.2023.3279204. Epub 2024 Jul 1.
8. Efficient 3D Scene Semantic Segmentation via Active Learning on Rendered 2D Images.
IEEE Trans Image Process. 2023;32:3521-3535. doi: 10.1109/TIP.2023.3286708. Epub 2023 Jun 29.
9. Image Understands Point Cloud: Weakly Supervised 3D Semantic Segmentation via Association Learning.
IEEE Trans Image Process. 2024;33:1838-1852. doi: 10.1109/TIP.2024.3372449. Epub 2024 Mar 12.
10. Transfer Learning Based Semantic Segmentation for 3D Object Detection from Point Cloud.
Sensors (Basel). 2021 Jun 8;21(12):3964. doi: 10.3390/s21123964.
