

Leveraging two-dimensional pre-trained vision transformers for three-dimensional model generation via masked autoencoders.

Authors

Sajid Muhammad, Razzaq Malik Kaleem, Ur Rehman Ateeq, Safdar Malik Tauqeer, Alajmi Masoud, Haider Khan Ali, Haider Amir, Hussen Seada

Affiliations

Department of Computer Science, Air University, Islamabad, 44230, Pakistan.

Computer Science and Engineering, Saveetha School of Engineering, Saveetha Institute of Medical and Technical Sciences, Chennai, Tamilnadu, India.

Publication

Sci Rep. 2025 Jan 25;15(1):3164. doi: 10.1038/s41598-025-87376-y.

DOI: 10.1038/s41598-025-87376-y
PMID: 39863694
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11763031/
Abstract

Although the Transformer architecture has established itself as the standard for natural language processing tasks, its applications in computer vision remain limited. In vision, attention is either used in conjunction with convolutional networks or substituted for individual convolutional components while preserving the overall network design. Differences between the two domains, such as large variations in the scale of visual entities and the much finer granularity of pixels in images compared with words in text, make it difficult to transfer the Transformer from language to vision. Masked autoencoding is a promising self-supervised learning approach that has greatly advanced both computer vision and natural language processing. Pre-training on large image datasets has become standard practice for learning robust 2D representations. In contrast, the scarcity of 3D datasets and the high cost of 3D data processing significantly impede learning high-quality 3D features. We present a multi-scale MAE pre-training architecture that uses a pre-trained ViT and a 3D representation model derived from 2D images to let 3D point clouds learn on their own. We leverage this rich 2D knowledge to guide a 3D masked autoencoder, which uses an encoder-decoder architecture to reconstruct the masked point tokens through self-supervised pre-training. We first use pre-trained 2D models to acquire multi-view visual features of the input point cloud, and then introduce a 2D-guided masking strategy that keeps semantically significant point tokens visible. Extensive experiments demonstrate that our method works effectively with pre-trained models and generalizes well to a range of downstream tasks. In particular, our pre-trained model achieves 93.63% linear SVM accuracy on ScanObjectNN and 91.31% on ModelNet40. Our approach demonstrates how a straightforward architecture based solely on standard Transformers can outperform specialized transformer models trained with supervised learning.
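The masking-and-reconstruction pipeline the abstract describes can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation: the random `saliency` scores stand in for the multi-view features extracted by the pre-trained 2D ViT, and the linear encoder/decoder maps stand in for the transformer blocks.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_tokens(tokens, saliency, mask_ratio=0.6):
    """Split point tokens into visible and masked index sets.

    Tokens with the highest 2D-derived saliency stay visible,
    mimicking the paper's semantically guided masking; the true
    saliency source (multi-view ViT features) is simplified here.
    """
    n = tokens.shape[0]
    n_mask = int(n * mask_ratio)
    order = np.argsort(saliency)       # least salient first
    masked_idx = order[:n_mask]        # mask the least salient tokens
    visible_idx = order[n_mask:]       # keep the most salient visible
    return visible_idx, masked_idx

# Toy data: 128 point tokens with 32-dim embeddings.
tokens = rng.normal(size=(128, 32))
saliency = rng.random(128)

vis, msk = mask_tokens(tokens, saliency, mask_ratio=0.6)

# The encoder sees only visible tokens; the decoder predicts the
# masked ones. Both are stand-in linear maps for transformer blocks.
W_enc = rng.normal(size=(32, 32)) / np.sqrt(32)
W_dec = rng.normal(size=(32, 32)) / np.sqrt(32)

latent = np.tanh(tokens[vis] @ W_enc)
# Placeholder decoder query: pooled latent broadcast to each masked slot.
recon = np.repeat(latent.mean(axis=0, keepdims=True), len(msk), axis=0) @ W_dec

# As in MAE, the reconstruction loss is computed on masked tokens only.
loss = np.mean((recon - tokens[msk]) ** 2)
print(f"masked {len(msk)}/{len(tokens)} tokens, recon MSE = {loss:.3f}")
```

The key design choice illustrated is that the loss is taken only over masked positions, so the encoder's capacity is spent on the visible, semantically salient tokens while the decoder learns to infer the occluded geometry.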


Figures:
Fig 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7a64/11763031/66be67c97552/41598_2025_87376_Fig1_HTML.jpg
Fig 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7a64/11763031/04b2acc1f53a/41598_2025_87376_Fig2_HTML.jpg
Fig 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7a64/11763031/d7b030929833/41598_2025_87376_Fig3_HTML.jpg

Similar Articles

1
Leveraging two-dimensional pre-trained vision transformers for three-dimensional model generation via masked autoencoders.
Sci Rep. 2025 Jan 25;15(1):3164. doi: 10.1038/s41598-025-87376-y.
2
Do it the transformer way: A comprehensive review of brain and vision transformers for autism spectrum disorder diagnosis and classification.
Comput Biol Med. 2023 Dec;167:107667. doi: 10.1016/j.compbiomed.2023.107667. Epub 2023 Nov 3.
3
RT-ViT: Real-Time Monocular Depth Estimation Using Lightweight Vision Transformers.
Sensors (Basel). 2022 May 19;22(10):3849. doi: 10.3390/s22103849.
4
Seeking an optimal approach for Computer-aided Diagnosis of Pulmonary Embolism.
Med Image Anal. 2024 Jan;91:102988. doi: 10.1016/j.media.2023.102988. Epub 2023 Oct 13.
5
What Does a Language-And-Vision Transformer See: The Impact of Semantic Information on Visual Representations.
Front Artif Intell. 2021 Dec 3;4:767971. doi: 10.3389/frai.2021.767971. eCollection 2021.
6
Mapping medical image-text to a joint space via masked modeling.
Med Image Anal. 2024 Jan;91:103018. doi: 10.1016/j.media.2023.103018. Epub 2023 Nov 4.
7
Cross-Attention Based Multi-Resolution Feature Fusion Model for Self-Supervised Cervical OCT Image Classification.
IEEE/ACM Trans Comput Biol Bioinform. 2023 Jul-Aug;20(4):2541-2554. doi: 10.1109/TCBB.2023.3246979. Epub 2023 Aug 9.
8
Learning the heterogeneous representation of brain's structure from serial SEM images using a masked autoencoder.
Front Neuroinform. 2023 Jun 8;17:1118419. doi: 10.3389/fninf.2023.1118419. eCollection 2023.
9
Analyzing Transfer Learning of Vision Transformers for Interpreting Chest Radiography.
J Digit Imaging. 2022 Dec;35(6):1445-1462. doi: 10.1007/s10278-022-00666-z. Epub 2022 Jul 11.
10
Foundation models in gastrointestinal endoscopic AI: Impact of architecture, pre-training approach and data efficiency.
Med Image Anal. 2024 Dec;98:103298. doi: 10.1016/j.media.2024.103298. Epub 2024 Aug 12.

Cited By

1
A new approach of anomaly detection in shopping center surveillance videos for theft prevention based on RLCNN model.
PeerJ Comput Sci. 2025 Jun 18;11:e2944. doi: 10.7717/peerj-cs.2944. eCollection 2025.
2
A hybrid steganography framework using DCT and GAN for secure data communication in the big data era.
Sci Rep. 2025 Jun 4;15(1):19630. doi: 10.1038/s41598-025-01054-7.
