Sajid Muhammad, Razzaq Malik Kaleem, Ur Rehman Ateeq, Safdar Malik Tauqeer, Alajmi Masoud, Haider Khan Ali, Haider Amir, Hussen Seada
Department of Computer Science, Air University, Islamabad, 44230, Pakistan.
Computer Science and Engineering, Saveetha School of Engineering, Saveetha Institute of Medical and Technical Sciences, Chennai, Tamilnadu, India.
Sci Rep. 2025 Jan 25;15(1):3164. doi: 10.1038/s41598-025-87376-y.
Although the Transformer architecture has established itself as the standard for natural language processing tasks, its applications in computer vision remain limited. In vision, attention is either used in conjunction with convolutional networks or used to replace individual convolutional components while preserving the overall network design. Differences between the two domains, such as large variations in the scale of visual entities and the much finer granularity of pixels in images compared with words in text, make it difficult to transfer the Transformer from language to vision. Masked autoencoding is a promising self-supervised learning approach that has greatly advanced both computer vision and natural language processing. Pre-training on large-scale image data has become standard practice for learning robust 2D representations. In contrast, the limited availability of 3D datasets, owing to the high cost of data acquisition and processing, significantly impedes learning high-quality 3D features. We present a strong multi-scale MAE pre-training architecture that uses a pre-trained ViT and a 3D representation model derived from 2D images to enable self-supervised learning on 3D point clouds. We leverage this well-learned 2D knowledge to guide a 3D masked autoencoder, which reconstructs the masked point tokens with an encoder-decoder architecture during self-supervised pre-training. Specifically, we first use pre-trained 2D models to obtain multi-view visual features of the input point cloud. We then introduce a 2D masking method that preserves the visibility of semantically significant point tokens. Extensive experiments demonstrate that our method works effectively with pre-trained models and generalizes well to a range of downstream tasks. In particular, our pre-trained model achieves 93.63% linear SVM accuracy on ScanObjectNN and 91.31% on ModelNet40. Our approach demonstrates that a straightforward architecture based solely on standard Transformers can outperform specialized Transformer models trained with supervised learning.
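To make the masked-autoencoding pipeline described in the abstract concrete, the following is a minimal PyTorch sketch of a Point-MAE-style model: a point cloud is grouped into patch tokens, a fraction of the tokens is masked, only the visible tokens pass through a Transformer encoder, and a lightweight decoder reconstructs the masked patches. It is an illustrative sketch only, not the authors' implementation: the module names (PointPatchEmbed, PointMAE), the naive consecutive-point grouping, the random masking (the paper instead uses 2D-guided masking informed by multi-view features), the mask ratio, and the MSE reconstruction loss (a stand-in for a Chamfer-style loss) are all assumptions.

```python
import torch
import torch.nn as nn

# Minimal sketch of a masked autoencoder for point clouds.
# All names and hyperparameters are illustrative assumptions, not the paper's code.

class PointPatchEmbed(nn.Module):
    """Group a point cloud into patches and embed each patch as a token."""
    def __init__(self, num_patches=64, patch_size=32, dim=384):
        super().__init__()
        self.num_patches, self.patch_size = num_patches, patch_size
        self.mlp = nn.Sequential(nn.Linear(patch_size * 3, dim), nn.GELU(),
                                 nn.Linear(dim, dim))

    def forward(self, pts):                       # pts: (B, N, 3)
        B = pts.shape[0]
        # Naive grouping: consecutive points form a patch.
        # A real pipeline would use farthest point sampling + k-NN grouping.
        patches = pts[:, :self.num_patches * self.patch_size, :]
        centers = patches.reshape(B, self.num_patches, self.patch_size, 3).mean(2)
        tokens = self.mlp(patches.reshape(B, self.num_patches, -1))
        return tokens, centers                    # (B, P, dim), (B, P, 3)

class PointMAE(nn.Module):
    def __init__(self, dim=384, mask_ratio=0.6, patch_size=32):
        super().__init__()
        self.mask_ratio, self.patch_size = mask_ratio, patch_size
        self.embed = PointPatchEmbed(dim=dim, patch_size=patch_size)
        self.pos_mlp = nn.Sequential(nn.Linear(3, dim), nn.GELU(), nn.Linear(dim, dim))
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=6, batch_first=True)
        dec_layer = nn.TransformerEncoderLayer(dim, nhead=6, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=6)
        self.decoder = nn.TransformerEncoder(dec_layer, num_layers=2)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.head = nn.Linear(dim, patch_size * 3)   # reconstruct patch coordinates

    def forward(self, pts):
        tokens, centers = self.embed(pts)            # (B, P, dim)
        tokens = tokens + self.pos_mlp(centers)      # positional cue from patch centers
        B, P, D = tokens.shape
        num_mask = int(self.mask_ratio * P)
        # Random masking; the paper instead uses 2D semantics to decide
        # which point tokens remain visible.
        perm = torch.rand(B, P, device=tokens.device).argsort(dim=1)
        mask_idx, vis_idx = perm[:, :num_mask], perm[:, num_mask:]
        batch = torch.arange(B, device=tokens.device).unsqueeze(1)
        latent = self.encoder(tokens[batch, vis_idx])          # encode visible tokens only
        # Re-insert mask tokens and decode the full token sequence.
        full = self.mask_token.expand(B, P, D).clone()
        full[batch, vis_idx] = latent
        decoded = self.decoder(full)
        pred = self.head(decoded[batch, mask_idx])             # (B, M, patch_size*3)
        target = pts[:, :P * self.patch_size].reshape(B, P, -1)[batch, mask_idx]
        return nn.functional.mse_loss(pred, target)            # stand-in for Chamfer loss

# Usage sketch: pre-train on unlabeled point clouds.
model = PointMAE()
loss = model(torch.randn(2, 2048, 3))
loss.backward()
```

After such pre-training, the frozen encoder's token features can be pooled into a global descriptor and evaluated with a linear SVM, which is the protocol behind the reported ScanObjectNN and ModelNet40 accuracies.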