
RT-ViT: Real-Time Monocular Depth Estimation Using Lightweight Vision Transformers.

Affiliations

Department of Information and Communication Engineering, School of Electrical and Computer Engineering, Chungbuk National University, Cheongju-si 28644, Korea.

Electrical Engineering Department, Faculty of Engineering, Assiut University, Assiut 71515, Egypt.

Publication Information

Sensors (Basel). 2022 May 19;22(10):3849. doi: 10.3390/s22103849.

Abstract

Recent research in computer vision has highlighted the effectiveness of vision transformers (ViTs) on several computer vision tasks; unlike convolutions, which process the image locally, they can efficiently understand and process the image globally. ViTs outperform convolutional neural networks in accuracy on many computer vision tasks, but their speed remains an issue due to the heavy use of transformer layers, which contain many fully connected layers. We therefore propose a real-time ViT-based monocular depth estimation method (depth estimation from a single RGB image) with encoder-decoder architectures for indoor and outdoor scenes. The main architecture of the proposed method consists of a vision transformer encoder and a convolutional neural network decoder. We started by training the base vision transformer (ViT-b16) with 12 transformer layers, then reduced the transformer layers to six (ViT-s16, the Small ViT) and four (ViT-t16, the Tiny ViT) to obtain real-time processing. We also tried four different configurations of the CNN decoder network. The proposed architectures learn the depth estimation task efficiently and, by taking advantage of the multi-head self-attention module, produce more accurate depth predictions than fully convolutional methods. We train the proposed encoder-decoder architecture end-to-end on the challenging NYU-depthV2 and CITYSCAPES benchmarks and then evaluate the trained models on the validation and test sets of the same benchmarks, showing that it outperforms many state-of-the-art depth estimation methods while running in real time (∼20 fps). We also present a fast 3D reconstruction experiment (∼17 fps) based on the depth estimated by our method, which represents a real-world application of our method.
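
To make the described pipeline concrete, below is a minimal PyTorch sketch of a ViT-encoder / CNN-decoder depth network in the spirit of the abstract. It is not the authors' released implementation: the embedding width, head count, positional-embedding handling, and decoder layout shown here are illustrative assumptions (the paper itself explores four decoder configurations and 12/6/4-layer encoders).

```python
# A minimal sketch (not the authors' code) of a ViT-encoder / CNN-decoder
# monocular depth network, assuming 16x16 patches as in the ViT-*16 variants.
import torch
import torch.nn as nn

class ViTDepth(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=384, layers=6, heads=6):
        super().__init__()
        self.grid = img_size // patch                      # 14 tokens per side for 224/16
        # Patch embedding: one strided conv turns the RGB image into a token grid.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, self.grid * self.grid, dim))
        enc_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)
        # "layers" mirrors the paper's reduced-depth variants
        # (12 = ViT-b16, 6 = ViT-s16, 4 = ViT-t16).
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)

        # A simple upsampling CNN decoder; the paper tries four decoder
        # configurations, this is only one plausible layout.
        def up(cin, cout):
            return nn.Sequential(
                nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
                nn.Conv2d(cin, cout, 3, padding=1),
                nn.ReLU(inplace=True))
        self.decoder = nn.Sequential(
            up(dim, 256), up(256, 128), up(128, 64), up(64, 32),
            nn.Conv2d(32, 1, 3, padding=1))                # 1-channel depth map

    def forward(self, x):
        b = x.size(0)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)   # (B, N, dim)
        tokens = self.encoder(tokens + self.pos)
        # Fold the token sequence back into a 2D feature map for the decoder.
        feat = tokens.transpose(1, 2).reshape(b, -1, self.grid, self.grid)
        return self.decoder(feat)                          # (B, 1, H, W)

if __name__ == "__main__":
    model = ViTDepth(layers=4)                 # "tiny" encoder depth for speed
    depth = model(torch.randn(1, 3, 224, 224))
    print(depth.shape)                         # torch.Size([1, 1, 224, 224])
```

Swapping `layers` between 12, 6, and 4 mimics the ViT-b16 / ViT-s16 / ViT-t16 trade-off between accuracy and inference speed that the abstract reports.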

[Figure] https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3b84/9143167/146d40ae5d88/sensors-22-03849-g0A1.jpg
