Su Siyuan, Wu Jian
National Key Laboratory of Automotive Chassis Integration and Bionics, Jilin University, Changchun 130025, China.
Sensors (Basel). 2024 Dec 18;24(24):8066. doi: 10.3390/s24248066.
Depth completion is widely employed in Simultaneous Localization and Mapping (SLAM) and Structure from Motion (SfM), both of which are of great significance to the development of autonomous driving. Recently, methods based on the fusion of the vision transformer (ViT) and convolution have raised accuracy to a new level. However, two shortcomings remain to be solved. On the one hand, to address the poor performance of ViT on fine details, this paper proposes a semi-convolutional vision transformer to improve local continuity and designs a geometric perception module that learns the positional correlation and geometric features of sparse points in three-dimensional space, perceiving the geometric structures in depth maps to improve the recovery of edges and transparent areas. On the other hand, previous methods perform single-stage fusion, directly concatenating or adding the outputs of ViT and convolution; the two are therefore fused incompletely, which produces many outliers and ripples, especially in complex outdoor scenes. This paper proposes a novel double-stage fusion strategy that applies a learnable confidence after self-attention to flexibly weight local features. Our network achieves state-of-the-art (SoTA) performance on the NYU-Depth-v2 Dataset and the KITTI Depth Completion Dataset. Notably, the root mean square error (RMSE) of our model on the NYU-Depth-v2 Dataset is 87.9 mm, currently the best among all algorithms. At the end of the article, we also verify the model's generalization ability in real road scenes.
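To make the double-stage fusion idea concrete, the following is a minimal PyTorch sketch of one possible reading of the abstract: a confidence map is learned after self-attention and used to weight the local (convolutional) features before a second fusion step merges them with the global (transformer) branch. All module names, shapes, and layer choices here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class ConfidenceGuidedFusion(nn.Module):
    """Hypothetical two-stage ViT/convolution fusion block (sketch only)."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        # Global branch: multi-head self-attention over flattened feature tokens.
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        # Local branch: small convolution stack preserving spatial detail.
        self.local = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        # Stage 1: learnable confidence predicted from the attention output.
        self.confidence = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())
        # Stage 2: final fusion of the two branches.
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)            # (B, H*W, C)
        attn_out, _ = self.attn(tokens, tokens, tokens)  # global context
        global_feat = attn_out.transpose(1, 2).reshape(b, c, h, w)

        local_feat = self.local(x)
        # Stage 1: weight local features by the confidence learned after self-attention.
        conf = self.confidence(global_feat)              # (B, 1, H, W) in [0, 1]
        weighted_local = conf * local_feat
        # Stage 2: concatenate and project to complete the fusion.
        return self.fuse(torch.cat([global_feat, weighted_local], dim=1))


# Usage example on a dummy feature map:
if __name__ == "__main__":
    block = ConfidenceGuidedFusion(channels=64)
    out = block(torch.randn(2, 64, 32, 32))
    print(out.shape)  # torch.Size([2, 64, 32, 32])
```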