Yang Wei-Jong, Wu Chih-Chen, Yang Jar-Ferr
Department of Artificial Intelligence and Computer Engineering, National Chin-Yi University of Technology, Taichung 411, Taiwan.
Institute of Computer and Communication Engineering, Department of Electrical Engineering, National Cheng Kung University, Tainan 701, Taiwan.
Sensors (Basel). 2024 Dec 26;25(1):80. doi: 10.3390/s25010080.
Precise depth estimation plays a key role in many applications, including 3D scene reconstruction, virtual reality, autonomous driving, and human-computer interaction. With recent advances in deep learning, monocular depth estimation, thanks to its simplicity, has surpassed traditional stereo camera systems, bringing new possibilities to 3D sensing. In this paper, using a single camera, we propose an end-to-end supervised monocular depth estimation autoencoder, which contains an encoder that mixes a convolutional neural network with vision transformers and an effective adaptive fusion decoder, to obtain high-precision depth maps. In the encoder, we construct a multi-scale feature extractor by mixing residual configurations of vision transformers to enhance both local and global information. In the adaptive fusion decoder, we introduce adaptive fusion modules to effectively merge the features of the encoder and the decoder. Lastly, the model is trained with a loss function aligned with human perception, enabling it to focus on the depth values of foreground objects. The experimental results demonstrate that the proposed autoencoder effectively predicts the depth map from a single-view color image, increasing the first threshold accuracy rate by about 28% and reducing the root mean square error by about 27% compared to an existing method on the NYU dataset.
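The abstract does not specify how the adaptive fusion modules combine encoder skip features with decoder features. One common realization of such a module is a learned gate that forms a convex combination of the two feature streams; the sketch below illustrates that idea in plain Python. The function names, the scalar per-element gate, and the weight layout are illustrative assumptions, not the authors' design.

```python
import math


def sigmoid(x):
    """Logistic function, squashing any real score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def adaptive_fusion(enc_feat, dec_feat, gate_params):
    """Hypothetical gated merge of encoder and decoder features.

    For each feature element, a gate g in (0, 1) is computed from both
    inputs, and the fused value is the convex combination
    g * enc + (1 - g) * dec, so the module can adaptively favor the
    encoder's local detail or the decoder's upsampled context.

    enc_feat, dec_feat : lists of floats (same length)
    gate_params        : list of (w_enc, w_dec, bias) triples per element
    """
    fused = []
    for e, d, (w_e, w_d, b) in zip(enc_feat, dec_feat, gate_params):
        g = sigmoid(w_e * e + w_d * d + b)
        fused.append(g * e + (1.0 - g) * d)
    return fused


# Toy usage: two-element feature vectors with fixed (untrained) gate weights.
enc = [1.0, 0.0]
dec = [0.0, 1.0]
params = [(1.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
out = adaptive_fusion(enc, dec, params)
```

Because the gate lies strictly in (0, 1), each fused value always stays between the corresponding encoder and decoder values, which is what makes the merge "adaptive" rather than a fixed sum or concatenation.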