Wu Tianzhao, Xia Zhongyi, Zhou Man, Kong Ling Bing, Chen Zengyuan
College of New Materials and New Energies, Shenzhen University of Technology, Shenzhen, 518118, Guangdong, China.
College of Applied Technology, Shenzhen University, Shenzhen, 518060, Guangdong, China.
Sci Rep. 2024 Mar 11;14(1):5868. doi: 10.1038/s41598-024-56095-1.
Monocular depth estimation has a wide range of applications in the field of autostereoscopic displays, but its accuracy and robustness in complex scenes remain a challenge. In this paper, we propose AMENet, a depth estimation network for autostereoscopic displays that aims to improve the accuracy of monocular depth estimation by fusing a Vision Transformer (ViT) and a Convolutional Neural Network (CNN). Our approach feeds the input image into the ViT module as a sequence of visual features and exploits its global perception capability to extract high-level semantic features of the image. A weight correction module quantifies the relationship between the losses to improve the robustness of the model. Experimental results on several public datasets show that AMENet achieves higher accuracy and robustness than existing methods across different scenarios and complex conditions. In addition, we conducted a detailed experimental analysis to verify the effectiveness and stability of our method. On the KITTI dataset, accuracy improves by 4.4% over the baseline method. In summary, AMENet is a promising depth estimation method with sufficiently high robustness and accuracy for monocular depth estimation tasks.
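The abstract gives no implementation details, so the following is only a rough illustration of the kind of architecture it describes: a ViT branch for global semantics fused with a CNN branch for local detail, plus a learnable weighting of multiple losses. All module names, dimensions, the fusion scheme, and the uncertainty-style loss weighting below are assumptions for this sketch, not the published AMENet design.

```python
# Minimal PyTorch sketch of a ViT+CNN fusion depth estimator.
# Everything here (layer sizes, fusion by channel concatenation,
# homoscedastic-uncertainty loss weighting) is an illustrative
# assumption, not the authors' AMENet architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ViTCNNDepthNet(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=256, cnn_ch=256):
        super().__init__()
        # CNN branch: local detail features at 1/16 resolution.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, cnn_ch, 3, stride=4, padding=1), nn.ReLU(),
        )
        # ViT branch: the image becomes a sequence of patch tokens.
        self.patch_embed = nn.Conv2d(3, dim, patch, stride=patch)
        n_tokens = (img_size // patch) ** 2
        self.pos = nn.Parameter(torch.zeros(1, n_tokens, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.vit = nn.TransformerEncoder(layer, num_layers=4)
        # Fusion + decoder: concatenate channels, predict one depth map.
        self.decode = nn.Sequential(
            nn.Conv2d(dim + cnn_ch, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 1, 3, padding=1),
        )

    def forward(self, x):
        b, _, h, w = x.shape
        local = self.cnn(x)                            # (B, cnn_ch, h/16, w/16)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)
        tokens = self.vit(tokens + self.pos)           # global semantics
        glob = tokens.transpose(1, 2).reshape(b, -1, h // 16, w // 16)
        fused = torch.cat([local, glob], dim=1)
        depth = self.decode(fused)
        return F.interpolate(depth, size=(h, w), mode="bilinear",
                             align_corners=False)


class WeightedLoss(nn.Module):
    """Learnable loss balancing in the style of homoscedastic-uncertainty
    weighting (Kendall et al.); an assumed stand-in for the paper's
    weight correction module."""
    def __init__(self, n_losses=2):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(n_losses))

    def forward(self, losses):
        total = 0.0
        for i, l in enumerate(losses):
            # exp(-log_var) scales each loss; +log_var regularizes the weight.
            total = total + torch.exp(-self.log_vars[i]) * l + self.log_vars[i]
        return total


# Usage: a forward pass on a dummy batch.
model = ViTCNNDepthNet()
pred = model(torch.randn(2, 3, 224, 224))  # -> (2, 1, 224, 224)
```

Concatenating the two feature maps at 1/16 resolution is only one plausible fusion choice; attention-based or multi-scale fusion would fit the same description in the abstract.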