Dense monocular depth estimation for stereoscopic vision based on pyramid transformer and multi-scale feature fusion

Authors

Xia Zhongyi, Wu Tianzhao, Wang Zhuoyan, Zhou Man, Wu Boqi, Chan C Y, Kong Ling Bing

Affiliations

College of New Materials and New Energies, Shenzhen Technology University, Shenzhen, 518118, Guangdong, China.

College of Applied Technology, Shenzhen University, Shenzhen, 518000, Guangdong, China.

Publication information

Sci Rep. 2024 Mar 25;14(1):7037. doi: 10.1038/s41598-024-57908-z.

DOI:10.1038/s41598-024-57908-z

PMID:38528098

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC10963766/

Abstract

Stereoscopic display technology plays a significant role in industries such as film, television, and autonomous driving. The accuracy of depth estimation is crucial for achieving high-quality, realistic stereoscopic display effects. To address the inherent challenges of applying Transformers to depth estimation, the Stereoscopic Pyramid Transformer-Depth (SPT-Depth) is introduced. The method uses stepwise downsampling to acquire both shallow and deep semantic information, which are subsequently fused. Training is divided into fine and coarse convergence stages that use distinct training strategies and hyperparameters, resulting in a substantial reduction in both training and validation losses. In the training strategy, a shift- and scale-invariant mean square error function is employed to compensate for the lack of translational invariance in Transformers. Additionally, an edge-smoothing function is applied to reduce noise in the depth map, enhancing the model's robustness. SPT-Depth achieves a global receptive field while effectively reducing time complexity. Compared with the baseline method on the New York University Depth V2 (NYU Depth V2) dataset, Absolute Relative Error (Abs Rel) is reduced by 10% and Root Mean Square Error (RMSE) by 36%. Compared with state-of-the-art methods, RMSE is reduced by 17%.
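The abstract names two evaluation metrics (Abs Rel and RMSE) and a shift- and scale-invariant mean square error, but gives no formulas. The sketch below shows the commonly used definitions of the two metrics, along with one common way to make an MSE invariant to scale and shift: fitting a scale and an offset by least squares before measuring the error. The function names, the `gt > 0` validity mask, and the least-squares alignment are assumptions for illustration; the paper's exact loss may differ.

```python
import numpy as np

def abs_rel(pred, gt):
    """Absolute Relative Error: mean(|pred - gt| / gt) over valid pixels."""
    mask = gt > 0  # assumption: zero depth marks a missing ground-truth pixel
    return float(np.mean(np.abs(pred[mask] - gt[mask]) / gt[mask]))

def rmse(pred, gt):
    """Root Mean Square Error over valid pixels."""
    mask = gt > 0
    return float(np.sqrt(np.mean((pred[mask] - gt[mask]) ** 2)))

def scale_shift_invariant_mse(pred, gt):
    """MSE after aligning pred to gt with a least-squares scale s and shift t.

    Illustrative formulation only, not necessarily the one used by SPT-Depth.
    """
    mask = gt > 0
    p = pred[mask].ravel()
    g = gt[mask].ravel()
    # Solve min over (s, t) of || s * p + t - g ||^2.
    design = np.stack([p, np.ones_like(p)], axis=1)
    (s, t), *_ = np.linalg.lstsq(design, g, rcond=None)
    return float(np.mean((s * p + t - g) ** 2))
```

Under this formulation, a prediction that differs from the ground truth only by a global scale and offset yields an error of approximately zero, whereas plain RMSE would penalize it.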

