Hu Haifeng, Feng Yuyang, Li Dapeng, Zhang Suofei, Zhao Haitao
College of Telecommunications and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210003, China.
College of Internet of Things, Nanjing University of Posts and Telecommunications, Nanjing 210003, China.
Sensors (Basel). 2024 Jun 24;24(13):4090. doi: 10.3390/s24134090.
Self-supervised monocular depth estimation performs well in static environments because the multi-view consistency assumption holds during training. In dynamic scenes, however, occlusions caused by moving objects make it difficult to maintain depth consistency. For this reason, we propose a self-supervised self-distillation method for monocular depth estimation (SS-MDE) in dynamic scenes, in which a depth network with a multi-scale decoder and a lightweight pose network are designed to predict depth in a self-supervised manner from the disparity, the motion information, and the association between two adjacent frames in the image sequence. Meanwhile, to improve the depth estimation accuracy in static areas, pseudo-depth images generated by the LeReS network provide pseudo-supervision, strengthening depth refinement in those areas, and a forgetting factor is introduced to alleviate the dependency on this pseudo-supervision. In addition, a teacher model is introduced to generate depth priors, and a multi-view mask filter module is designed for feature extraction and noise filtering, enabling the student model to better learn the depth structure of dynamic scenes and enhancing the generalization and robustness of the whole model in a self-distillation manner. Finally, on four public datasets, the proposed SS-MDE method outperformed several state-of-the-art monocular depth estimation techniques, reaching an accuracy (δ1) of 89% with an absolute relative error (AbsRel) of 0.102 on NYU-Depth V2 and an accuracy (δ1) of 87% with an AbsRel of 0.111 on KITTI.
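To make the role of the forgetting factor concrete, the following is a minimal sketch of how such a training objective is commonly assembled; the exponential decay schedule and the symbols λ₀, γ, and μ are illustrative assumptions, not values taken from the paper:

$$
\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{photo}} + \lambda(t)\,\mathcal{L}_{\mathrm{pseudo}} + \mu\,\mathcal{L}_{\mathrm{distill}},
\qquad
\lambda(t) = \lambda_{0}\,\gamma^{t},\quad \gamma \in (0,1),
$$

where $\mathcal{L}_{\mathrm{photo}}$ is the self-supervised photometric loss between adjacent frames, $\mathcal{L}_{\mathrm{pseudo}}$ penalizes disagreement with the LeReS pseudo-depth, $\mathcal{L}_{\mathrm{distill}}$ aligns the student with the filtered teacher predictions, and $t$ is the training epoch, so the reliance on pseudo-supervision decays as training proceeds.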
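A compact PyTorch-style sketch of the same idea follows. The function names, the exponential schedule, and the variance-threshold mask are hypothetical stand-ins for the paper's forgetting factor and multi-view mask filter module, not its actual implementation:

```python
import torch

def forgetting_factor(epoch: int, lam0: float = 1.0, gamma: float = 0.9) -> float:
    """Illustrative exponential decay: the weight on the pseudo-depth loss
    shrinks each epoch so the model gradually stops relying on LeReS."""
    return lam0 * gamma ** epoch

def multi_view_mask(teacher_depths: torch.Tensor, tau: float = 0.05) -> torch.Tensor:
    """Hypothetical stand-in for the multi-view mask filter: keep pixels where
    teacher predictions from multiple views agree (low variance) and drop the
    rest as likely moving-object or occlusion noise.
    teacher_depths: (V, B, 1, H, W) depth maps from V views."""
    var = teacher_depths.var(dim=0)          # per-pixel variance across views
    return (var < tau).float()               # 1 = trusted pixel, 0 = filtered

def total_loss(photo_loss: torch.Tensor,
               student_depth: torch.Tensor,
               pseudo_depth: torch.Tensor,
               teacher_depths: torch.Tensor,
               epoch: int,
               mu: float = 0.1) -> torch.Tensor:
    """Combine the self-supervised, pseudo-supervised, and distillation terms."""
    lam = forgetting_factor(epoch)
    # Pseudo-supervision against the LeReS pseudo-depth, decayed over training.
    pseudo_loss = (student_depth - pseudo_depth).abs().mean()
    # Self-distillation against the teacher, restricted to trusted pixels.
    mask = multi_view_mask(teacher_depths)
    teacher_mean = teacher_depths.mean(dim=0)
    distill_loss = (mask * (student_depth - teacher_mean).abs()).sum() \
                   / mask.sum().clamp(min=1.0)
    return photo_loss + lam * pseudo_loss + mu * distill_loss
```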