Sui Xin, Gao Song, Xu Aigong, Zhang Cong, Wang Changqiang, Shi Zhengxu
School of Geomatics, Liaoning Technical University, Fuxin, 123000, China.
Sci Rep. 2024 Sep 28;14(1):22472. doi: 10.1038/s41598-024-72682-8.
Existing depth estimation networks often sacrifice computational efficiency in pursuit of high accuracy. This paper proposes a lightweight self-supervised network that combines convolutional neural networks (CNNs) and Transformers as the feature extraction and encoding layers, enabling the network to capture both local geometric and global semantic features for depth estimation. First, depthwise-separable convolutions are used to build a dilated-convolution residual module in the shallow layers, enlarging the receptive field of the shallow CNN features. In the Transformer, a multi-head transposed attention module built from depthwise-separable convolutions is proposed to reduce the computational burden of spatial self-attention. In the feedforward network, a two-step gating mechanism is proposed to strengthen its nonlinear representation ability. Finally, the CNN and Transformer are integrated into a depth estimation network with local-global context interaction. Compared with other lightweight models, the proposed model has fewer parameters and higher estimation accuracy, and it generalizes better across different outdoor datasets. In addition, its inference speed reaches 87 FPS, delivering strong real-time performance while balancing inference speed and estimation accuracy.
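To illustrate the three components named in the abstract, below is a minimal PyTorch sketch assuming a channel-wise (transposed) attention formulation and a split-and-gate feedforward layout; the module names, channel sizes, dilation rate, and the exact two-step gating arrangement are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch only: layer layouts and hyperparameters are assumed, not taken
# from the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DilatedDSConvResidual(nn.Module):
    """Shallow-stage residual block built from a dilated depthwise-separable convolution,
    enlarging the receptive field of early CNN features (assumed layout)."""

    def __init__(self, dim: int, dilation: int = 2):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, kernel_size=3, padding=dilation,
                            dilation=dilation, groups=dim)  # depthwise, dilated
        self.pw = nn.Conv2d(dim, dim, kernel_size=1)         # pointwise
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.act(self.pw(self.dw(x)))


class TransposedAttention(nn.Module):
    """Multi-head attention computed across channels (a C x C affinity) instead of
    spatial positions, so cost scales with channels rather than H*W; depthwise
    convolutions produce the per-head q/k/v maps (assumed formulation)."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.num_heads = num_heads
        self.temperature = nn.Parameter(torch.ones(num_heads, 1, 1))
        self.qkv = nn.Conv2d(dim, dim * 3, kernel_size=1)
        self.qkv_dw = nn.Conv2d(dim * 3, dim * 3, kernel_size=3,
                                padding=1, groups=dim * 3)
        self.project_out = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q, k, v = self.qkv_dw(self.qkv(x)).chunk(3, dim=1)
        # reshape to (batch, heads, channels_per_head, spatial)
        q = q.reshape(b, self.num_heads, c // self.num_heads, h * w)
        k = k.reshape(b, self.num_heads, c // self.num_heads, h * w)
        v = v.reshape(b, self.num_heads, c // self.num_heads, h * w)
        q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.temperature  # channel-channel affinity
        out = attn.softmax(dim=-1) @ v
        return self.project_out(out.reshape(b, c, h, w))


class TwoStepGatedFFN(nn.Module):
    """Feedforward block whose expanded features are split into two branches and
    gated twice; the second gate is a guess at the 'two-step' mechanism."""

    def __init__(self, dim: int, expansion: int = 2):
        super().__init__()
        hidden = dim * expansion
        self.project_in = nn.Conv2d(dim, hidden * 2, kernel_size=1)
        self.dwconv = nn.Conv2d(hidden * 2, hidden * 2, kernel_size=3,
                                padding=1, groups=hidden * 2)
        self.project_out = nn.Conv2d(hidden, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = self.dwconv(self.project_in(x)).chunk(2, dim=1)
        gated = F.gelu(x1) * x2              # step 1: gate branch 2 with branch 1
        gated = gated * torch.sigmoid(x2)    # step 2: second gating pass (assumed)
        return self.project_out(gated)
```

As a usage note, these blocks would sit in an encoder where the dilated residual modules handle the shallow stages and the attention/FFN pair forms the Transformer stages, matching the local-global interaction the abstract describes.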