Zhao Mingle, Zhou Dingfu, Song Xibin, Chen Xiuwan, Zhang Liangjun
Institute of Remote Sensing and Geographic Information System, Peking University, Beijing 100871, China.
Robotics and Autonomous Driving Laboratory, Baidu Research, Beijing 100085, China.
Sensors (Basel). 2022 Apr 28;22(9):3389. doi: 10.3390/s22093389.
Recently, generating dense maps in real time has become a hot research topic in the mobile robotics community, since dense maps provide more informative and continuous features than sparse maps. Implicit depth representations (e.g., the depth code) derived from deep neural networks have been employed in visual-only and visual-inertial simultaneous localization and mapping (SLAM) systems, achieving promising performance on both camera motion and local dense geometry estimation from monocular images. However, existing visual-inertial SLAM systems combined with depth codes are either built on a filter-based SLAM framework, which can only update poses and maps within a relatively small local time window, or on a loosely-coupled framework in which the prior geometric constraints from the depth estimation network are not exploited to boost state estimation. To address these drawbacks, we propose DiT-SLAM, a novel real-time dense visual-inertial SLAM with implicit depth representation and tightly-coupled graph optimization. Most importantly, the poses, sparse maps, and low-dimensional depth codes are optimized with a tightly-coupled graph that considers the visual, inertial, and depth residuals simultaneously. Meanwhile, we propose a light-weight monocular depth estimation and completion network, which combines attention mechanisms with a conditional variational auto-encoder (CVAE) to predict uncertainty-aware dense depth maps from lower-dimensional codes. Furthermore, a robust point sampling strategy that accounts for the spatial distribution of 2D feature points is proposed to provide geometric constraints in the tightly-coupled optimization, especially in textureless or featureless indoor environments. We evaluate our system on open benchmarks. The proposed methods achieve better performance on both dense depth estimation and trajectory estimation compared to the baseline and other systems.
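The tightly-coupled optimization described in the abstract jointly refines camera poses and a low-dimensional depth code by minimizing visual, inertial, and depth residuals in one least-squares problem. The toy sketch below illustrates only this coupling structure; the state dimensions, the linear stand-in for the depth decoder, the residual weights, and all variable names are hypothetical and are not the authors' implementation:

```python
import numpy as np
from scipy.optimize import least_squares

# Toy state: a 3-DoF pose (x, y, yaw) stacked with an 8-D latent depth code.
POSE_DIM, CODE_DIM = 3, 8
rng = np.random.default_rng(0)

pose_meas = np.array([1.0, 2.0, 0.1])      # hypothetical visual pose measurement
imu_pred = np.array([0.98, 2.02, 0.11])    # hypothetical preintegrated IMU prediction
sparse_depths = rng.uniform(1.0, 5.0, 16)  # depths of tracked sparse feature points
A = rng.normal(size=(16, CODE_DIM))        # linear stand-in for the depth decoder

def residuals(state):
    """Stack visual, inertial, and depth residuals into one vector."""
    pose, code = state[:POSE_DIM], state[POSE_DIM:]
    r_vis = pose - pose_meas          # visual residual: pose vs. visual measurement
    r_imu = pose - imu_pred           # inertial residual: pose vs. IMU prediction
    r_dep = A @ code - sparse_depths  # depth residual: decoded vs. sparse depths
    return np.concatenate([r_vis, r_imu, 0.1 * r_dep])

# Jointly optimize pose and depth code from a zero initial guess.
x0 = np.zeros(POSE_DIM + CODE_DIM)
sol = least_squares(residuals, x0)
pose_opt, code_opt = sol.x[:POSE_DIM], sol.x[POSE_DIM:]
```

With equal weights on the visual and inertial terms, the optimized pose settles between the two pose measurements, while the code term fits the decoded depths to the sparse points; in the real system the decoder is a deep network and the residuals are weighted by their estimated uncertainties.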