
Unsupervised monocular visual odometry via combining instance and RGB information.

Publication information

Appl Opt. 2022 May 1;61(13):3793-3803. doi: 10.1364/AO.452378.

Abstract

Unsupervised deep learning methods have made significant progress on monocular visual odometry (VO) tasks. However, owing to the complexity of real-world scenes, learning camera ego-motion from the RGB information of monocular images in an unsupervised way remains challenging. Existing methods mainly learn motion from raw RGB information and lack higher-level input from scene understanding. Hence, this paper proposes an unsupervised monocular VO framework, named combined-information-based VO (CI-VO), that combines instance and RGB information. The proposed method has two stages. First, instance maps of the monocular images are obtained without finetuning on the VO dataset. Then the combined information is formed from the two types of information and fed into the proposed combined-information-based pose estimation network, named CI-PoseNet, to estimate the relative pose of the camera. To make better use of the two types of information, we propose a fusion feature extraction network that extracts fused features from the combined information. Experiments on the KITTI odometry and KITTI raw datasets show that the proposed method performs well on camera pose estimation and exceeds existing mainstream methods.
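The abstract outlines a two-stage pipeline: instance maps are first extracted from the monocular frames, then combined with the RGB frames and passed through a fusion feature extractor and a pose regressor. The paper's exact architecture is not reproduced here, so the following is a minimal PyTorch sketch of that idea only; the module name FusionPoseNet, the channel sizes, the single-channel instance-map encoding, and the 6-DoF pose output are all illustrative assumptions, not the authors' CI-PoseNet.

```python
# Minimal sketch (not the authors' code) of fusing RGB and instance
# information for relative pose regression, assuming PyTorch.
import torch
import torch.nn as nn

class FusionPoseNet(nn.Module):
    """Toy stand-in for CI-PoseNet: encodes each modality separately,
    fuses the features, and regresses a 6-DoF relative pose
    (3 translation + 3 rotation parameters). All dimensions are
    hypothetical choices for illustration."""

    def __init__(self):
        super().__init__()
        # Shallow encoder for a stacked pair of RGB frames (6 channels).
        self.rgb_enc = nn.Sequential(
            nn.Conv2d(6, 32, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
        )
        # Shallow encoder for a stacked pair of instance maps (2 channels).
        self.inst_enc = nn.Sequential(
            nn.Conv2d(2, 32, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
        )
        # Fusion head: concatenate modality features, pool, regress pose.
        self.fuse = nn.Sequential(
            nn.Conv2d(128, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.pose = nn.Linear(128, 6)

    def forward(self, rgb_pair, inst_pair):
        # rgb_pair:  (B, 6, H, W)  two consecutive RGB frames, stacked
        # inst_pair: (B, 2, H, W)  the corresponding instance maps
        f = torch.cat([self.rgb_enc(rgb_pair), self.inst_enc(inst_pair)], dim=1)
        return self.pose(self.fuse(f).flatten(1))  # (B, 6) relative pose

# Usage with random tensors standing in for a KITTI-sized frame pair:
rgb = torch.randn(1, 6, 128, 416)
inst = torch.randn(1, 2, 128, 416)  # instance IDs rendered as a map
pose = FusionPoseNet()(rgb, inst)
print(pose.shape)  # torch.Size([1, 6])
```

In an unsupervised setting, such a pose output would typically be trained with a photometric reconstruction loss between warped frame pairs rather than ground-truth poses, which is consistent with the framework described above.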

