State Key Laboratory of Robotics and System, Harbin Institute of Technology, 92 Xidazhi Street, Harbin 150006, China.
MFIN, Faculty of Business and Economics, The University of Hong Kong, Pokfulam Road, Hong Kong 999077, China.
Sensors (Basel). 2020 Dec 4;20(23):6943. doi: 10.3390/s20236943.
A traditional CNN for 6D robot relocalization outputs pose estimates but gives no indication of whether the model is making sensible predictions or just guessing at random. We found that convnet representations trained on classification problems generalize well to other tasks. Thus, we propose a multi-task CNN for robot relocalization which simultaneously performs pose regression and scene recognition. Scene recognition determines whether the input image belongs to the scene in which the robot is currently located, not only reducing the relocalization error but also indicating how much confidence we can place in the prediction. We also found that pose accuracy degrades when there is a large visual difference between the testing and training images. Based on this, we present the dual-level image-similarity strategy (DLISS), which consists of two levels: an initial level and an iteration level. The initial level clusters the feature vectors of the training set and extracts feature vectors from the testing images. The iteration level, a particle swarm optimization (PSO)-based image-block selection algorithm, builds on the initial level to select the testing images most similar to the training images, yielding higher pose accuracy on the testing set. Our method considers both the accuracy and the robustness of relocalization, and it operates indoors and outdoors in real time, taking at most 27 ms per frame. Finally, we evaluated our method on the Microsoft 7Scenes dataset and the Cambridge Landmarks dataset, obtaining approximately 0.33 m and 7.51° accuracy on 7Scenes and approximately 1.44 m and 4.83° accuracy on Cambridge Landmarks. Compared with PoseNet, our CNN reduces the average positional error by 25% and the average angular error by 27.79% on 7Scenes, and reduces the average positional error by 40% and the average angular error by 28.55% on Cambridge Landmarks. We show that our multi-task CNN can localize from high-level features and is robust to images that do not belong to the current scene. Furthermore, the multi-task CNN achieves higher relocalization accuracy when using the testing images selected by DLISS.
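As a rough illustration of the multi-task architecture the abstract describes, here is a minimal PyTorch sketch with a shared backbone feeding a pose-regression head (3-D position plus a unit quaternion) and a scene-recognition head. The ResNet-50 backbone, head sizes, scene count, and loss weights `beta` and `gamma` are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a multi-task relocalization CNN: a shared backbone,
# a 6-DoF pose head (3-D position + 4-D quaternion), and a scene head.
import torch
import torch.nn as nn
from torchvision import models

class MultiTaskRelocNet(nn.Module):
    def __init__(self, num_scenes: int, feat_dim: int = 2048):
        super().__init__()
        backbone = models.resnet50(weights=None)  # stand-in for the paper's backbone
        self.features = nn.Sequential(*list(backbone.children())[:-1])
        self.fc_xyz = nn.Linear(feat_dim, 3)            # position regression
        self.fc_quat = nn.Linear(feat_dim, 4)           # orientation (quaternion)
        self.fc_scene = nn.Linear(feat_dim, num_scenes)  # scene recognition

    def forward(self, x):
        f = self.features(x).flatten(1)
        xyz = self.fc_xyz(f)
        quat = nn.functional.normalize(self.fc_quat(f), dim=1)  # unit quaternion
        scene_logits = self.fc_scene(f)
        return xyz, quat, scene_logits

def multi_task_loss(xyz, quat, scene_logits, gt_xyz, gt_quat, gt_scene,
                    beta: float = 500.0, gamma: float = 1.0):
    # PoseNet-style weighted pose loss plus a cross-entropy scene term;
    # beta and gamma are hypothetical weights.
    pos = nn.functional.mse_loss(xyz, gt_xyz)
    rot = nn.functional.mse_loss(quat, gt_quat)
    cls = nn.functional.cross_entropy(scene_logits, gt_scene)
    return pos + beta * rot + gamma * cls
```

At test time, the softmax confidence of the scene head can serve as the gate on whether the regressed pose should be trusted at all, which is the interpretability benefit the abstract claims over a pure pose regressor.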
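The initial level of DLISS can be sketched, under simplifying assumptions, as k-means clustering over the training-set feature vectors followed by a similarity score for each testing image; the cluster count and the use of cosine similarity to the nearest centroid are assumptions made here for illustration.

```python
# Sketch of the DLISS initial level: cluster training features, then
# score a test image by cosine similarity to the nearest centroid.
# Higher-scoring test images are closer to the training distribution.
import numpy as np
from sklearn.cluster import KMeans

def cluster_training_features(train_feats: np.ndarray, n_clusters: int = 8) -> np.ndarray:
    # train_feats: (N, D) feature vectors extracted from training images
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(train_feats)
    return km.cluster_centers_

def similarity_to_training(test_feat: np.ndarray, centers: np.ndarray) -> float:
    # Cosine similarity between one test feature and its nearest centroid.
    a = test_feat / np.linalg.norm(test_feat)
    b = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    return float(np.max(b @ a))
```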
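The iteration level (the PSO-based image-block selection) might then look like the following sketch, which reuses `similarity_to_training` from the block above: each particle encodes the top-left corner of a fixed-size block in a testing image, and its fitness is the block's similarity to the training clusters. The swarm size, inertia and acceleration coefficients, block size, and the feature extractor `feat_fn` are all assumptions, not the paper's settings.

```python
# Sketch of the iteration level: standard PSO over block positions in a
# test image; fitness = similarity of the block's features to training
# clusters. Assumes the image is at least as large as the block.
def pso_select_block(image, feat_fn, centers, block=(224, 224),
                     n_particles=20, n_iters=30, w=0.7, c1=1.5, c2=1.5, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    H, W = image.shape[:2]
    bounds = np.array([H - block[0], W - block[1]], dtype=float)
    pos = rng.uniform(0, bounds, size=(n_particles, 2))
    vel = np.zeros_like(pos)

    def fitness(p):
        r, c = int(p[0]), int(p[1])
        patch = image[r:r + block[0], c:c + block[1]]
        return similarity_to_training(feat_fn(patch), centers)

    pbest = pos.copy()
    pbest_val = np.array([fitness(p) for p in pos])
    g = pbest[np.argmax(pbest_val)].copy()
    g_val = pbest_val.max()

    for _ in range(n_iters):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (g - pos)
        pos = np.clip(pos + vel, 0, bounds)
        vals = np.array([fitness(p) for p in pos])
        improved = vals > pbest_val
        pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
        if vals.max() > g_val:
            g, g_val = pos[np.argmax(vals)].copy(), vals.max()
    return tuple(g.astype(int)), g_val  # best block corner and its similarity
```

Ranking testing images by the returned similarity, and keeping the most similar ones, mirrors the abstract's claim that pose accuracy improves on the testing images DLISS selects.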