Xu Yan, Lin Kwan-Yee, Zhang Guofeng, Wang Xiaogang, Li Hongsheng
IEEE Trans Pattern Anal Mach Intell. 2024 Jul;46(7):4669-4683. doi: 10.1109/TPAMI.2024.3360181. Epub 2024 Jun 5.
6-DoF object pose estimation from a monocular image is a challenging problem, and a post-refinement step is generally needed to achieve high precision. In this paper, we propose a framework, dubbed RNNPose, based on a recurrent neural network (RNN) for object pose refinement, which is robust to erroneous initial poses and occlusions. During the recurrent iterations, object pose refinement is formulated as a non-linear least squares problem based on the correspondence field estimated between a rendered image and the observed image. The problem is then solved by a differentiable Levenberg-Marquardt (LM) algorithm, enabling end-to-end training. In each iteration, correspondence field estimation and pose refinement are conducted alternately to improve the object poses. Furthermore, to improve robustness against occlusion, we introduce a consistency-check mechanism based on the learned descriptors of the 3D model and the observed 2D images, which downweights unreliable correspondences during pose optimization. We evaluate RNNPose on several public datasets, including LINEMOD, Occlusion-LINEMOD, YCB-Video and TLESS, and demonstrate state-of-the-art performance and strong robustness against severe clutter and occlusion in the scenes. Extensive experiments validate the effectiveness of our proposed method. In addition, the extended system based on RNNPose successfully generalizes to multi-instance scenarios and achieves top-tier performance on the TLESS dataset.
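To make the optimization step concrete, here is a minimal sketch of a weighted Levenberg-Marquardt update on a toy problem. It is not RNNPose's implementation. It uses a 2D rigid pose (theta, tx, ty) instead of a 6-DoF pose, fixed point correspondences instead of a rendered-vs-observed correspondence field, and plain Python instead of a differentiable solver. The damping scheme (lam times the identity, divided by 10 on an accepted step and multiplied by 10 on a rejected one) and the weighting rule (cosine similarity of descriptors, clipped to [0, 1]) are assumptions standing in for the paper's consistency check. All function names (`cosine_weight`, `lm_refine`, etc.) are hypothetical.

```python
import math


def cosine_weight(desc_model, desc_image):
    """Hypothetical consistency weight: cosine similarity of two descriptors, clipped to [0, 1]."""
    dot = sum(a * b for a, b in zip(desc_model, desc_image))
    norm_m = math.sqrt(sum(a * a for a in desc_model))
    norm_i = math.sqrt(sum(b * b for b in desc_image))
    if norm_m == 0.0 or norm_i == 0.0:
        return 0.0
    return max(0.0, min(1.0, dot / (norm_m * norm_i)))


def transform(pose, pt):
    """Apply a 2D rigid pose (theta, tx, ty) to a point (toy stand-in for a 6-DoF pose)."""
    theta, tx, ty = pose
    c, s = math.cos(theta), math.sin(theta)
    return (c * pt[0] - s * pt[1] + tx, s * pt[0] + c * pt[1] + ty)


def weighted_cost(pose, src, dst, weights):
    """Sum of w_i * ||T(pose) x_i - y_i||^2 over all correspondences."""
    total = 0.0
    for x, y, w in zip(src, dst, weights):
        qx, qy = transform(pose, x)
        total += w * ((qx - y[0]) ** 2 + (qy - y[1]) ** 2)
    return total


def solve3(A, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, 3):
            f = M[r][col] / M[col][col]
            for k in range(col, 4):
                M[r][k] -= f * M[col][k]
    x = [0.0, 0.0, 0.0]
    for r in range(2, -1, -1):
        x[r] = (M[r][3] - sum(M[r][k] * x[k] for k in range(r + 1, 3))) / M[r][r]
    return x


def lm_refine(pose, src, dst, weights, iters=50, lam=1e-3):
    """Weighted Levenberg-Marquardt: solve (J^T W J + lam*I) delta = -J^T W r."""
    pose = list(pose)
    cost = weighted_cost(pose, src, dst, weights)
    for _ in range(iters):
        c, s = math.cos(pose[0]), math.sin(pose[0])
        H = [[0.0] * 3 for _ in range(3)]  # J^T W J (Gauss-Newton Hessian approximation)
        g = [0.0] * 3                      # J^T W r
        for x, y, w in zip(src, dst, weights):
            qx, qy = transform(pose, x)
            r = (qx - y[0], qy - y[1])
            # 2x3 Jacobian of the residual w.r.t. (theta, tx, ty)
            J = ((-s * x[0] - c * x[1], 1.0, 0.0),
                 (c * x[0] - s * x[1], 0.0, 1.0))
            for a in range(2):
                for i in range(3):
                    g[i] += w * J[a][i] * r[a]
                    for j in range(3):
                        H[i][j] += w * J[a][i] * J[a][j]
        A = [[H[i][j] + (lam if i == j else 0.0) for j in range(3)] for i in range(3)]
        delta = solve3(A, [-gi for gi in g])
        candidate = [p + d for p, d in zip(pose, delta)]
        new_cost = weighted_cost(candidate, src, dst, weights)
        if new_cost < cost:   # accept: behave more like Gauss-Newton
            pose, cost, lam = candidate, new_cost, lam / 10.0
        else:                 # reject: behave more like gradient descent
            lam *= 10.0
    return pose, cost


# Usage: four consistent correspondences plus one gross outlier.
true_pose = (0.3, 1.0, -2.0)
src = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0), (0.5, 0.5)]
dst = [transform(true_pose, x) for x in src]
dst[-1] = (dst[-1][0] + 10.0, dst[-1][1] + 10.0)  # corrupt the last match
desc_model = [(1.0, 0.0, 0.0)] * 5
desc_image = [(1.0, 0.0, 0.0)] * 4 + [(0.0, 1.0, 0.0)]  # outlier's descriptors disagree
weights = [cosine_weight(a, b) for a, b in zip(desc_model, desc_image)]

pose_w, _ = lm_refine((0.0, 0.0, 0.0), src, dst, weights)
pose_u, _ = lm_refine((0.0, 0.0, 0.0), src, dst, [1.0] * 5)
print("weighted:  ", [round(v, 4) for v in pose_w])
print("unweighted:", [round(v, 4) for v in pose_u])
```

With the descriptor-based weights, the outlier contributes nothing to J^T W J or J^T W r, so the solver recovers the true pose from the four consistent matches. With uniform weights, the single corrupted match pulls the translation away from the true pose. This is the kind of effect a consistency check that downweights unreliable correspondences is meant to suppress.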