IEEE Trans Pattern Anal Mach Intell. 2022 Jun;44(6):3212-3223. doi: 10.1109/TPAMI.2020.3047388. Epub 2022 May 5.
This paper addresses the problem of instance-level 6DoF object pose estimation from a single RGB image. Many recent works have shown that a two-stage approach, which first detects keypoints and then solves a Perspective-n-Point (PnP) problem for pose estimation, achieves remarkable performance. However, most of these methods only localize a set of sparse keypoints by regressing their image coordinates or heatmaps, which are sensitive to occlusion and truncation. Instead, we introduce a Pixel-wise Voting Network (PVNet) to regress pixel-wise vectors pointing to the keypoints and use these vectors to vote for keypoint locations. This creates a flexible representation for localizing occluded or truncated keypoints. Another important feature of this representation is that it provides uncertainties of keypoint locations that can be further leveraged by the PnP solver. Experiments show that the proposed approach outperforms the state of the art on the LINEMOD, Occluded LINEMOD, YCB-Video, and T-LESS datasets, while being efficient for real-time pose estimation. We further create a Truncated LINEMOD dataset to validate the robustness of our approach against truncation. The code is available at https://github.com/zju3dv/pvnet.
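To make the voting scheme concrete, below is a minimal NumPy sketch of RANSAC-style keypoint voting from a pixel-wise vector field, in the spirit of the approach described above. The function name vote_for_keypoint, the hypothesis count, and the inlier threshold are illustrative assumptions, not the authors' implementation; see the repository linked above for that.

```python
import numpy as np

def vote_for_keypoint(mask, vector_field, num_hypotheses=128,
                      inlier_thresh=0.99, rng=None):
    """Sketch of RANSAC-style voting (illustrative, not the official code).

    mask: (H, W) boolean object mask.
    vector_field: (H, W, 2) per-pixel unit vectors pointing at one keypoint.
    Returns the voted keypoint location and its covariance.
    """
    rng = rng or np.random.default_rng()
    ys, xs = np.nonzero(mask)
    pixels = np.stack([xs, ys], axis=1).astype(np.float64)        # (N, 2)
    dirs = vector_field[ys, xs]                                   # (N, 2)
    dirs = dirs / (np.linalg.norm(dirs, axis=1, keepdims=True) + 1e-8)

    # Generate hypotheses by intersecting the rays of random pixel pairs:
    # p_i + t * d_i = p_j + s * d_j.
    hypotheses = []
    for _ in range(num_hypotheses):
        i, j = rng.choice(len(pixels), size=2, replace=False)
        A = np.stack([dirs[i], -dirs[j]], axis=1)                 # (2, 2)
        b = pixels[j] - pixels[i]
        try:
            t, _ = np.linalg.solve(A, b)
        except np.linalg.LinAlgError:
            continue                                              # parallel rays
        hypotheses.append(pixels[i] + t * dirs[i])
    hypotheses = np.asarray(hypotheses)                           # (M, 2)

    # Score each hypothesis by how many pixels' vectors point at it
    # (cosine similarity above a threshold counts as an inlier vote).
    to_hyp = hypotheses[:, None, :] - pixels[None, :, :]          # (M, N, 2)
    to_hyp /= np.linalg.norm(to_hyp, axis=2, keepdims=True) + 1e-8
    votes = ((to_hyp * dirs[None]).sum(axis=2) > inlier_thresh)   # (M, N)
    weights = votes.sum(axis=1).astype(np.float64)

    # Weighted mean and covariance of the hypotheses give the keypoint
    # estimate together with a measure of its spatial uncertainty.
    w = weights / (weights.sum() + 1e-8)
    mean = (w[:, None] * hypotheses).sum(axis=0)
    diff = hypotheses - mean
    cov = (w[:, None, None] * diff[:, :, None] * diff[:, None, :]).sum(axis=0)
    return mean, cov
```

The weighted covariance summarizes how concentrated the votes are for each keypoint; an uncertainty-driven PnP solver can then weight each keypoint's reprojection residual by the inverse covariance (a Mahalanobis distance), so that confidently localized keypoints dominate the pose estimate while ambiguous ones contribute less.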