School of Computer Science and Technology, Shandong University, Qingdao, China; Qingdao Research Institute of Beihang University, Qingdao, China.
State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing, China; Qingdao Research Institute of Beihang University, Qingdao, China.
Neural Netw. 2024 Jun;174:106238. doi: 10.1016/j.neunet.2024.106238. Epub 2024 Mar 16.
Object pose estimation and camera localization are critical in many applications. However, achieving algorithmic universality, i.e., category-level pose estimation and scene-independent camera localization, remains challenging for both techniques. Although the two tasks are closely related through spatial geometric constraints, they require distinct feature extraction. This paper presents a unified RGB-D framework that simultaneously performs category-level object pose estimation and scene-independent camera localization. The framework consists of a pose estimation branch called SLO-ObjNet, a localization branch called SLO-LocNet, a pose confidence calculation process, and object-level optimization. First, initial camera and object results are obtained from SLO-LocNet and SLO-ObjNet. Within these two networks, we design three-level feature fusion modules and a joint loss function to enable feature sharing between the two tasks. The proposed approach then applies a confidence calculation process to assess the accuracy of the estimated object poses. Additionally, an object-level Bundle Adjustment (BA) optimization algorithm is used to further improve the precision of both techniques. The BA algorithm establishes relationships among feature points, objects, and cameras using camera-point, camera-object, and object-point metrics. To evaluate the performance of this approach, experiments are conducted on localization and pose estimation datasets including REAL275, CAMERA25, LineMOD, YCB-Video, 7 Scenes, ScanNet, and TUM RGB-D. The results show that the approach outperforms existing methods in both estimation and localization accuracy. In addition, SLO-LocNet and SLO-ObjNet are trained on ScanNet and tested on the 7 Scenes and TUM RGB-D datasets to demonstrate their universality.
Finally, we also highlight the positive contributions of the fusion modules, the loss function, the confidence calculation process, and the BA optimization to overall performance.
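The abstract does not give the exact objective of the object-level BA; as an illustration only, a joint cost with the three kinds of terms it names (camera-point, camera-object, and object-point) can be sketched as follows, where the notation is ours rather than the paper's: $T_c$ and $T_o$ are camera and object poses, $X_p$ are 3D feature points, $\pi(\cdot)$ is the camera projection, $u_{cp}$ is an observed keypoint, $\hat{T}_{co}$ is an estimated camera-to-object pose, $\hat{X}_{op}$ is a point in the object model frame, and $\rho$ is a robust loss:

```latex
\min_{\{T_c\},\,\{T_o\},\,\{X_p\}}
  \underbrace{\sum_{(c,p)} \rho\!\left(\bigl\| \pi(T_c X_p) - u_{cp} \bigr\|^2\right)}_{\text{camera--point}}
+ \underbrace{\sum_{(c,o)} \rho\!\left(\bigl\| \log\!\bigl(\hat{T}_{co}^{-1}\, T_c^{-1} T_o\bigr)^{\vee} \bigr\|^2\right)}_{\text{camera--object}}
+ \underbrace{\sum_{(o,p)} \rho\!\left(\bigl\| T_o^{-1} X_p - \hat{X}_{op} \bigr\|^2\right)}_{\text{object--point}}
```

Here $\log(\cdot)^{\vee}$ maps a relative $\mathrm{SE}(3)$ error to its 6-vector tangent representation; the paper's actual residual definitions may differ and are given in the full text.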