Yu Sheng, Zhai Di-Hua, Guan Yuyin, Xia Yuanqing
IEEE Trans Neural Netw Learn Syst. 2025 Jan;36(1):1857-1871. doi: 10.1109/TNNLS.2023.3330011. Epub 2025 Jan 7.
Category-level 6-D object pose estimation plays a crucial role in achieving reliable robotic grasp detection. However, the disparity between synthetic and real datasets hinders the direct transfer of models trained on synthetic data to real-world scenarios, often yielding poor results. Additionally, creating large-scale real datasets is a time-consuming and labor-intensive task. To overcome these challenges, we propose CatDeform, a novel category-level object pose estimation network trained on synthetic data yet capable of delivering good performance on real datasets. In our approach, we introduce a transformer-based fusion module that enables the network to leverage multiple sources of information and enhance prediction accuracy through feature fusion. To deform the prior point cloud so that it properly aligns with scene objects, we propose a transformer-based attention module that deforms the prior point cloud from both geometric and feature perspectives. Building upon CatDeform, we design a two-branch network for supervised learning, bridging the gap between synthetic and real datasets and achieving high-precision pose estimation in real-world scenes using predominantly synthetic data supplemented with a small amount of real data. To minimize reliance on large-scale real datasets, we also train the network in a self-supervised manner, estimating object poses in real scenes from the synthetic dataset alone, without manual annotation. We conduct training and testing on the CAMERA25 and REAL275 datasets, and our experimental results demonstrate that the proposed method outperforms state-of-the-art (SOTA) techniques in both the self-supervised and supervised training paradigms. Finally, we apply CatDeform to object pose estimation and robotic grasp experiments in real-world scenarios, demonstrating a higher grasp success rate.
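The attention-based prior deformation described in the abstract can be illustrated with a minimal numpy sketch: each prior point queries the observed object's features via cross-attention, and the attention output defines a per-point offset that pulls the prior toward the observation. All names, dimensions, and the residual blending weight below are illustrative assumptions, not the paper's actual architecture, which uses learned transformer layers.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def deform_prior(prior_pts, prior_feat, obs_feat, obs_pts, alpha=0.5, seed=0):
    """Hypothetical cross-attention deformation of a categorical prior.

    prior_pts:  (Np, 3) prior point cloud for the category.
    prior_feat: (Np, d) per-point features of the prior.
    obs_feat:   (No, d) per-point features of the observed instance.
    obs_pts:    (No, 3) observed instance points.
    alpha:      residual blending weight (illustrative, not from the paper).
    """
    d = prior_feat.shape[1]
    rng = np.random.default_rng(seed)
    # Query/key projections; randomly initialized here in place of
    # learned transformer weights.
    Wq = rng.standard_normal((d, d)) / np.sqrt(d)
    Wk = rng.standard_normal((d, d)) / np.sqrt(d)
    Q, K = prior_feat @ Wq, obs_feat @ Wk
    attn = softmax(Q @ K.T / np.sqrt(d))      # (Np, No) attention weights
    target = attn @ obs_pts                   # attention-weighted locations
    # Move each prior point part-way toward its attended target.
    return prior_pts + alpha * (target - prior_pts)

# Example: deform a 64-point prior toward a 128-point observed instance.
rng = np.random.default_rng(1)
deformed = deform_prior(rng.standard_normal((64, 3)),
                        rng.standard_normal((64, 32)),
                        rng.standard_normal((128, 32)),
                        rng.standard_normal((128, 3)))
```

In the actual method, the offsets would be decoded by learned layers and trained end-to-end; this sketch only shows how cross-attention lets geometric and feature information jointly drive the deformation.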