Yu Shaoqi, Wang Yintong, Chen Lili, Zhang Xiaolin, Li Jiamao
Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences, Shanghai, China.
University of Chinese Academy of Sciences, Beijing, China.
Front Neurorobot. 2024 May 3;18:1395652. doi: 10.3389/fnbot.2024.1395652. eCollection 2024.
In Human-Robot Interaction (HRI), accurate 3D hand pose and mesh estimation are of critical importance. However, inferring reasonable and accurate poses under severe self-occlusion and high self-similarity remains an inherent challenge. To alleviate the ambiguity caused by invisible and similar joints during HRI, we propose a new topology-aware Transformer network named HandGCNFormer that takes a depth image as input, incorporating prior knowledge of hand kinematic topology into the network while modeling long-range contextual information. Specifically, we propose a novel Graphformer decoder with an additional Node-offset Graph Convolutional layer (NoffGConv). The Graphformer decoder optimizes the synergy between the Transformer and the GCN, capturing both long-range dependencies and local topological connections between joints. On top of that, we replace the standard MLP prediction head with a novel Topology-aware head to better exploit local topological constraints for more reasonable and accurate poses. Our method achieves state-of-the-art 3D hand pose estimation performance on four challenging datasets: Hands2017, NYU, ICVL, and MSRA. To further demonstrate the effectiveness and scalability of the proposed Graphformer decoder and Topology-aware head, we extend our framework to HandGCNFormer-Mesh for the 3D hand mesh estimation task. The extended framework efficiently integrates a shape regressor with the original Graphformer decoder and Topology-aware head, producing MANO parameters. Results on the HO-3D dataset, which contains diverse and challenging occlusions, show that HandGCNFormer-Mesh achieves competitive results compared with previous state-of-the-art 3D hand mesh estimation methods.
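To make the described architecture concrete, the following is a minimal, hedged sketch (not the authors' code) of a decoder block that interleaves Transformer self-attention over joint tokens with a graph convolution over a fixed hand-skeleton adjacency, roughly in the spirit of the Graphformer decoder summarized above. All class names, layer sizes, and the 21-joint kinematic tree below are illustrative assumptions rather than details taken from the paper.

```python
# Hedged sketch: attention for long-range joint dependencies + GCN for local
# topological connections. Names, sizes, and skeleton are assumptions.
import torch
import torch.nn as nn

NUM_JOINTS = 21  # assumed MANO-style joint count

# Parent index of each joint (wrist = root); an assumed kinematic tree.
PARENTS = [-1, 0, 1, 2, 3, 0, 5, 6, 7, 0, 9, 10, 11, 0, 13, 14, 15, 0, 17, 18, 19]

def hand_adjacency(parents=PARENTS):
    """Symmetric, self-looped, row-normalized adjacency of the hand skeleton."""
    n = len(parents)
    a = torch.eye(n)
    for child, parent in enumerate(parents):
        if parent >= 0:
            a[child, parent] = a[parent, child] = 1.0
    return a / a.sum(dim=-1, keepdim=True)

class GraphConv(nn.Module):
    """Plain GCN layer: aggregate features of skeleton neighbors, then project."""
    def __init__(self, dim, adj):
        super().__init__()
        self.register_buffer("adj", adj)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                      # x: (batch, joints, channels)
        return self.proj(self.adj @ x)

class GraphformerBlock(nn.Module):
    """Self-attention over joint tokens, followed by a skeleton GCN and an MLP."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gcn = GraphConv(dim, hand_adjacency())
        self.mlp = nn.Sequential(nn.Linear(dim, 2 * dim), nn.GELU(), nn.Linear(2 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, tokens):                 # tokens: (batch, joints, channels)
        h = tokens + self.attn(self.norm1(tokens), self.norm1(tokens),
                               self.norm1(tokens), need_weights=False)[0]  # long-range context
        h = h + self.gcn(self.norm2(h))                                    # local topology
        return h + self.mlp(self.norm3(h))

if __name__ == "__main__":
    block = GraphformerBlock()
    out = block(torch.randn(2, NUM_JOINTS, 128))
    print(out.shape)  # torch.Size([2, 21, 128])
```

The design choice illustrated here is the one highlighted in the abstract: attention supplies global, data-driven joint interactions, while the fixed skeleton adjacency injects the kinematic prior, so occluded or self-similar joints can borrow evidence from both distant and physically connected neighbors.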