Han Pengcheng, He Xin, Matsumaru Takafumi, Dutta Vibekananda
Graduate School of Information, Production and Systems, Waseda University, Kitakyushu 808-0135, Japan.
Institute of Micromechanics and Photonics, Faculty of Mechatronics, Warsaw University of Technology, 00-661 Warszawa, Poland.
Sensors (Basel). 2025 Jan 24;25(3):702. doi: 10.3390/s25030702.
Manually crafted features are often subjective, insufficiently accurate, or lacking in robustness for recognition. Meanwhile, existing deep learning methods often overlook the structural and dynamic characteristics of the human hand and fail to fully exploit the contextual information of joints in both the spatial and temporal domains. To effectively capture dependencies between hand joints that are not adjacent but may have potential connections, it is essential to learn long-range relationships. This study proposes ST-KT, a skeleton-based hand gesture recognition framework that combines a spatio-temporal graph convolutional network with a Transformer built on the Kolmogorov-Arnold Network (KAN). It incorporates spatio-temporal graph convolutional network (ST-GCN) modules and a spatio-temporal Transformer module with KAN (KAN-Transformer). The ST-GCN modules, each comprising a spatial graph convolutional network (SGCN) and a temporal convolutional network (TCN), extract primary features from skeleton sequences by leveraging the strength of graph convolutions in the spatio-temporal domain. A spatio-temporal position embedding method integrates node features, enriching the representations with node identities and temporal information. The Transformer layer includes a spatial KAN-Transformer (S-KT) and a temporal KAN-Transformer (T-KT), which further extract joint features by learning edge weights and node embeddings, providing richer feature representations and the capacity for nonlinear modeling. We evaluated our method on two challenging skeleton-based dynamic gesture datasets, achieving an accuracy of 97.5% on the SHREC'17 track dataset and 94.3% on the DHG-14/28 dataset. These results demonstrate that ST-KT effectively captures dynamic skeleton changes and complex joint relationships.
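The abstract does not give implementation details, so the following is only a minimal PyTorch sketch of the kind of ST-GCN block it describes: a spatial graph convolution (SGCN) over a fixed hand-joint adjacency, followed by a temporal convolution (TCN) along the frame axis. The 21-joint hand skeleton, channel sizes, and kernel width are illustrative assumptions, not the paper's configuration.

```python
# Minimal ST-GCN block sketch (assumptions: 21 hand joints, row-normalized
# fixed adjacency, 1x1 spatial channel mixing, 9-frame temporal kernel).
import torch
import torch.nn as nn


class STGCNBlock(nn.Module):
    def __init__(self, in_channels: int, out_channels: int,
                 adjacency: torch.Tensor, temporal_kernel: int = 9):
        super().__init__()
        # Fixed, row-normalized adjacency of the hand skeleton graph.
        self.register_buffer("A", adjacency / adjacency.sum(dim=1, keepdim=True))
        # Spatial graph convolution: 1x1 conv mixes channels, A mixes joints.
        self.spatial = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        # Temporal convolution: conv over frames, applied per joint.
        pad = (temporal_kernel - 1) // 2
        self.temporal = nn.Conv2d(out_channels, out_channels,
                                  kernel_size=(temporal_kernel, 1),
                                  padding=(pad, 0))
        self.bn = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, joints)
        x = self.spatial(x)                            # channel mixing
        x = torch.einsum("nctv,vw->nctw", x, self.A)   # joint (graph) mixing
        x = self.relu(self.bn(self.temporal(x)))       # temporal mixing
        return x


# Toy usage: a 21-joint hand skeleton over 32 frames with 3D coordinates.
if __name__ == "__main__":
    A = torch.eye(21) + torch.rand(21, 21).round()  # placeholder adjacency
    block = STGCNBlock(in_channels=3, out_channels=64, adjacency=A)
    clip = torch.randn(8, 3, 32, 21)  # (batch, xyz, frames, joints)
    print(block(clip).shape)          # torch.Size([8, 64, 32, 21])
```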
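Similarly, a simplified sketch of a KAN-style layer, of the kind the S-KT/T-KT modules could stack in place of a Transformer MLP. Each input-output edge carries a learnable univariate function; here each input coordinate is expanded over a fixed Gaussian basis and mixed by a learnable linear map. The Gaussian basis is a stand-in for the B-splines of the reference KAN formulation, and the grid range and size are assumed values.

```python
# Simplified KAN-style layer sketch (assumptions: shared Gaussian basis
# instead of B-splines, SiLU residual base path, arbitrary grid settings).
import torch
import torch.nn as nn


class SimpleKANLayer(nn.Module):
    def __init__(self, in_features: int, out_features: int,
                 num_basis: int = 8, grid_range: float = 2.0):
        super().__init__()
        # Fixed centers of the Gaussian basis, shared across all edges.
        self.register_buffer("centers",
                             torch.linspace(-grid_range, grid_range, num_basis))
        self.width = 2 * grid_range / (num_basis - 1)
        # Learnable coefficients per (input, output) edge; together they
        # define a learnable univariate function on each edge.
        self.coeff = nn.Linear(in_features * num_basis, out_features)
        # Residual "base" path, as in the reference KAN formulation.
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., in_features); expand each scalar over the Gaussian basis.
        phi = torch.exp(-((x.unsqueeze(-1) - self.centers) / self.width) ** 2)
        phi = phi.flatten(start_dim=-2)   # (..., in_features * num_basis)
        return self.coeff(phi) + self.base(self.act(x))


# Toy usage: per-joint token features from a gesture sequence.
if __name__ == "__main__":
    layer = SimpleKANLayer(in_features=64, out_features=64)
    tokens = torch.randn(8, 21, 64)  # (batch, joints, embed_dim)
    print(layer(tokens).shape)       # torch.Size([8, 21, 64])
```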