Wang Suli
Faculty of Data Science, City University of Macau, Taipa, 999078, Macau, China.
School of Computer Engineering, Guangzhou City University of Technology, Guangzhou, 510800, China.
Sci Rep. 2025 Jul 1;15(1):21318. doi: 10.1038/s41598-025-05935-9.
This study proposes a real-time augmented reality gesture interaction algorithm based on the Swin Transformer and a masked autoencoder. The algorithm addresses the limitations of traditional Transformer models in spatio-temporal feature extraction and real-time performance. During data preprocessing, the study uses a synthetic data annotation method to automatically generate 3D gesture images and annotate joint information, significantly improving annotation efficiency. Building on weighted Euclidean distance and structural similarity optimization, the paper proposes an image denoising model based on maximum a posteriori (MAP) estimation that effectively reduces noise interference in gesture image analysis. The gesture detection and segmentation module combines EfficientNet and Transformer models: it fuses shallow and deep features through skip connections, achieves multi-scale feature extraction, and strengthens attention on the target region through a triplet attention module. In addition, the paper introduces a local texture feature prior (RTHLBP) to improve gesture recognition and segmentation accuracy. For gesture classification, the paper proposes a ViT architecture built on a masked autoencoder: it aligns features at different levels through a dynamic weight fusion strategy and incorporates the relative total variation (RTV) map as a self-supervision signal, which significantly improves classification performance. Experimental results show that the proposed model's accuracy, F1 score, and MIoU on the four GTEA sub-datasets surpass those of traditional CNN, Transformer, MobileNet, and DenseNet models, with the advantage most pronounced on small datasets. The paper also optimizes real-time performance through a multi-core parallel computing strategy: experiments show that as the number of DSP cores increases, computation time drops significantly while computational efficiency remains high.
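To make the MAP denoising step concrete, the following is a minimal PyTorch sketch of an objective of the kind the abstract describes: a pixel-weighted Euclidean data term plus a structural-similarity prior, minimized by gradient descent. The weight map `weights`, the smoothed reference image, the trade-off parameter `lam`, and the simplified single-window SSIM are all illustrative assumptions, not the paper's exact formulation.

```python
import torch

def ssim_global(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    # Simplified SSIM computed over the whole image (no sliding window).
    mx, my = x.mean(), y.mean()
    vx = ((x - mx) ** 2).mean()
    vy = ((y - my) ** 2).mean()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def map_denoise(noisy, weights, reference, lam=0.1, steps=200, lr=0.05):
    """MAP-style estimate: weighted Euclidean fidelity to the noisy image
    plus an SSIM prior pulling toward a smoothed reference."""
    x = noisy.clone().requires_grad_(True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        fidelity = (weights * (x - noisy) ** 2).mean()  # weighted Euclidean term
        prior = 1.0 - ssim_global(x, reference)         # structural-similarity term
        loss = fidelity + lam * prior
        opt.zero_grad()
        loss.backward()
        opt.step()
    return x.detach()

# Example: denoise a random image, using a blurred copy as the SSIM reference.
noisy = torch.rand(1, 1, 64, 64)
weights = torch.ones_like(noisy)  # per-pixel confidence map (illustrative)
reference = torch.nn.functional.avg_pool2d(noisy, 3, stride=1, padding=1)
clean = map_denoise(noisy, weights, reference)
```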
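The triplet attention module named in the detection/segmentation description follows a published rotate-to-attend design, in which three branches gate the tensor along the (C,H), (C,W), and (H,W) orientations and the results are averaged. A compact sketch under that assumption:

```python
import torch
import torch.nn as nn

class ZPool(nn.Module):
    # Concatenate max- and mean-pooling along the channel axis.
    def forward(self, x):
        return torch.cat([x.max(dim=1, keepdim=True)[0],
                          x.mean(dim=1, keepdim=True)], dim=1)

class AttentionGate(nn.Module):
    def __init__(self, k=7):
        super().__init__()
        self.pool = ZPool()
        self.conv = nn.Sequential(
            nn.Conv2d(2, 1, k, padding=k // 2, bias=False),
            nn.BatchNorm2d(1),
        )
    def forward(self, x):
        # Sigmoid spatial gate computed from pooled channel statistics.
        return x * torch.sigmoid(self.conv(self.pool(x)))

class TripletAttention(nn.Module):
    """Three branches capture (C,H), (C,W), and (H,W) interactions by
    permuting the tensor so each pair faces the gate, then averaging."""
    def __init__(self):
        super().__init__()
        self.ch_h = AttentionGate()
        self.ch_w = AttentionGate()
        self.hw = AttentionGate()
    def forward(self, x):
        # Branch 1: channels interact with height.
        x1 = self.ch_h(x.permute(0, 2, 1, 3)).permute(0, 2, 1, 3)
        # Branch 2: channels interact with width.
        x2 = self.ch_w(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)
        # Branch 3: plain spatial attention over (H, W).
        x3 = self.hw(x)
        return (x1 + x2 + x3) / 3.0

out = TripletAttention()(torch.randn(2, 64, 32, 32))  # shape preserved
```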
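For the classification module, the following is a minimal sketch of masked-autoencoder-style input masking for a ViT encoder: patchify the image, drop a random fraction of patches, and encode only the visible ones. The hyperparameters (patch size, depth, 75% mask ratio) are placeholders, not the paper's settings.

```python
import torch
import torch.nn as nn

class MaskedPatchEncoder(nn.Module):
    """Patchify an image, randomly mask a fraction of patches, and encode
    only the visible patches with a small Transformer (MAE-style)."""
    def __init__(self, img=224, patch=16, dim=256, depth=4, heads=8,
                 mask_ratio=0.75):
        super().__init__()
        self.n = (img // patch) ** 2
        self.mask_ratio = mask_ratio
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, self.n, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, x):
        tokens = self.embed(x).flatten(2).transpose(1, 2) + self.pos  # (B, N, D)
        B, N, D = tokens.shape
        keep = int(N * (1 - self.mask_ratio))
        # Per-sample random shuffle; the first `keep` indices stay visible.
        idx = torch.rand(B, N, device=x.device).argsort(dim=1)[:, :keep]
        visible = tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, D))
        return self.encoder(visible), idx  # latents + visible-patch indices

latents, idx = MaskedPatchEncoder()(torch.randn(2, 3, 224, 224))
```

A full masked autoencoder additionally inserts mask tokens and a lightweight decoder to reconstruct the hidden patches; only the encoder side relevant to downstream classification is sketched here.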
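The dynamic weight fusion strategy is described only at a high level in the abstract; one plausible reading, sketched below as an assumption rather than the paper's mechanism, is a set of learnable softmax weights over feature maps from different levels.

```python
import torch
import torch.nn as nn

class DynamicWeightFusion(nn.Module):
    """Fuse feature maps from several levels with learnable softmax weights,
    letting the network decide how much each level contributes."""
    def __init__(self, num_levels):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_levels))

    def forward(self, feats):
        # feats: list of tensors already aligned to a common shape.
        w = torch.softmax(self.logits, dim=0)
        return sum(wi * f for wi, f in zip(w, feats))

# Example: fuse three aligned feature maps.
fuse = DynamicWeightFusion(3)
feats = [torch.randn(2, 256, 14, 14) for _ in range(3)]
out = fuse(feats)  # (2, 256, 14, 14)
```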