Local pattern aware 3D video swin transformer with masked autoencoding for realtime augmented reality gesture interaction.

Author information

Wang Suli

Affiliations

Faculty of Data Science, City University of Macau, Taipa, 999078, Macau, China.

School of Computer Engineering, Guangzhou City University of Technology, Guangzhou, 510800, China.

Publication information

Sci Rep. 2025 Jul 1;15(1):21318. doi: 10.1038/s41598-025-05935-9.

DOI:10.1038/s41598-025-05935-9

PMID:40594635

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12218122/

Abstract

This study proposes a real-time augmented reality gesture interaction algorithm based on the Swin Transformer and a masked autoencoder. The algorithm addresses the difficulties that traditional Transformer models have with spatio-temporal feature extraction and real-time performance. During data preprocessing, a synthetic data annotation method automatically generates 3D gesture images and annotates joint information, significantly improving annotation efficiency. The paper also proposes an image denoising model based on maximum a posteriori (MAP) probability, optimized with weighted Euclidean distance and structural similarity, which effectively reduces noise interference in gesture image analysis. The gesture detection and segmentation module combines EfficientNet and Transformer models: it fuses shallow and deep features through skip connections to achieve multi-scale feature extraction, and it strengthens attention to the target area through a triplet attention module. Additionally, the paper introduces a local texture feature prior (RTHLBP) to improve gesture recognition and segmentation accuracy. For gesture classification, the paper proposes a ViT architecture based on a masked autoencoder, which aligns features at different levels through a dynamic weight fusion strategy and uses the relative total variation map as a self-supervised element, significantly improving classification performance. Experimental results show that the proposed model's accuracy, F1 score, and MIoU on four GTEA sub-datasets surpass those of traditional CNN, Transformer, MobileNet, and DenseNet models, particularly on small datasets. The paper also optimizes real-time performance through a multi-core parallel computing strategy; experiments show that as the number of DSP cores increases, computation time falls significantly while computational efficiency remains high.
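
The masked autoencoding idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function names (`patchify`, `random_patch_mask`, `masked_mse`), the 2D single-channel input, and the 0.75 mask ratio are all assumptions chosen for illustration. The sketch only shows the general pattern of hiding a random subset of patches and scoring a reconstruction on the hidden patches alone.

```python
import random


def patchify(image, patch):
    """Split a 2D list (H x W) into non-overlapping patch x patch blocks, row-major."""
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image size must be divisible by the patch size")
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append([row[left:left + patch] for row in image[top:top + patch]])
    return patches


def random_patch_mask(num_patches, mask_ratio=0.75, seed=None):
    """Return (visible_idx, masked_idx) for a random split of patch indices.

    mask_ratio=0.75 is an assumed value, not one taken from the abstract.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)
    num_masked = int(num_patches * mask_ratio)
    return sorted(order[num_masked:]), sorted(order[:num_masked])


def masked_mse(pred, target, masked_idx):
    """Mean squared error computed over the masked patches only."""
    total, count = 0.0, 0
    for i in masked_idx:
        for pred_row, target_row in zip(pred[i], target[i]):
            for p, t in zip(pred_row, target_row):
                total += (p - t) ** 2
                count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    # Hypothetical 8x8 "frame"; a real pipeline would use video tensors.
    image = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
    patches = patchify(image, 2)
    visible, masked = random_patch_mask(len(patches), mask_ratio=0.75, seed=0)
    print(len(patches), len(visible), len(masked))  # 16 4 12
```

In a full model, only the visible patches would be passed to the encoder, and a decoder would predict the masked ones; the loss above would then be applied to that prediction.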


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6c8a/12218122/07b2afc27241/41598_2025_5935_Fig1_HTML.jpg

Similar articles

A fake news detection model using the integration of multimodal attention mechanism and residual convolutional network.

Sci Rep. 2025 Jul 1;15(1):20544. doi: 10.1038/s41598-025-05702-w.

TLTNet: A novel transscale cascade layered transformer network for enhanced retinal blood vessel segmentation.

Comput Biol Med. 2024 Aug;178:108773. doi: 10.1016/j.compbiomed.2024.108773. Epub 2024 Jun 25.

A deep learning approach to direct immunofluorescence pattern recognition in autoimmune bullous diseases.

Br J Dermatol. 2024 Jul 16;191(2):261-266. doi: 10.1093/bjd/ljae142.

Leveraging a foundation model zoo for cell similarity search in oncological microscopy across devices.

Front Oncol. 2025 Jun 18;15:1480384. doi: 10.3389/fonc.2025.1480384. eCollection 2025.

Recognizing American Sign Language gestures efficiently and accurately using a hybrid transformer model.

Sci Rep. 2025 Jun 23;15(1):20253. doi: 10.1038/s41598-025-06344-8.

Gesture recognition for hearing impaired people using an ensemble of deep learning models with improving beluga whale optimization-based hyperparameter tuning.

Sci Rep. 2025 Jul 1;15(1):21441. doi: 10.1038/s41598-025-06680-9.

A dual-branch deep learning model based on fNIRS for assessing 3D visual fatigue.

Front Neurosci. 2025 Jun 5;19:1589152. doi: 10.3389/fnins.2025.1589152. eCollection 2025.

Unsupervised retinal image registration based on D-STUNet and progressive keypoint screening strategy.

Biomed Phys Eng Express. 2025 Jul 9;11(4). doi: 10.1088/2057-1976/ade9c6.

DGCFNet: Dual Global Context Fusion Network for remote sensing image semantic segmentation.

PeerJ Comput Sci. 2025 Mar 27;11:e2786. doi: 10.7717/peerj-cs.2786. eCollection 2025.
