Fusion of Multimodal Spatio-Temporal Features and 3D Deformable Convolution Based on Sign Language Recognition in Sensor Networks.

Authors

Zhou Qian, Li Hui, Meng Weizhi, Dai Hua, Zhou Tianyu, Zheng Guineng

Affiliations

School of Computer Science, Nanjing University of Posts and Telecommunications, 9 Wenyuan Road, Nanjing 210023, China.

School of Computing and Communications, Lancaster University, Lancaster LA1 4YR, UK.

Publication Information

Sensors (Basel). 2025 Jul 13;25(14):4378. doi: 10.3390/s25144378.

Abstract

Sign language is a complex and dynamic visual language that requires the coordinated movement of various body parts, such as the hands, arms, and limbs, making it an ideal application domain for sensor networks to capture and interpret human gestures accurately. To address the intricate task of precise and efficient sign language recognition (SLR) from raw videos, this study introduces a novel deep learning approach by devising a multimodal framework for SLR. Specifically, feature extraction models are built on two modalities: skeleton and RGB images. In this paper, we first propose a Multi-Stream Spatio-Temporal Graph Convolutional Network (MSGCN) that relies on three modules: a decoupling graph convolutional network, a self-emphasizing temporal convolutional network, and a spatio-temporal joint attention module. These modules are combined to capture the spatio-temporal information in multi-stream skeleton features. Second, we propose a 3D ResNet model based on deformable convolution (D-ResNet) to model complex spatial and temporal sequences in the raw images. Finally, a gating mechanism-based Multi-Stream Fusion Module (MFM) is employed to merge the results of the two modalities. Extensive experiments are conducted on the public datasets AUTSL and WLASL, achieving competitive results compared to state-of-the-art systems.
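The abstract describes the MFM as a gating mechanism that merges the skeleton-stream and RGB-stream outputs, but does not give its exact formulation. A common realization of such a gate, sketched below as an assumption rather than the paper's actual design, computes a sigmoid gate from the concatenated modality features and blends the two streams as a per-dimension convex combination; the function names and weight shapes here are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(feat_skeleton, feat_rgb, w, b):
    """Fuse two modality feature vectors with a learned sigmoid gate.

    gate = sigmoid(W [f_s; f_r] + b)
    output = gate * f_s + (1 - gate) * f_r
    Each output dimension is a convex combination of the two modalities,
    so the gate decides per-feature how much to trust each stream.
    """
    concat = np.concatenate([feat_skeleton, feat_rgb])
    gate = sigmoid(w @ concat + b)  # one gate value per feature dimension
    return gate * feat_skeleton + (1.0 - gate) * feat_rgb

# Toy usage with random features and weights (stand-ins for learned values).
rng = np.random.default_rng(0)
d = 4
f_s = rng.standard_normal(d)          # skeleton-stream feature (e.g., MSGCN output)
f_r = rng.standard_normal(d)          # RGB-stream feature (e.g., D-ResNet output)
w = rng.standard_normal((d, 2 * d))   # gate projection over the concatenation
b = np.zeros(d)
fused = gated_fusion(f_s, f_r, w, b)
```

Because the gate lies strictly in (0, 1), every fused coordinate stays between the corresponding skeleton and RGB values, which keeps the fusion stable regardless of how confident either stream is.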

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/66ed/12298227/06c2f8abdef3/sensors-25-04378-g001.jpg
