Efficient spatio-temporal modeling for sign language recognition using CNN and RNN architectures.

Author information

Myagila Kasian, Nyambo Devotha Godfrey, Dida Mussa Ally

Affiliations

School of Computation and Communication Science and Engineering, The Nelson Mandela African Institution of Science and Technology, Arusha, Tanzania.

Faculty of Science and Technology, Mzumbe University, Morogoro, Tanzania.

Publication information

Front Artif Intell. 2025 Aug 25;8:1630743. doi: 10.3389/frai.2025.1630743. eCollection 2025.

DOI:10.3389/frai.2025.1630743

PMID:40927705

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC12415044/

Abstract

Computer vision has been identified as one way to bridge the communication barrier between speech-impaired people and those without impairment, since most people do not know the sign language used by speech-impaired individuals. Numerous studies have addressed this problem. However, recognizing word signs, which are usually dynamic and span more than one frame per sign, remains a challenge. This study used Tanzania Sign Language datasets collected with mobile phone selfie cameras to investigate the performance of deep learning algorithms that capture the spatial and temporal features of video frames. The study used CNN-LSTM and CNN-GRU architectures, and proposes a CNN-GRU with an ELU activation function to enhance learning efficiency and performance. The proposed CNN-GRU model with ELU activation achieved an accuracy of 94%, compared to 93% for both the standard CNN-GRU model and the CNN-LSTM model. The study also evaluated the proposed model in a signer-independent setting, where results varied significantly across individual signers and the highest accuracy reached 66%. These results show that more work is needed to improve signer-independent performance, including addressing the challenge of hand dominance by optimizing spatial features.
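
For readers unfamiliar with the activation named in the abstract, ELU (exponential linear unit) is commonly defined as f(x) = x for x > 0 and alpha * (exp(x) - 1) otherwise. The sketch below is a minimal pure-Python illustration of that standard definition next to ReLU. The function names and the default alpha = 1.0 are assumptions made for illustration; the abstract does not state the value used or how the activation is wired into the CNN-GRU model.

```python
import math


def elu(x: float, alpha: float = 1.0) -> float:
    """Standard ELU: identity for positive inputs, smooth saturation toward -alpha otherwise."""
    # alpha = 1.0 is an assumed default; the abstract does not give the value used.
    # math.expm1(x) computes exp(x) - 1 accurately for small |x|.
    return x if x > 0 else alpha * math.expm1(x)


def relu(x: float) -> float:
    """ReLU for comparison: negative inputs are clipped to zero."""
    return max(0.0, x)


if __name__ == "__main__":
    # Print both activations side by side for a few sample inputs.
    for value in (-3.0, -1.0, 0.0, 2.0):
        print(f"x={value:+.1f}  relu={relu(value):.4f}  elu={elu(value):.4f}")
```

Unlike ReLU, ELU returns a nonzero value for negative inputs, and its slope there (alpha * exp(x)) stays positive, so units with negative pre-activations still pass a gradient. Whether this accounts for the one-point accuracy difference reported above is not something the abstract establishes.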

