Visual Computing Lab (VCL), Information Technologies Institute (ITI), Centre for Research and Technology Hellas (CERTH), 57001 Thessaloniki, Greece.
Sensors (Basel). 2021 Apr 1;21(7):2437. doi: 10.3390/s21072437.
Continuous sign language recognition is a weakly supervised task: continuous sign gestures must be identified from video sequences without any prior knowledge of the temporal boundaries between consecutive signs. Most existing methods focus mainly on extracting spatio-temporal visual features without exploiting text or contextual information to further improve recognition accuracy. Moreover, the ability of deep generative models to effectively model data distributions has not yet been investigated in the field of sign language recognition. To this end, a novel approach for context-aware continuous sign language recognition using a generative adversarial network architecture, named the Sign Language Recognition Generative Adversarial Network (SLRGAN), is introduced. The proposed architecture consists of a generator, which recognizes sign language glosses by extracting spatial and temporal features from video sequences, and a discriminator, which evaluates the quality of the generator's predictions by modeling text information at the sentence and gloss levels. The paper also investigates the importance of contextual information in sign language conversations, for both Deaf-to-Deaf and Deaf-to-hearing communication. Contextual information, in the form of hidden states extracted from the previous sentence, is fed into the bidirectional long short-term memory module of the generator to improve the recognition accuracy of the network. At the final stage, sign language translation is performed by a transformer network, which converts sign language glosses to natural language text. Our proposed method achieved word error rates of 23.4%, 2.1%, and 2.26% on the RWTH-Phoenix-Weather-2014, Chinese Sign Language (CSL), and Greek Sign Language (GSL) Signer Independent (SI) datasets, respectively.
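The reported results are word error rates (WER), the standard metric for continuous sign language recognition: the minimum number of gloss substitutions, deletions, and insertions needed to turn the predicted gloss sequence into the reference, divided by the reference length. A minimal sketch of this computation (the gloss sentences below are made-up illustrative examples, not samples from the datasets):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate over gloss tokens:
    (substitutions + deletions + insertions) / len(reference),
    computed as a Levenshtein distance via dynamic programming."""
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference glosses
    # and the first j hypothesis glosses
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # delete all remaining reference glosses
    for j in range(len(h) + 1):
        d[0][j] = j  # insert all hypothesis glosses
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # match / substitution
    return d[len(r)][len(h)] / len(r)

# One substitution (NORD -> SUED) and one deletion (STARK)
# against a 5-gloss reference gives WER = 2/5 = 0.4
print(wer("HEUTE NORD REGEN STARK WIND", "HEUTE SUED REGEN WIND"))
```

A WER of 23.4% on RWTH-Phoenix-Weather-2014 therefore means that, on average, roughly one in four reference glosses requires an edit to match the prediction.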