Multi-cue temporal modeling for skeleton-based sign language recognition.

Author information

Özdemir Oğulcan, Baytaş İnci M, Akarun Lale

Affiliation

Perceptual Intelligence Laboratory, Computer Engineering Department, Boğaziçi University, Istanbul, Türkiye.

Publication information

Front Neurosci. 2023 Apr 5;17:1148191. doi: 10.3389/fnins.2023.1148191. eCollection 2023.

Abstract

Sign languages are visual languages used as the primary communication medium by the Deaf community. Signs are produced with manual and non-manual articulators such as hand shapes, upper-body movements, and facial expressions. Sign Language Recognition (SLR) aims to learn spatial and temporal representations from sign videos. Most SLR studies focus on manual features, often extracted from the shape of the dominant hand or from the entire frame. However, facial expressions combined with hand and body gestures may also play a significant role in discriminating the context represented in sign videos. In this study, we propose an isolated SLR framework based on Spatial-Temporal Graph Convolutional Networks (ST-GCNs) and Multi-Cue Long Short-Term Memories (MC-LSTMs) to exploit multi-articulatory (e.g., body, hands, and face) information for recognizing sign glosses. We train an ST-GCN model to learn representations of the upper body and hands. Meanwhile, spatial embeddings of hand-shape and facial-expression cues are extracted with Convolutional Neural Networks (CNNs) pre-trained on large-scale hand and facial expression datasets. The proposed framework, coupling ST-GCNs with MC-LSTMs for multi-articulatory temporal modeling, can thus provide insights into the contribution of each visual Sign Language (SL) cue to recognition performance. To evaluate the framework, we conducted extensive analyses on two Turkish SL benchmark datasets with different linguistic properties, BosphorusSign22k and AUTSL. While we obtained recognition performance comparable to the skeleton-based state of the art, we observed that incorporating multiple visual SL cues improves recognition performance, especially for sign classes where multi-cue information is vital. The code is available at: https://github.com/ogulcanozdemir/multicue-slr.
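The authors' full implementation is available in the repository linked above. For orientation only, the multi-cue pipeline described in the abstract can be sketched roughly as follows: per-frame embeddings for each cue (skeleton features from an ST-GCN, plus hand-shape and facial-expression embeddings from pre-trained CNNs) are modeled by separate LSTMs whose outputs are fused for gloss classification. The module name, embedding dimensions, concatenation-based late fusion, and class count below are illustrative assumptions, not the paper's exact architecture.

```python
# Illustrative PyTorch sketch of multi-cue temporal modeling for isolated SLR.
# NOT the authors' implementation (see https://github.com/ogulcanozdemir/multicue-slr);
# module names, fusion strategy, and dimensions are assumptions for illustration.
import torch
import torch.nn as nn


class MultiCueLSTM(nn.Module):
    """One LSTM per visual cue; final hidden states are concatenated for gloss classification."""

    def __init__(self, cue_dims, num_classes, hidden_dim=256):
        super().__init__()
        # One temporal model per cue: skeleton features (e.g., from an ST-GCN),
        # hand-shape CNN embeddings, and facial-expression CNN embeddings.
        self.lstms = nn.ModuleList(
            [nn.LSTM(dim, hidden_dim, batch_first=True) for dim in cue_dims]
        )
        self.classifier = nn.Linear(hidden_dim * len(cue_dims), num_classes)

    def forward(self, cues):
        # cues: list of tensors, one per cue, each shaped (batch, time, cue_dim).
        finals = []
        for lstm, x in zip(self.lstms, cues):
            _, (h_n, _) = lstm(x)          # h_n: (1, batch, hidden_dim)
            finals.append(h_n.squeeze(0))  # last hidden state summarizes the cue over time
        fused = torch.cat(finals, dim=-1)  # late fusion by concatenation
        return self.classifier(fused)      # logits over sign glosses


# Toy usage with random per-frame embeddings (2 clips, 60 frames each).
skeleton = torch.randn(2, 60, 256)  # e.g., ST-GCN body/hand features
hands = torch.randn(2, 60, 512)     # hand-shape CNN embeddings
face = torch.randn(2, 60, 512)      # facial-expression CNN embeddings
model = MultiCueLSTM(cue_dims=[256, 512, 512], num_classes=100)  # gloss vocabulary size is arbitrary here
logits = model([skeleton, hands, face])
print(logits.shape)                 # torch.Size([2, 100])
```

Modeling each cue with its own LSTM and fusing late makes it straightforward to ablate individual cues, which matches the abstract's stated goal of quantifying each visual SL cue's contribution to recognition performance.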

Figure: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d42c/10113557/cf60e07ec25d/fnins-17-1148191-g0001.jpg
