Honda Research Institute Japan Co., Ltd., Wako-shi, Saitama 351-0188, Japan.
Faculty of Sciences and Engineering, Saarland University, 66123 Saarbrücken, Germany.
Sensors (Basel). 2020 Oct 1;20(19):5621. doi: 10.3390/s20195621.
Recognition systems for continuous utterances in signed languages have advanced considerably in recent years. However, research efforts often do not address specific linguistic features of signed languages, such as non-manual expressions. In this work, we evaluate the potential of a single-video-camera-based recognition system with respect to the latter. For this, we introduce a two-stage pipeline based on two-dimensional body joint positions extracted from RGB camera data. The system first separates the data stream of a signed expression into meaningful word segments using a frame-wise binary Random Forest. Each segment is then transformed into an image-like representation and classified with a Convolutional Neural Network. The proposed system is evaluated on a data set of continuous sentence expressions in Japanese Sign Language with varying non-manual expressions. Exploring multiple variations of data representations and network parameters, we distinguish word segments of specific non-manual intonations with 86% accuracy from the underlying body joint movement data. Full-sentence predictions achieve a total Word Error Rate of 15.75%, an improvement of 13.22% over ground-truth predictions obtained from labeling that is insensitive to non-manual content. Our analysis thus constitutes an important contribution towards a better understanding of mixed manual and non-manual content in signed communication.
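The two-stage pipeline described in the abstract (frame-wise binary Random Forest segmentation, followed by conversion of each segment into an image-like array for CNN classification) can be sketched roughly as below. This is a minimal illustration, not the authors' implementation: the joint count, feature layout, segment grouping, and fixed resampling size are all assumptions made for the demo.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical input layout: T frames, 14 two-dimensional body joints
# flattened to 28 features per frame (an assumption for this sketch).
N_JOINTS = 14
FEATS = 2 * N_JOINTS

def segment_frames(model, joint_seq):
    """Stage 1: predict a binary label per frame (1 = frame belongs to a
    word segment), then group consecutive positive frames into
    (start, end) index pairs."""
    labels = model.predict(joint_seq)
    segments, start = [], None
    for t, lab in enumerate(labels):
        if lab == 1 and start is None:
            start = t
        elif lab == 0 and start is not None:
            segments.append((start, t))
            start = None
    if start is not None:
        segments.append((start, len(labels)))
    return segments

def to_image(joint_seq, segment, size=32):
    """Stage 2 preprocessing: resample a variable-length segment to a
    fixed (size x FEATS) array, i.e. an image-like representation that a
    CNN classifier could consume."""
    start, end = segment
    seg = joint_seq[start:end]
    idx = np.linspace(0, len(seg) - 1, size).round().astype(int)
    return seg[idx]

# Toy demonstration with synthetic data standing in for joint trajectories.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, FEATS))
y = (X[:, 0] > 0).astype(int)  # stand-in label: "is this frame part of a sign?"
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

seq = rng.normal(size=(60, FEATS))
segs = segment_frames(rf, seq)
for s in segs:
    img = to_image(seq, s)
    # Each segment is now a fixed-size array ready for CNN classification.
    assert img.shape == (32, FEATS)
```

In the paper's pipeline the second stage would feed these fixed-size arrays to a trained Convolutional Neural Network; here the CNN is omitted and only the data flow of the two stages is shown.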