Department of Cybernetics and New Technologies for the Information Society, University of West Bohemia, Technická 8, 301 00 Pilsen, Czech Republic.
Gymnasium of Johannes Kepler, Parléřova 2/118, 169 00 Prague, Czech Republic.
Sensors (Basel). 2022 Jul 4;22(13):5043. doi: 10.3390/s22135043.
In this paper, we study sign language recognition, focusing on the recognition of isolated signs. The task is framed as a classification problem: a sequence of frames (i.e., images) is assigned to one of a given set of sign language glosses. We analyze two appearance-based approaches, I3D and TimeSformer, and one pose-based approach, SPOTER. The appearance-based approaches are trained on several data modalities, whereas SPOTER is evaluated with different types of preprocessing. All methods are tested on two publicly available datasets, AUTSL and WLASL300. We experiment with ensemble techniques and achieve a new state-of-the-art accuracy of 73.84% on the WLASL300 dataset by using the CMA-ES optimization method to find the best ensemble weights. Furthermore, we present an ensembling technique based on the Transformer model, which we call Neural Ensembler.
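As an illustration of the weighted-ensemble idea mentioned in the abstract, the following minimal sketch combines per-model class probabilities with scalar weights and searches for weights that maximize accuracy on held-out data. It is not the authors' implementation. The function names, the data layout (one probability vector per model per sample), and the search settings are all assumptions made for this example. To keep the sketch dependency-free, the search is a simple (1+1) evolution strategy used as a stand-in for CMA-ES; a real experiment would more likely call a dedicated CMA-ES library.

```python
import random


def ensemble_predict(probs_per_model, weights):
    """Return the class index with the highest weighted sum of model probabilities.

    probs_per_model: one probability vector per model (hypothetical layout).
    weights: one non-negative scalar per model.
    """
    num_classes = len(probs_per_model[0])
    combined = [
        sum(w * probs[c] for w, probs in zip(weights, probs_per_model))
        for c in range(num_classes)
    ]
    return max(range(num_classes), key=combined.__getitem__)


def ensemble_accuracy(dataset, weights):
    """Fraction of (probs_per_model, label) pairs the weighted ensemble gets right."""
    correct = sum(ensemble_predict(probs, weights) == label for probs, label in dataset)
    return correct / len(dataset)


def search_weights(dataset, num_models, iterations=200, sigma=0.3, seed=0):
    """Search ensemble weights with a (1+1) evolution strategy.

    Assumption: this simple strategy stands in for CMA-ES, which additionally
    adapts a full covariance matrix and step size during the search.
    """
    rng = random.Random(seed)
    best = [1.0 / num_models] * num_models  # start from uniform weights
    best_acc = ensemble_accuracy(dataset, best)
    for _ in range(iterations):
        # Perturb the current best weights with Gaussian noise, clipped at zero.
        candidate = [max(0.0, w + rng.gauss(0.0, sigma)) for w in best]
        if sum(candidate) == 0.0:
            continue  # skip the degenerate all-zero weight vector
        acc = ensemble_accuracy(dataset, candidate)
        if acc > best_acc:  # keep the candidate only if it strictly improves
            best, best_acc = candidate, acc
    return best, best_acc
```

In practice the weights would be tuned on a validation split and then applied unchanged to the test split, so that the reported accuracy is not inflated by the search itself.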