School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin 300350, China.
The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China.
Neural Netw. 2024 Nov;179:106587. doi: 10.1016/j.neunet.2024.106587. Epub 2024 Jul 30.
Continuous Sign Language Recognition (CSLR) is the task of converting a sign language video into a gloss sequence. Existing deep-learning-based sign language recognition methods usually rely on large-scale training data and rich supervision. However, current sign language datasets are limited, and they are annotated only at the sentence level rather than the frame level. This inadequate supervision poses a serious challenge for sign language recognition and may leave recognition models insufficiently trained. To address these problems, we propose a cross-modal knowledge distillation method for continuous sign language recognition that contains two teacher models and one student model. The first teacher is the Sign2Text dialogue teacher model, which takes a sign language video and a dialogue sentence as input and outputs the sign language recognition result. The second is the Text2Gloss translation teacher model, which aims to translate a text sentence into a gloss sequence. Both teacher models provide information-rich soft labels to assist the training of the student model, a general sign language recognition model. We conduct extensive experiments on several commonly used sign language datasets, i.e., PHOENIX 2014T, CSL-Daily, and QSL. The results show that the proposed cross-modal knowledge distillation method effectively improves sign language recognition accuracy by transferring multi-modal information from the teacher models to the student model. Code is available at https://github.com/glq-1992/cross-modal-knowledge-distillation_new.
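The abstract does not spell out the training objective, but the core idea of distilling soft labels from two teachers into one student can be illustrated with a minimal PyTorch-style sketch. Everything below is an assumption for illustration: the function name, the equal-weight blending of the two teachers, the temperature-scaled KL formulation (Hinton et al., 2015), and the premise that both teachers' gloss-vocabulary logits have been aligned to the student's output length; the paper's actual loss and alignment strategy may differ.

```python
import torch
import torch.nn.functional as F

def cross_modal_kd_loss(student_logits, sign2text_logits, text2gloss_logits,
                        temperature=2.0, alpha=0.5):
    """Hypothetical sketch: blend soft-label KD terms from two teachers.

    All logits are (batch, seq_len, gloss_vocab). We assume the Sign2Text
    and Text2Gloss teacher outputs have already been aligned to the
    student's output length, which the abstract does not guarantee.
    """
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd_terms = []
    for teacher_logits in (sign2text_logits, text2gloss_logits):
        # Teachers provide "information-rich soft labels": softened
        # distributions over the gloss vocabulary.
        p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
        # KL(teacher || student), scaled by T^2 as in standard distillation.
        kd = F.kl_div(log_p_student, p_teacher,
                      reduction="batchmean") * temperature ** 2
        kd_terms.append(kd)
    # Equal alpha/(1 - alpha) weighting of the two teachers is an assumption.
    return alpha * kd_terms[0] + (1 - alpha) * kd_terms[1]
```

In practice, a KD term like this would presumably be added to the student's standard recognition loss (e.g., a CTC loss against the sentence-level gloss annotation), so the student learns from both the ground-truth glosses and the teachers' soft labels.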