Gloss Prior Guided Visual Feature Learning for Continuous Sign Language Recognition

Guo Leming, Xue Wanli, Liu Bo, Zhang Kaihua, Yuan Tiantian, Metaxas Dimitris

IEEE Trans Image Process. 2024;33:3486-3495. doi: 10.1109/TIP.2024.3404869. Epub 2024 Jun 4.

Continuous sign language recognition (CSLR) aims to recognize the glosses in a sign language video. Enhancing the generalization ability of CSLR's visual feature extractor is a worthy area of investigation. In this paper, we model glosses as priors that help to learn more generalizable visual features. Specifically, the signer-invariant gloss feature is extracted by a pre-trained gloss BERT model. We then design a gloss prior guidance network (GPGN), which contains a novel parallel densely-connected temporal feature extraction (PDC-TFE) module for multi-resolution visual feature extraction. The PDC-TFE captures the complex temporal patterns of the glosses. The pre-trained gloss feature guides the visual feature learning through a cross-modality matching loss. We formulate the cross-modality feature matching as a regularized optimal transport problem, which can be efficiently solved by a variant of the Sinkhorn algorithm. The GPGN parameters are learned by optimizing a weighted sum of the cross-modality matching loss and the connectionist temporal classification (CTC) loss. Experimental results on German and Chinese sign language benchmarks demonstrate that the proposed GPGN achieves competitive performance. The ablation study verifies the effectiveness of several critical components of the GPGN. Furthermore, the proposed pre-trained gloss BERT model and cross-modality matching can be seamlessly integrated into other RGB-cue-based CSLR methods as plug-and-play components to enhance the generalization ability of the visual feature extractor.
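The abstract names the optimization ingredients but gives no formulas, so the following is only a minimal sketch of how a cross-modality matching loss could be posed as entropy-regularized optimal transport and combined with a CTC loss. It uses the textbook Sinkhorn iteration, not the unspecified variant the authors refer to. The cosine cost, the uniform marginals, the regularization strength `eps`, the loss `weight`, and all function names are assumptions made for illustration and are not taken from the paper.

```python
import math


def cosine_cost(visual, gloss):
    """Cost matrix C[i][j] = 1 - cosine similarity (an assumed cost choice)."""
    def norm(vec):
        return math.sqrt(sum(x * x for x in vec)) or 1.0  # guard zero vectors
    return [[1.0 - sum(a * b for a, b in zip(v, g)) / (norm(v) * norm(g))
             for g in gloss] for v in visual]


def sinkhorn(cost, eps=0.1, n_iters=200):
    """Standard Sinkhorn iteration for entropy-regularized optimal transport.

    Assumes uniform mass on both sides; returns the transport plan.
    """
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # mass on each visual feature (assumed uniform)
    b = [1.0 / m] * m  # mass on each gloss feature (assumed uniform)
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def matching_loss(visual, gloss, eps=0.1):
    """Transport cost <plan, cost> between visual and gloss features."""
    cost = cosine_cost(visual, gloss)
    plan = sinkhorn(cost, eps)
    return sum(p * c for p_row, c_row in zip(plan, cost)
               for p, c in zip(p_row, c_row))


def total_loss(ctc_loss, match_loss, weight=1.0):
    """Weighted sum of the two losses; the weight is a hypothetical hyperparameter."""
    return ctc_loss + weight * match_loss


# Toy example: four frame-level visual features, two gloss features.
visual = [[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]]
gloss = [[1.0, 0.0], [0.0, 1.0]]
plan = sinkhorn(cosine_cost(visual, gloss))
loss = matching_loss(visual, gloss)
```

In this toy case the plan sends nearly all of the mass of the first two frames to the first gloss and of the last two frames to the second, so the matching loss is close to zero. In a real training setup the same computation would be written with differentiable tensor operations so that gradients reach the visual feature extractor.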
