Zhang Xiangliang, Hu Yu, Liu Xiangzhi, Gu Yu, Li Tong, Yin Jibin, Liu Tao
The State Key Laboratory of Fluid Power and Mechatronic Systems, School of Mechanical Engineering, Zhejiang University, Hangzhou 310027, China.
Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China.
Sensors (Basel). 2025 Apr 8;25(8):2366. doi: 10.3390/s25082366.
Visual speech recognition is a technology that relies on visual information, offering unique advantages in noisy environments or when communicating with individuals with speech impairments. However, this technology still faces challenges, such as limited generalization ability due to different speech habits, high recognition error rates caused by confusable phonemes, and difficulties adapting to complex lighting conditions and facial occlusions. This paper proposes a lip-reading data augmentation method, Partition-Time Masking (PTM), to address these challenges and improve lip-reading models' performance and generalization ability. Applying nonlinear transformations to the training data enhances the model's generalization ability when handling diverse speakers and environmental conditions. A lip-reading recognition model architecture, Swin Transformer and 3D Convolution (ST3D), was designed to overcome the limitations of traditional lip-reading models that use ResNet-based front-end feature extraction networks. By adopting a strategy that combines the Swin Transformer with 3D convolution, the proposed model enhances performance. To validate the effectiveness of the Partition-Time Masking data augmentation method, experiments were conducted on the LRW video dataset using the DC-TCN model, achieving a peak accuracy of 92.15%. The ST3D model was validated on the LRW and LRW1000 video datasets, achieving a maximum accuracy of 56.1% on the LRW1000 dataset and 91.8% on the LRW dataset, outperforming current mainstream lip-reading models and demonstrating superior performance on challenging, easily confused samples.
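The abstract does not spell out how Partition-Time Masking operates, but the name suggests a SpecAugment-style temporal masking applied per partition of the frame sequence. The following is a minimal sketch under that assumption: the video's time axis is split into equal partitions and a short random span inside each partition is zeroed out, spreading masked regions across the whole utterance. The function name, partition count, and maximum mask length are illustrative choices, not details from the paper.

```python
import numpy as np

def partition_time_masking(frames, num_partitions=4, max_mask=3, rng=None):
    """Hypothetical sketch of Partition-Time Masking (PTM).

    frames: array of shape (T, H, W) or (T, H, W, C), one lip-region
    frame per time step. The time axis is split into `num_partitions`
    equal segments, and within each segment a random span of up to
    `max_mask` frames is replaced with zeros.
    """
    rng = rng or np.random.default_rng()
    frames = frames.copy()  # do not mutate the caller's data
    T = frames.shape[0]
    # Partition boundaries along the time axis, e.g. T=29, 4 partitions.
    bounds = np.linspace(0, T, num_partitions + 1, dtype=int)
    for start, end in zip(bounds[:-1], bounds[1:]):
        seg_len = end - start
        if seg_len == 0:
            continue
        # Mask length 0..max_mask (0 means this partition is untouched).
        mask_len = int(rng.integers(0, min(max_mask, seg_len) + 1))
        if mask_len == 0:
            continue
        # Random start of the masked span inside this partition.
        mask_start = start + int(rng.integers(0, seg_len - mask_len + 1))
        frames[mask_start:mask_start + mask_len] = 0.0
    return frames
```

Masking inside every partition, rather than once globally, guarantees the occlusions are distributed over the utterance, which plausibly matches the generalization motivation stated in the abstract.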