


A Novel Approach for Visual Speech Recognition Using the Partition-Time Masking and Swin Transformer 3D Convolutional Model.

Author Information

Zhang Xiangliang, Hu Yu, Liu Xiangzhi, Gu Yu, Li Tong, Yin Jibin, Liu Tao

Affiliations

The State Key Laboratory of Fluid Power and Mechatronic Systems, School of Mechanical Engineering, Zhejiang University, Hangzhou 310027, China.

Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China.

Publication Information

Sensors (Basel). 2025 Apr 8;25(8):2366. doi: 10.3390/s25082366.

DOI:10.3390/s25082366
PMID:40285055
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12031098/
Abstract

Visual speech recognition is a technology that relies on visual information, offering unique advantages in noisy environments or when communicating with individuals with speech impairments. However, this technology still faces challenges, such as limited generalization ability due to different speech habits, high recognition error rates caused by confusable phonemes, and difficulties adapting to complex lighting conditions and facial occlusions. This paper proposes a lip reading data augmentation method-Partition-Time Masking (PTM)-to address these challenges and improve lip reading models' performance and generalization ability. Applying nonlinear transformations to the training data enhances the model's generalization ability when handling diverse speakers and environmental conditions. A lip-reading recognition model architecture, Swin Transformer and 3D Convolution (ST3D), was designed to overcome the limitations of traditional lip-reading models that use ResNet-based front-end feature extraction networks. By adopting a strategy that combines Swin Transformer and 3D convolution, the proposed model enhances performance. To validate the effectiveness of the Partition-Time Masking data augmentation method, experiments were conducted on the LRW video dataset using the DC-TCN model, achieving a peak accuracy of 92.15%. The ST3D model was validated on the LRW and LRW1000 video datasets, achieving a maximum accuracy of 56.1% on the LRW1000 dataset and 91.8% on the LRW dataset, outperforming current mainstream lip reading models and demonstrating superior performance on challenging easily confused samples.
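The abstract describes Partition-Time Masking only at a high level. Below is a minimal NumPy sketch of one plausible reading, assuming the clip's time axis is split into equal partitions and a random contiguous span of frames is zeroed inside each partition; the function name, parameter names, and defaults are illustrative and not taken from the paper.

```python
import numpy as np

def partition_time_masking(frames, num_partitions=3, max_mask_len=5, rng=None):
    """Augment a video clip of shape (T, H, W) by splitting the time axis
    into `num_partitions` contiguous partitions and zero-masking a random
    span of up to `max_mask_len` frames inside each partition."""
    if rng is None:
        rng = np.random.default_rng()
    out = frames.copy()
    T = frames.shape[0]
    # Partition boundaries along the time axis, e.g. T=29, 3 partitions -> [0, 9, 19, 29]
    bounds = np.linspace(0, T, num_partitions + 1, dtype=int)
    for start, end in zip(bounds[:-1], bounds[1:]):
        span = end - start
        if span == 0:
            continue
        # Mask length drawn uniformly from 0..min(max_mask_len, span)
        mask_len = int(rng.integers(0, min(max_mask_len, span) + 1))
        if mask_len == 0:
            continue
        # Random placement of the masked span within this partition
        m0 = int(rng.integers(start, end - mask_len + 1))
        out[m0:m0 + mask_len] = 0.0
    return out
```

Compared with masking a single span over the whole clip, per-partition masking guarantees the dropped frames are spread across the utterance, which is one way such an augmentation could force the model to rely on temporally distributed lip cues rather than any single segment.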


Figures

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/54d7/12031098/331fd1fc6d39/sensors-25-02366-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/54d7/12031098/9441b0169d35/sensors-25-02366-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/54d7/12031098/327322ca6678/sensors-25-02366-g003.jpg

Similar Articles

1
A Novel Approach for Visual Speech Recognition Using the Partition-Time Masking and Swin Transformer 3D Convolutional Model.
Sensors (Basel). 2025 Apr 8;25(8):2366. doi: 10.3390/s25082366.
2
Enhanced Pneumonia Detection in Chest X-Rays Using Hybrid Convolutional and Vision Transformer Networks.
Curr Med Imaging. 2025;21:e15734056326685. doi: 10.2174/0115734056326685250101113959.
3
End-to-End Lip-Reading Open Cloud-Based Speech Architecture.
Sensors (Basel). 2022 Apr 12;22(8):2938. doi: 10.3390/s22082938.
4
Prompt Tuning of Deep Neural Networks for Speaker-Adaptive Visual Speech Recognition.
IEEE Trans Pattern Anal Mach Intell. 2025 Feb;47(2):1042-1055. doi: 10.1109/TPAMI.2024.3484658. Epub 2025 Jan 9.
5
Small object detection algorithm incorporating swin transformer for tea buds.
PLoS One. 2024 Mar 21;19(3):e0299902. doi: 10.1371/journal.pone.0299902. eCollection 2024.
6
Swin-HSTPS: Research on Target Detection Algorithms for Multi-Source High-Resolution Remote Sensing Images.
Sensors (Basel). 2021 Dec 4;21(23):8113. doi: 10.3390/s21238113.
7
A deep learning approach to direct immunofluorescence pattern recognition in autoimmune bullous diseases.
Br J Dermatol. 2024 Jul 16;191(2):261-266. doi: 10.1093/bjd/ljae142.
8
Speech Emotion Recognition Using Convolution Neural Networks and Multi-Head Convolutional Transformer.
Sensors (Basel). 2023 Jul 7;23(13):6212. doi: 10.3390/s23136212.
9
MelTrans: Mel-Spectrogram Relationship-Learning for Speech Emotion Recognition via Transformers.
Sensors (Basel). 2024 Aug 25;24(17):5506. doi: 10.3390/s24175506.
10
Temporal-based Swin Transformer network for workflow recognition of surgical video.
Int J Comput Assist Radiol Surg. 2023 Jan;18(1):139-147. doi: 10.1007/s11548-022-02785-y. Epub 2022 Nov 4.

Cited By

1
Automated UPDRS Gait Scoring Using Wearable Sensor Fusion and Deep Learning.
Bioengineering (Basel). 2025 Jun 24;12(7):686. doi: 10.3390/bioengineering12070686.

References

1
EEG-based emotion recognition with autoencoder feature fusion and MSC-TimesNet model.
Comput Methods Biomech Biomed Engin. 2025 Mar 17:1-18. doi: 10.1080/10255842.2025.2477801.
2
Lipreading Architecture Based on Multiple Convolutional Neural Networks for Sentence-Level Visual Speech Recognition.
Sensors (Basel). 2021 Dec 23;22(1):72. doi: 10.3390/s22010072.
3
Speech rehabilitation in post-stroke aphasia using visual illustration of speech articulators: A case report study.
Clin Linguist Phon. 2021 Mar 4;35(3):253-276. doi: 10.1080/02699206.2020.1780473. Epub 2020 Jun 22.
4
Multimodal Speech Capture System for Speech Rehabilitation and Learning.
IEEE Trans Biomed Eng. 2017 Nov;64(11):2639-2649. doi: 10.1109/TBME.2017.2654361. Epub 2017 Jan 18.
5
Audiovisual integration and lipreading abilities of older adults with normal and impaired hearing.
Ear Hear. 2007 Sep;28(5):656-68. doi: 10.1097/AUD.0b013e31812f7185.
6
An audio-visual corpus for speech perception and automatic speech recognition.
J Acoust Soc Am. 2006 Nov;120(5 Pt 1):2421-4. doi: 10.1121/1.2229005.