Afouras Triantafyllos, Chung Joon Son, Senior Andrew, Vinyals Oriol, Zisserman Andrew
IEEE Trans Pattern Anal Mach Intell. 2022 Dec;44(12):8717-8727. doi: 10.1109/TPAMI.2018.2889052. Epub 2022 Nov 7.
The goal of this work is to recognise phrases and sentences being spoken by a talking face, with or without the audio. Unlike previous works that have focussed on recognising a limited number of words or phrases, we tackle lip reading as an open-world problem - unconstrained natural language sentences, and in-the-wild videos. Our key contributions are: (1) we compare two models for lip reading, one using a CTC loss, and the other using a sequence-to-sequence loss. Both models are built on top of the transformer self-attention architecture; (2) we investigate to what extent lip reading is complementary to audio speech recognition, especially when the audio signal is noisy; (3) we introduce and publicly release a new dataset for audio-visual speech recognition, LRS2-BBC, consisting of thousands of natural sentences from British television. The models that we train surpass the performance of all previous work on a lip reading benchmark dataset by a significant margin.
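For orientation, the sketch below contrasts the two training objectives named in contribution (1): a transformer encoder with a CTC output head versus a transformer encoder-decoder trained with a sequence-to-sequence (cross-entropy) loss. It is a minimal PyTorch illustration, not the authors' released code; the feature dimension, vocabulary size, and layer counts are assumed placeholders.

    # Hedged sketch of the two decoding strategies compared in the paper.
    # All hyperparameters (feat_dim, vocab_size, d_model, layer counts) are
    # illustrative placeholders, not the paper's configuration.
    import torch
    import torch.nn as nn

    class CTCLipReader(nn.Module):
        """Transformer encoder over per-frame visual features + CTC projection."""
        def __init__(self, feat_dim=512, vocab_size=40, d_model=512, nhead=8, num_layers=6):
            super().__init__()
            self.proj_in = nn.Linear(feat_dim, d_model)
            layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
            self.proj_out = nn.Linear(d_model, vocab_size + 1)  # +1 for the CTC blank symbol

        def forward(self, x):                       # x: (batch, time, feat_dim)
            h = self.encoder(self.proj_in(x))       # (batch, time, d_model)
            # (batch, time, vocab+1) log-probs; permute to (time, batch, classes) for nn.CTCLoss
            return self.proj_out(h).log_softmax(-1)

    class Seq2SeqLipReader(nn.Module):
        """Transformer encoder-decoder trained with a cross-entropy seq2seq loss."""
        def __init__(self, feat_dim=512, vocab_size=40, d_model=512, nhead=8, num_layers=6):
            super().__init__()
            self.proj_in = nn.Linear(feat_dim, d_model)
            self.embed = nn.Embedding(vocab_size, d_model)
            self.transformer = nn.Transformer(d_model=d_model, nhead=nhead,
                                              num_encoder_layers=num_layers,
                                              num_decoder_layers=num_layers,
                                              batch_first=True)
            self.proj_out = nn.Linear(d_model, vocab_size)

        def forward(self, x, tgt_tokens):           # tgt_tokens: (batch, tgt_len)
            tgt = self.embed(tgt_tokens)
            # Causal mask so each output position only attends to previous targets
            mask = self.transformer.generate_square_subsequent_mask(tgt.size(1)).to(tgt.device)
            out = self.transformer(self.proj_in(x), tgt, tgt_mask=mask)
            return self.proj_out(out)               # logits for nn.CrossEntropyLoss

Both variants consume the same feature sequences (visual, and optionally audio, in the paper); the difference lies only in how the output transcription is decoded and supervised.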