Hearing Systems, Department of Health Technology, Technical University of Denmark, Kgs. Lyngby, Denmark.
Department of Applied Mathematics and Computer Science, Technical University of Denmark, Kgs. Lyngby, Denmark.
PLoS Comput Biol. 2022 Jul 19;18(7):e1010273. doi: 10.1371/journal.pcbi.1010273. eCollection 2022 Jul.
Temporal synchrony between facial motion and acoustic modulations is a hallmark feature of audiovisual (AV) speech. The moving face and mouth during natural speech are known to be correlated with low-frequency acoustic envelope fluctuations (below 10 Hz), but the precise rates at which envelope information is synchronized with motion in different parts of the face are less clear. Here, we used regularized canonical correlation analysis (rCCA) to learn speech envelope filters whose outputs correlate with motion in different parts of the speaker's face. We leveraged recent advances in video-based 3D facial landmark estimation, allowing us to examine statistical envelope-face correlations across a large number of speakers (∼4000). Specifically, rCCA was used to learn modulation transfer functions (MTFs) for the speech envelope whose outputs significantly correlate with facial motion across different speakers. The AV analysis revealed bandpass speech envelope filters at distinct temporal scales. A first set of MTFs showed peaks around 3-4 Hz and was correlated with mouth movements. A second set of MTFs captured envelope fluctuations in the 1-2 Hz range correlated with more global face and head motion. These two distinct timescales emerged only as a property of natural AV speech statistics across many speakers. A similar analysis of fewer speakers performing a controlled speech task highlighted only the well-known temporal modulations around 4 Hz correlated with orofacial motion. The different bandpass ranges of AV correlation align notably with the average rates at which syllables (3-4 Hz) and phrases (1-2 Hz) are produced in natural speech. Whereas periodicities at the syllable rate are evident in the envelope spectrum of the speech signal itself, the slower 1-2 Hz regularities only become prominent when considering crossmodal signal statistics. This may indicate a motor origin of temporal regularities at the timescales of syllables and phrases in natural speech.
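The core method named in the abstract, regularized CCA between two multivariate signals (e.g., time-lagged envelope features and facial-landmark motion), can be sketched with the standard generalized-eigenvalue formulation. This is a minimal illustrative implementation, not the authors' code: the ridge parameter, feature construction, and function name `rcca` are assumptions for the sketch.

```python
import numpy as np

def rcca(X, Y, reg=1e-3, n_components=2):
    """Regularized canonical correlation analysis (illustrative sketch).

    X: (n_samples, dx) array, e.g. time-lagged speech-envelope features
    Y: (n_samples, dy) array, e.g. facial-landmark motion features
    reg: ridge regularization added to the auto-covariances (assumed value)
    Returns canonical correlations rho and projection weights (Wx, Wy).
    """
    # Center both views and form (regularized) covariance matrices.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n

    # Solve (Cxx^-1 Cxy Cyy^-1 Cyx) wx = rho^2 wx for the X-side weights.
    M = np.linalg.solve(Cxx, Cxy) @ np.linalg.solve(Cyy, Cxy.T)
    evals, evecs = np.linalg.eig(M)
    order = np.argsort(-evals.real)[:n_components]
    rho = np.sqrt(np.clip(evals.real[order], 0.0, 1.0))
    Wx = evecs[:, order].real
    # Corresponding Y-side weights (defined up to scale).
    Wy = np.linalg.solve(Cyy, Cxy.T) @ Wx
    return rho, Wx, Wy
```

In this framing, each column of `Wx` applied to a bank of time-lagged envelope samples acts as a learned temporal filter on the envelope, whose frequency response can then be inspected, which is how bandpass MTFs like the 3-4 Hz and 1-2 Hz components described above could be read out.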