Hearing Systems, Department of Health Technology, Technical University of Denmark, Kgs. Lyngby, Denmark.
Department of Applied Mathematics and Computer Science, Technical University of Denmark, Kgs. Lyngby, Denmark.
PLoS Comput Biol. 2022 Jul 19;18(7):e1010273. doi: 10.1371/journal.pcbi.1010273. eCollection 2022 Jul.
Temporal synchrony between facial motion and acoustic modulations is a hallmark feature of audiovisual (AV) speech. The moving face and mouth during natural speech are known to be correlated with low-frequency acoustic envelope fluctuations (below 10 Hz), but the precise rates at which envelope information is synchronized with motion in different parts of the face are less clear. Here, we used regularized canonical correlation analysis (rCCA) to learn speech envelope filters whose outputs correlate with motion in different parts of the speaker's face. We leveraged recent advances in video-based 3D facial landmark estimation, allowing us to examine statistical envelope-face correlations across a large number of speakers (∼4000). Specifically, rCCA was used to learn modulation transfer functions (MTFs) for the speech envelope whose outputs significantly correlate with facial motion across different speakers. The AV analysis revealed bandpass speech envelope filters at distinct temporal scales. A first set of MTFs showed peaks around 3-4 Hz and was correlated with mouth movements. A second set of MTFs captured envelope fluctuations in the 1-2 Hz range correlated with more global face and head motion. These two distinct timescales emerged only as a property of natural AV speech statistics across many speakers. A similar analysis of fewer speakers performing a controlled speech task highlighted only the well-known temporal modulations around 4 Hz correlated with orofacial motion. The different bandpass ranges of AV correlation align notably with the average rates at which syllables (3-4 Hz) and phrases (1-2 Hz) are produced in natural speech. Whereas periodicities at the syllable rate are evident in the envelope spectrum of the speech signal itself, the slower 1-2 Hz regularities only become prominent when considering crossmodal signal statistics. This may indicate a motor origin of temporal regularities at the timescales of syllables and phrases in natural speech.
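The core method named in the abstract, regularized CCA between two multivariate signals (e.g., time-lagged envelope features and facial-landmark motion), can be sketched with the standard generalized-eigenvalue formulation. This is a minimal illustrative implementation, not the authors' code: the ridge parameter, feature construction, and function name `rcca` are assumptions for the sketch.

```python
import numpy as np

def rcca(X, Y, reg=1e-3, n_components=2):
    """Regularized canonical correlation analysis (illustrative sketch).

    X: (n_samples, dx) array, e.g. time-lagged speech-envelope features
    Y: (n_samples, dy) array, e.g. facial-landmark motion features
    reg: ridge regularization added to the auto-covariances (assumed value)
    Returns canonical correlations rho and projection weights (Wx, Wy).
    """
    # Center both views and form (regularized) covariance matrices.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n

    # Solve (Cxx^-1 Cxy Cyy^-1 Cyx) wx = rho^2 wx for the X-side weights.
    M = np.linalg.solve(Cxx, Cxy) @ np.linalg.solve(Cyy, Cxy.T)
    evals, evecs = np.linalg.eig(M)
    order = np.argsort(-evals.real)[:n_components]
    rho = np.sqrt(np.clip(evals.real[order], 0.0, 1.0))
    Wx = evecs[:, order].real
    # Corresponding Y-side weights (defined up to scale).
    Wy = np.linalg.solve(Cyy, Cxy.T) @ Wx
    return rho, Wx, Wy
```

In this framing, each column of `Wx` applied to a bank of time-lagged envelope samples acts as a learned temporal filter on the envelope, whose frequency response can then be inspected, which is how bandpass MTFs like the 3-4 Hz and 1-2 Hz components described above could be read out.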