Department of Neuroscience, Max-Planck-Institute for Empirical Aesthetics, Frankfurt, 60322, Germany.
Department of Neurosurgery, Duke University, Durham, NC, USA, 27710.
Neuroimage. 2019 Nov 15;202:116152. doi: 10.1016/j.neuroimage.2019.116152. Epub 2019 Sep 1.
Segmenting the continuous speech stream into units for further perceptual and linguistic analyses is fundamental to speech recognition. The speech amplitude envelope (SE) has long been considered a fundamental temporal cue for segmenting speech. Does the temporal fine structure (TFS), a significant part of speech signals often considered to contain primarily spectral information, contribute to speech segmentation? Using magnetoencephalography, we show that the TFS entrains cortical responses between 3 and 6 Hz and demonstrate, using mutual information analysis, that (i) the temporal information in the TFS can be reconstructed from a measure of frame-to-frame spectral change and correlates with the SE and (ii) spectral resolution is key to the extraction of such temporal information. Furthermore, we show behavioural evidence that, when the SE is temporally distorted, the TFS provides cues for speech segmentation and significantly aids speech recognition. Our findings show that it is insufficient to investigate solely the SE to understand temporal speech segmentation, as the SE and the TFS derived from a band-filtering method convey comparable, if not inseparable, temporal information. We argue for a more synthetic view of speech segmentation: the auditory system groups speech signals coherently in both temporal and spectral domains.
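For readers unfamiliar with the SE/TFS distinction, the following is a minimal sketch of how one frequency band of a signal can be split into an envelope and a fine structure. It assumes the common analytic-signal (Hilbert-type) decomposition; the abstract says only that the TFS is "derived from a band-filtering method", so this may differ from the authors' actual procedure. The function name `band_envelope_and_tfs`, the brick-wall FFT band filter, and the toy signal are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def band_envelope_and_tfs(x, fs, f_lo, f_hi):
    """Split one frequency band of x into envelope and fine structure.

    Hypothetical helper: uses an ideal (brick-wall) FFT band filter and the
    analytic signal. Requires 0 < f_lo < f_hi < fs / 2.
    """
    n = len(x)
    spectrum = np.fft.fft(x)
    freqs = np.fft.fftfreq(n, d=1.0 / fs)
    # Keep only positive frequencies inside the band and double them:
    # this yields the analytic signal of the band-limited signal.
    mask = (freqs >= f_lo) & (freqs <= f_hi)
    analytic = np.fft.ifft(2.0 * spectrum * mask)
    envelope = np.abs(analytic)        # slowly varying amplitude (SE-like)
    tfs = np.cos(np.angle(analytic))   # unit-amplitude carrier (TFS-like)
    return envelope, tfs

# Toy example (not speech): a 1 kHz carrier amplitude-modulated at 4 Hz,
# a rate inside the 3-6 Hz range mentioned in the abstract.
fs = 16000
t = np.arange(fs) / fs                       # exactly 1 second of samples
modulator = 1.0 + 0.8 * np.sin(2 * np.pi * 4 * t)
carrier = np.sin(2 * np.pi * 1000 * t)
x = modulator * carrier

envelope, tfs = band_envelope_and_tfs(x, fs, 800.0, 1200.0)
```

In this idealised case the envelope recovers the 4 Hz modulator and the fine structure recovers the 1 kHz carrier. With real speech and narrower bands the two components are less cleanly separable, which is consistent with the abstract's point that the SE and the TFS convey comparable temporal information.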