O'Sullivan Aisling E, Crosse Michael J, Di Liberto Giovanni M, Lalor Edmund C
School of Engineering, Trinity College Dublin, Dublin, Ireland; Trinity Centre for Bioengineering, Trinity College Dublin, Dublin, Ireland.
Department of Pediatrics and Department of Neuroscience, Albert Einstein College of Medicine, Bronx, NY, USA.
Front Hum Neurosci. 2017 Jan 11;10:679. doi: 10.3389/fnhum.2016.00679. eCollection 2016.
Speech is a multisensory percept, comprising auditory and visual components. While the content and processing pathways of auditory speech have been well characterized, the visual component is less well understood. In this work, we expand current system-identification methodologies to introduce a framework that facilitates the study of visual speech in its natural, continuous form. Specifically, we use models based on the unheard acoustic envelope (E), the motion signal (M), and categorical visual speech features (V) to predict EEG activity during silent lipreading. Our results show that each of these models performs similarly at predicting EEG in visual regions, and that combinations of the individual models (EV, MV, EM, and EMV) predict the neural activity better than their constituent models. In comparing these combinations, we find that the model incorporating all three feature types (EMV) outperforms the individual models, as well as the EV and MV models, while performing similarly to the EM model. Importantly, EM does not outperform EV or MV, which, given the higher dimensionality of the V model, suggests that more data are needed to clarify this finding. Nevertheless, the performance of EMV, together with per-subject comparisons of the three individual models, provides further evidence that visual regions are involved in both low-level processing of stimulus dynamics and categorical speech perception. This framework may prove useful for investigating modality-specific processing of visual speech under naturalistic conditions.
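For readers unfamiliar with this class of system-identification analysis, the sketch below illustrates the general idea: a time-lagged linear forward (encoding) model, fit with ridge regression, predicts EEG from a stimulus representation, and feature models are compared by how well they predict held-out EEG. This is a minimal sketch in the spirit of approaches such as the mTRF framework, not the paper's actual pipeline; all array shapes, parameter values, and function names are illustrative assumptions, and the data here are random placeholders.

```python
# Sketch: lagged linear forward model predicting EEG from stimulus features,
# with feature models compared by held-out prediction accuracy.
# All shapes, names, and parameter values are illustrative assumptions.
import numpy as np

def lag_matrix(stim, n_lags):
    """Build a design matrix of time-lagged copies of the stimulus.

    stim   : (n_samples, n_features) stimulus representation (e.g. E, M, or V)
    n_lags : number of time lags (in samples) to include
    Returns an (n_samples, n_features * n_lags) design matrix.
    """
    n_samples, n_features = stim.shape
    X = np.zeros((n_samples, n_features * n_lags))
    for lag in range(n_lags):
        X[lag:, lag * n_features:(lag + 1) * n_features] = stim[:n_samples - lag]
    return X

def fit_ridge(X, y, lam):
    """Closed-form ridge regression: w = (X'X + lam*I)^-1 X'y."""
    n_params = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_params), X.T @ y)

def prediction_accuracy(stim, eeg, n_lags=32, lam=1e3):
    """Fit on the first half of the data, test on the second half, and
    return the Pearson correlation between predicted and recorded EEG
    for each channel."""
    X = lag_matrix(stim, n_lags)
    half = len(X) // 2
    w = fit_ridge(X[:half], eeg[:half], lam)
    pred, true = X[half:] @ w, eeg[half:]
    pred_z = (pred - pred.mean(0)) / pred.std(0)
    true_z = (true - true.mean(0)) / true.std(0)
    return (pred_z * true_z).mean(0)  # per-channel correlation

# Toy usage: compare an individual model (E) with a combined model (EMV),
# where combination is concatenation along the feature axis.
rng = np.random.default_rng(0)
n = 5000
E = rng.standard_normal((n, 1))      # unheard acoustic envelope (1 feature)
M = rng.standard_normal((n, 1))      # motion signal (1 feature)
V = rng.standard_normal((n, 10))     # categorical visual speech features
eeg = rng.standard_normal((n, 128))  # 128-channel EEG (placeholder data)

r_E = prediction_accuracy(E, eeg)
r_EMV = prediction_accuracy(np.hstack([E, M, V]), eeg)
print(f"mean r, E model:   {r_E.mean():.4f}")
print(f"mean r, EMV model: {r_EMV.mean():.4f}")
```

In this setup, a combined model can only outperform its constituents if the extra features carry predictive information beyond what regularization penalizes, which is why the abstract's caveat about the dimensionality of the V model matters when interpreting model comparisons.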