Department of Bioengineering and Centre for Neurotechnology, Imperial College London, London, United Kingdom.
Department of Artificial Intelligence in Biomedical Engineering, Friedrich-Alexander-University Erlangen-Nürnberg, Erlangen, Germany.
J Acoust Soc Am. 2023 May 1;153(5):3130. doi: 10.1121/10.0019460.
Seeing a speaker's face can help substantially with understanding their speech, particularly in challenging listening conditions. Research into the neurobiological mechanisms behind audiovisual integration has recently begun to employ continuous natural speech. However, these efforts are impeded by a lack of high-quality audiovisual recordings of a speaker narrating a longer text. Here, we seek to close this gap by developing AVbook, an audiovisual speech corpus designed for cognitive neuroscience studies and audiovisual speech recognition. The corpus consists of 3.6 h of audiovisual recordings of two speakers, one male and one female, each reading 59 passages from a narrative English text. The recordings were acquired at a high frame rate of 119.88 frames/s. The corpus includes phone-level alignment files and a set of multiple-choice questions to test attention to the different passages. We verified the efficacy of these questions in a pilot study. A short written summary is also provided for each recording. To enable audiovisual synchronization when presenting the stimuli, four videos of an electronic clapperboard were recorded with the corpus. The corpus is publicly available to support research into the neurobiology of audiovisual speech processing as well as the development of computer algorithms for audiovisual speech recognition.
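When presenting the stimuli, the clapperboard videos can be used to relate a video frame index to a position in the audio track. The following is a minimal sketch of that arithmetic, not the authors' procedure. It takes the stated frame rate of 119.88 frames/s at face value; the audio sampling rate of 48 kHz and all function names are illustrative assumptions that do not come from the corpus documentation.

```python
from fractions import Fraction

# Frame rate as stated in the abstract (119.88 frames/s), kept exact.
FRAME_RATE = Fraction(11988, 100)
# Hypothetical audio sampling rate; check the actual corpus files.
AUDIO_RATE = 48_000


def frame_to_time(frame_index):
    """Return the onset time in seconds of a video frame (frame 0 at t = 0)."""
    return frame_index / FRAME_RATE


def frame_to_sample(frame_index, audio_rate=AUDIO_RATE):
    """Return the audio sample index closest to the onset of a video frame."""
    return round(frame_to_time(frame_index) * audio_rate)


def av_offset_seconds(clap_frame, clap_sample, audio_rate=AUDIO_RATE):
    """Return the audio-minus-video offset in seconds for one clap event.

    clap_frame:  video frame in which the clapperboard closes
    clap_sample: audio sample at which the clap sound begins
    A positive result means the audio lags the video.
    """
    return Fraction(clap_sample, audio_rate) - frame_to_time(clap_frame)


if __name__ == "__main__":
    print(float(frame_to_time(2997)))                    # onset time in seconds
    print(frame_to_sample(2997))                         # matching audio sample
    print(float(av_offset_seconds(2997, 1_200_480)))     # measured A/V offset
```

Exact rational arithmetic avoids the rounding drift that accumulates when a non-integer frame rate is multiplied over many thousands of frames in floating point.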