Futrell Richard, Gibson Edward, Tily Harry J, Blank Idan, Vishnevetsky Anastasia, Piantadosi Steven T, Fedorenko Evelina
University of California, Irvine, USA.
Massachusetts Institute of Technology, Cambridge, USA.
Lang Resour Eval. 2021;55(1):63-77. doi: 10.1007/s10579-020-09503-7. Epub 2020 Sep 4.
It is now common practice to compare models of human language processing by how well they predict behavioral and neural measures of processing difficulty, such as reading times, on corpora of rich naturalistic linguistic materials. However, many of these corpora, which are based on naturally occurring text, do not contain many of the low-frequency syntactic constructions that are often required to distinguish between processing theories. Here we describe a new corpus consisting of English texts edited to contain many low-frequency syntactic constructions while still sounding fluent to native speakers. The corpus is annotated with hand-corrected Penn Treebank-style parse trees and includes self-paced reading time data and aligned audio recordings. We give an overview of the content of the corpus, review recent work using the corpus, and release the data.