The Natural Stories corpus: a reading-time corpus of English texts containing rare syntactic constructions.

Authors

Richard Futrell, Edward Gibson, Harry J. Tily, Idan Blank, Anastasia Vishnevetsky, Steven T. Piantadosi, Evelina Fedorenko

Affiliations

University of California, Irvine, USA.

Massachusetts Institute of Technology, Cambridge, USA.

Publication information

Lang Resour Eval. 2021;55(1):63-77. doi: 10.1007/s10579-020-09503-7. Epub 2020 Sep 4.

DOI: 10.1007/s10579-020-09503-7

PMID: 34720781

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC8549930/

Abstract

It is now a common practice to compare models of human language processing by comparing how well they predict behavioral and neural measures of processing difficulty, such as reading times, on corpora of rich naturalistic linguistic materials. However, many of these corpora, which are based on naturally-occurring text, do not contain many of the low-frequency syntactic constructions that are often required to distinguish between processing theories. Here we describe a new corpus consisting of English texts edited to contain many low-frequency syntactic constructions while still sounding fluent to native speakers. The corpus is annotated with hand-corrected Penn Treebank-style parse trees and includes self-paced reading time data and aligned audio recordings. We give an overview of the content of the corpus, review recent work using the corpus, and release the data.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/238e/8549930/afb7c0587f14/10579_2020_9503_Fig1_HTML.jpg
