Institute for Creative Technologies, University of Southern California, CA 90292, USA.
J Child Lang. 2010 Jun;37(3):705-29. doi: 10.1017/S0305000909990407. Epub 2010 Mar 25.
Corpora of child language are essential for research in child language acquisition and psycholinguistics. Linguistic annotation of the corpora provides researchers with better means for exploring the development of grammatical constructions and their usage. We describe a project whose goal is to annotate the English section of the CHILDES database with grammatical relations in the form of labeled dependency structures. We have produced a corpus of over 18,800 utterances (approximately 65,000 words) with manually curated gold-standard grammatical relation annotations. Using this corpus, we have developed a highly accurate data-driven parser for the English CHILDES data, which we used to automatically annotate the remainder of the English section of CHILDES. We have also extended the parser to Spanish, and are currently working on supporting more languages. The parser and the manually and automatically annotated data are freely available for research purposes.
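As a minimal sketch of what a labeled dependency structure looks like as data, the Python below represents each grammatical relation as a dependent position, a head position, and a label. The `index|head|label` text encoding, the example utterance, and the label names (`SUBJ`, `ROOT`, `OBJ`) are assumptions chosen for illustration; they are not taken from the abstract and may not match the project's actual annotation scheme.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Relation:
    index: int   # 1-based position of the dependent word
    head: int    # 1-based position of the head word; 0 marks the root
    label: str   # grammatical relation label (illustrative names only)


def parse_relations(encoded: str) -> List[Relation]:
    """Parse space-separated 'index|head|label' triples (assumed encoding)."""
    relations = []
    for token in encoded.split():
        index, head, label = token.split("|")
        relations.append(Relation(int(index), int(head), label))
    return relations


def describe(words: List[str], relations: List[Relation]) -> List[str]:
    """Render each relation as LABEL(head_word, dependent_word)."""
    lines = []
    for rel in relations:
        head_word = "ROOT" if rel.head == 0 else words[rel.head - 1]
        lines.append(f"{rel.label}({head_word}, {words[rel.index - 1]})")
    return lines


# Hypothetical utterance and annotation, for illustration only.
words = ["she", "eats", "cookies"]
relations = parse_relations("1|2|SUBJ 2|0|ROOT 3|2|OBJ")
```

Calling `describe(words, relations)` lists each word with its head and relation label, which is the kind of information a dependency parser assigns to every word in an utterance.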