Department of Education, University of Oxford, Oxford, UK.
Department of Speech and Hearing, Manipal College of Health Professions, Manipal Academy of Higher Education, Manipal, India.
Behav Res Methods. 2024 Apr;56(4):2751-2764. doi: 10.3758/s13428-024-02339-x. Epub 2024 Feb 15.
Child-directed print corpora enable systematic psycholinguistic investigations, but this research infrastructure is not available for many understudied languages. Moreover, researchers of understudied languages depend on manual tagging because precise automated parsers are not yet available. One plausible way forward is to limit the intensive work to a small corpus. However, with little systematic enquiry into approaches to corpus construction, it is unclear how robust a small corpus can be made. The current study examines the potential of a non-sequential sampling protocol for small corpus development (NSP-SCD) through cross-corpora and within-corpus analyses. A corpus of 17,584 words was developed by applying the protocol to a larger corpus of 150,595 words from children's books for 3-to-10-year-olds. Although the larger corpus will by definition contain more unique words and unique orthographic units, the selectively sampled small corpus approximated the larger corpus in lexical and orthographic diversity and was equivalent in orthographic representation and word length. Psycholinguistic complexity increased with book level and varied by part of speech. Finally, in a robustness check of lexical diversity, the non-sequentially sampled small corpus was more efficient than a same-sized corpus constructed by simply using all sentences from a few books (402 books vs. seven books). If a small corpus must be used, then non-sequential sampling from books stratified by book level makes the corpus statistics better approximate those found in larger corpora. Overall, the protocol shows promise as a tool to advance the science of child language acquisition in understudied languages.
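The contrast drawn in the robustness check can be illustrated with a toy sketch. The abstract does not give the sampling details of NSP-SCD, so everything below is an assumption made for illustration: the synthetic books, the round-robin draw across book levels, the token budget, and the function names `non_sequential_sample` and `whole_book_sample` are all hypothetical. The sketch only shows why drawing scattered sentences from many level-stratified books tends to yield more unique words than taking every sentence from a few books at the same corpus size.

```python
import random
from collections import defaultdict


def token_count(sentences):
    """Total number of whitespace-separated tokens."""
    return sum(len(s.split()) for s in sentences)


def type_count(sentences):
    """Number of unique word forms, a crude lexical diversity measure."""
    return len({w for s in sentences for w in s.split()})


def non_sequential_sample(books, budget, rng):
    """Hypothetical stand-in for the protocol: cycle through book levels,
    pick a random book at that level, and take one random unused sentence,
    until the token budget is reached."""
    by_level = defaultdict(list)
    for level, sentences in books:
        pool = list(sentences)
        rng.shuffle(pool)  # popping from a shuffled pool gives a random sentence
        by_level[level].append(pool)
    sample, tokens = [], 0
    while tokens < budget:
        progressed = False
        for level in sorted(by_level):
            candidates = [pool for pool in by_level[level] if pool]
            if not candidates:
                continue
            sentence = rng.choice(candidates).pop()
            sample.append(sentence)
            tokens += len(sentence.split())
            progressed = True
            if tokens >= budget:
                break
        if not progressed:  # every book is exhausted
            break
    return sample


def whole_book_sample(books, budget, rng):
    """Baseline: take all sentences from randomly chosen books until the
    token budget is reached."""
    order = list(books)
    rng.shuffle(order)
    sample, tokens = [], 0
    for _, sentences in order:
        if tokens >= budget:
            break
        sample.extend(sentences)
        tokens += token_count(sentences)
    return sample


# Synthetic data (an assumption): 40 books over 4 levels, each with
# 20 five-word sentences built from a 10-word book-specific vocabulary.
toy_books = [
    (i % 4, [" ".join(f"b{i}w{(j + k) % 10}" for k in range(5)) for j in range(20)])
    for i in range(40)
]

rng = random.Random(0)
budget = 300  # tokens in each small corpus
nsp_corpus = non_sequential_sample(toy_books, budget, rng)
whole_corpus = whole_book_sample(toy_books, budget, rng)

print("non-sequential:", token_count(nsp_corpus), "tokens,", type_count(nsp_corpus), "types")
print("whole books:   ", token_count(whole_corpus), "tokens,", type_count(whole_corpus), "types")
```

In this toy setup each book contributes its own vocabulary, which exaggerates the effect. Real children's books share many words, so the gap would be smaller and would need to be measured on the actual corpus.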