The Children's Picture Books Lexicon (CPB-LEX): A large-scale lexical database from children's picture books.

Affiliations

Faculty of Education, University of Hong Kong, Pok Fu Lam, Hong Kong.

Senior Lecturer, Centre for Smart Analytics & Institute of Innovation, Science and Sustainability, Federation University Australia, Mount Helen, Australia.

Publication information

Behav Res Methods. 2024 Aug;56(5):4504-4521. doi: 10.3758/s13428-023-02198-y. Epub 2023 Aug 11.

Abstract

This article presents CPB-LEX, a large-scale database of lexical statistics derived from children's picture books (age range 0-8 years). Such a database is essential for research in psychology, education and computational modelling, where rich details on the vocabulary of early print exposure are required. CPB-LEX was built through an innovative method of computationally extracting lexical information from automatic speech-to-text captions and subtitle tracks generated from social media channels dedicated to reading picture books aloud. It consists of approximately 25,585 types (wordforms) and their frequency norms (raw and Zipf-transformed), a lexicon of bigrams (two-word sequences and their transitional probabilities) and a document-term matrix (which shows the importance of each word in the corpus in each book). Several immediate contributions of CPB-LEX to behavioural science research are reported, including that the new CPB-LEX frequency norms strongly predict age of acquisition and outperform comparable child-input lexical databases. The database allows researchers and practitioners to extract lexical statistics for high-frequency words which can be used to develop word lists. The paper concludes with an investigation of how CPB-LEX can be used to extend recent modelling research on the lexical diversity children receive from picture books in addition to child-directed speech. Our model shows that the vocabulary input from a relatively small number of picture books can dramatically enrich vocabulary exposure from child-directed speech and potentially assist children with vocabulary input deficits. The database is freely available from the Open Science Framework repository: https://tinyurl.com/4este73c .
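The two kinds of lexical statistics named in the abstract, Zipf-transformed frequencies and bigram transitional probabilities, can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names and the toy token list are hypothetical, the Zipf value is assumed to be the common unsmoothed form (log10 of frequency per million words, plus 3), and the transitional probability is assumed to be the forward probability P(second word | first word). CPB-LEX itself may apply smoothing or other preprocessing not shown here.

```python
import math
from collections import Counter


def zipf_values(tokens):
    """Map each wordform to an (unsmoothed) Zipf value.

    Assumption: Zipf = log10(frequency per million words) + 3.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    return {
        word: math.log10(count / total * 1_000_000) + 3
        for word, count in counts.items()
    }


def transitional_probabilities(tokens):
    """Map each bigram (a, b) to the forward probability P(b | a).

    P(b | a) = count(a followed by b) / count(a in first position).
    """
    first_counts = Counter(tokens[:-1])
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    return {
        (a, b): count / first_counts[a]
        for (a, b), count in bigram_counts.items()
    }


# Hypothetical toy "transcript"; real input would be the cleaned caption text.
tokens = "the cat sat on the mat and the cat ran".split()

zipf = zipf_values(tokens)
tp = transitional_probabilities(tokens)

print(round(zipf["the"], 2))
print(tp[("the", "cat")])
```

In such a tiny sample the Zipf values are inflated, because every word accounts for a large share of the tokens; the scale is only meaningful on a corpus of realistic size.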
