MANULEX: a grade-level lexical database from French elementary school readers.

Author information

Lété Bernard, Sprenger-Charolles Liliane, Colé Pascale

Affiliations

INRP, CNRS (UMR 6057) and Université de Provence, Aix-en-Provence, France.

Publication information

Behav Res Methods Instrum Comput. 2004 Feb;36(1):156-66. doi: 10.3758/bf03195560.

DOI:10.3758/bf03195560

PMID:15190710

Abstract

This article presents MANULEX, a Web-accessible database that provides grade-level word frequency lists of nonlemmatized and lemmatized words (48,886 and 23,812 entries, respectively) computed from the 1.9 million words taken from 54 French elementary school readers. Word frequencies are provided for four levels: first grade (G1), second grade (G2), third to fifth grades (G3-5), and all grades (G1-5). The frequencies were computed following the methods described by Carroll, Davies, and Richman (1971) and Zeno, Ivens, Millard, and Duvvuri (1995), with four statistics at each level (F, overall word frequency; D, index of dispersion across the selected readers; U, estimated frequency per million words; and SFI, standard frequency index). The database also provides the number of letters in the word and syntactic category information. MANULEX is intended to be a useful tool for studying language development through the selection of stimuli based on precise frequency norms. Researchers in artificial intelligence can also use it as a source of information on natural language processing to simulate written language acquisition in children. Finally, it may serve an educational purpose by providing basic vocabulary lists.
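The four statistics named in the abstract can be illustrated with a minimal Python sketch. It assumes the formulation usually attributed to Carroll, Davies, and Richman (1971): D as a normalized entropy over per-reader relative frequencies, U as a dispersion-adjusted frequency per million, and SFI = 10(log10 U + 4). The abstract does not give these formulas, so the exact definitions (in particular, taking p_i as the word's relative frequency in reader i) are assumptions, and the function name `word_stats` and the sample counts are hypothetical; the article itself should be consulted for the definitions actually used in MANULEX.

```python
import math


def word_stats(counts, sizes):
    """Return (F, D, U, SFI) for one word.

    counts[i]: occurrences of the word in reader i (hypothetical input)
    sizes[i]:  total number of word tokens in reader i
    Assumed formulas (Carroll et al., 1971, as commonly stated):
      D   = [log(sum p_i) - sum(p_i log p_i) / sum p_i] / log(n)
      U   = (1e6 / N) * [F * D + (1 - D) * f_min],  f_min = sum(f_i * s_i) / N
      SFI = 10 * (log10(U) + 4)
    """
    n = len(counts)
    if n < 2 or n != len(sizes):
        raise ValueError("need matching counts and sizes for at least two readers")

    total_tokens = sum(sizes)   # N: corpus size
    freq = sum(counts)          # F: overall word frequency
    if freq == 0:
        raise ValueError("the word does not occur in the corpus")

    # Assumption: p_i is the relative frequency of the word in reader i.
    p = [f / s for f, s in zip(counts, sizes)]
    p_sum = sum(p)
    # Terms with p_i = 0 contribute nothing (0 * log 0 is treated as 0).
    weighted_log = sum(pi * math.log(pi) for pi in p if pi > 0) / p_sum
    dispersion = (math.log(p_sum) - weighted_log) / math.log(n)

    # Size-weighted frequency, used when the word is poorly dispersed.
    f_min = sum(f * s for f, s in zip(counts, sizes)) / total_tokens
    per_million = (1_000_000 / total_tokens) * (
        freq * dispersion + (1 - dispersion) * f_min
    )
    sfi = 10 * (math.log10(per_million) + 4)
    return freq, dispersion, per_million, sfi


if __name__ == "__main__":
    # Hypothetical word spread evenly over four readers of equal size.
    print(word_stats([10, 10, 10, 10], [1000, 1000, 1000, 1000]))
    # Same total count, but concentrated in a single reader.
    print(word_stats([40, 0, 0, 0], [1000, 1000, 1000, 1000]))
```

Under these assumed formulas, an evenly spread word has D close to 1 and U equal to its raw rate per million, whereas a word confined to one reader has D close to 0 and a correspondingly lower U and SFI.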
