Botarleanu Robert-Mihai, Dascalu Mihai, Watanabe Micah, Crossley Scott Andrew, McNamara Danielle S
University Politehnica of Bucharest, Bucharest, Romania.
Academy of Romanian Scientists, Bucharest, Romania.
Behav Res Methods. 2022 Dec;54(6):3015-3042. doi: 10.3758/s13428-022-01797-5. Epub 2022 Feb 15.
Age of acquisition (AoA) is a measure of word complexity that refers to the age at which a word is typically learned. AoA measures have shown strong correlations with reading comprehension, lexical decision times, and writing quality. AoA scores based on both adult and child data have limitations that introduce measurement error and increase the cost and effort of producing them. In this paper, we introduce Age of Exposure (AoE) version 2, a proxy for human exposure to new vocabulary terms that expands AoA word lists by training regressors to predict AoA scores. Word2vec word embeddings are trained on cumulatively increasing corpora of texts, word exposure trajectories are generated by aligning the word2vec vector spaces, and features of words are derived for modeling AoA scores. Our prediction models achieve low errors (ranging from 13% with a corresponding R of .35 to 7% with an R of .74), can be uniformly applied to different AoA word lists, and generalize to the entire vocabulary of a language. Our method benefits from using existing readability indices to define the order of texts in the corpora, and our analyses confirm that the generated AoA scores accurately predict the difficulty of texts (R of .84, surpassing related previous work). Further, we provide evidence of the internal reliability of our word trajectory features, demonstrate their effectiveness when contrasted with simple lexical features, and show that excluding features that rely on external resources does not significantly impact performance.
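The pipeline described in the abstract (ordering texts by readability, growing the corpus cumulatively, and tracking how a word's vector changes across stages) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `cumulative_corpora`, `cosine`, and `exposure_trajectory` are hypothetical, the equal-sized stages are an assumption, and the choice of "cosine similarity to the final-stage vector" as a trajectory feature is only one plausible example of a feature. The sketch also assumes the per-stage word vectors have already been trained and aligned into a common space, which the abstract names as a separate step.

```python
from math import sqrt


def cumulative_corpora(texts, difficulty, n_stages):
    """Order texts from easiest to hardest and return cumulatively growing slices.

    `difficulty` holds one readability score per text (higher = harder).
    Stage k contains every text from stages 1..k-1 plus the next batch.
    """
    ordered = [t for _, t in sorted(zip(difficulty, texts), key=lambda p: p[0])]
    stages = []
    for k in range(1, n_stages + 1):
        # Ceiling division so the last stage always covers the full corpus.
        end = (k * len(ordered) + n_stages - 1) // n_stages
        stages.append(ordered[:end])
    return stages


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def exposure_trajectory(stage_vectors):
    """Similarity of a word's vector at each stage to its final-stage vector.

    Assumes the vectors come from embedding spaces that were already aligned,
    so that comparing them across stages is meaningful.
    """
    final = stage_vectors[-1]
    return [cosine(v, final) for v in stage_vectors]


if __name__ == "__main__":
    # Toy data: four texts with made-up readability scores.
    texts = ["text_d", "text_a", "text_c", "text_b"]
    difficulty = [4.0, 1.0, 3.0, 2.0]
    for i, stage in enumerate(cumulative_corpora(texts, difficulty, 2), start=1):
        print(f"stage {i}: {stage}")

    # Toy vectors for one word at three stages (illustrative values only).
    print(exposure_trajectory([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]))
```

In a fuller version, each stage would be used to train a separate word2vec model, and the resulting trajectory values would serve as input features for a regressor fitted to an existing AoA word list.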