Boudelaa Sami, Carreiras Manuel, Jariya Nazrin, Perea Manuel
Department of Cognitive Sciences, United Arab Emirates University, Al Ain, 15551, UAE.
Basque Center for Cognition, Brain, and Language, Donostia-San Sebastián, Spain.
Behav Res Methods. 2025 Feb 26;57(4):104. doi: 10.3758/s13428-024-02560-8.
This article presents SUBTLEX-AR, a digital database providing an extensive collection of attributes related to Modern Standard Arabic words (Arabic for short). SUBTLEX-AR combines a novel dataset of 120 million word tokens from movie subtitles with 40 million tokens from newspaper articles originally collected in ARALEX (Boudelaa & Marslen-Wilson, Behavior Research Methods, 42, 481-487, 2010), ensuring comprehensive coverage. SUBTLEX-AR provides information about the statistical properties of Arabic words at the orthographic, phonological, morphological, and semantic levels. The database also includes information on sub-word structure properties like bigram and trigram frequencies, as well as lemmas and part-of-speech information along with their corresponding frequencies. The online interface of SUBTLEX-AR allows users either to upload a set of words to receive their properties or to receive a set of words matching constraints on predefined properties. The properties themselves are easily extensible and will be expanded over time. SUBTLEX-AR is freely accessible here: https://subtlexar.uaeu.ac.ae/.
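As a rough illustration of the sub-word statistics mentioned above, the following minimal sketch computes token-weighted bigram and trigram counts from a word-frequency table and converts a raw count to a per-million value. It is not the authors' pipeline: the function names, the toy word counts, and the assumption that each word is an unvowelled string with one letter per code point are all hypothetical. The corpus size of 160 million tokens is simply the sum of the two token counts given in the abstract.

```python
from collections import Counter


def ngram_frequencies(word_counts, n):
    """Return token-weighted n-gram counts.

    Each letter n-gram inside a word is counted once per corpus
    occurrence of that word. Assumes unvowelled words, so that one
    code point corresponds to one letter (an assumption of this sketch).
    """
    totals = Counter()
    for word, count in word_counts.items():
        for i in range(len(word) - n + 1):
            totals[word[i:i + n]] += count
    return totals


def per_million(count, corpus_size):
    """Scale a raw count to occurrences per million tokens."""
    return count * 1_000_000 / corpus_size


# Hypothetical toy counts, not values taken from SUBTLEX-AR.
word_counts = {"كتب": 30, "كتاب": 50, "مكتب": 20}

bigrams = ngram_frequencies(word_counts, 2)
trigrams = ngram_frequencies(word_counts, 3)

# 120 million subtitle tokens + 40 million newspaper tokens (from the abstract).
CORPUS_SIZE = 160_000_000
shared_bigram_pm = per_million(bigrams["كت"], CORPUS_SIZE)
```

In this toy table the bigram "كت" occurs in all three words, so its token-weighted count is the sum of their counts. A type-based variant would instead add 1 per distinct word. Which weighting SUBTLEX-AR reports is not stated in the abstract.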