
SUBTLEX-AR: Arabic word distributional characteristics based on movie subtitles.

Authors

Boudelaa Sami, Carreiras Manuel, Jariya Nazrin, Perea Manuel

Affiliations

Department of Cognitive Sciences, United Arab Emirates University, Al Ain, 15551, UAE.

Basque Center for Cognition, Brain, and Language, Donostia-San Sebastián, Spain.

Publication information

Behav Res Methods. 2025 Feb 26;57(4):104. doi: 10.3758/s13428-024-02560-8.

Abstract

This article presents SUBTLEX-AR, a digital database providing an extensive collection of attributes related to Modern Standard Arabic words (Arabic for short). SUBTLEX-AR combines a novel dataset of 120 million word tokens from movie subtitles with 40 million tokens from newspaper articles originally collected in ARALEX (Boudelaa & Marslen-Wilson, Behavior Research Methods, 42, 481-487, 2010), ensuring comprehensive coverage. SUBTLEX-AR provides information about the statistical properties of Arabic words at the orthographic, phonological, morphological, and semantic levels. The database also includes information on sub-word structure properties like bigram and trigram frequencies, as well as lemmas and part-of-speech information along with their corresponding frequencies. The online interface of SUBTLEX-AR allows users either to upload a set of words to receive their properties or to receive a set of words matching constraints on predefined properties. The properties themselves are easily extensible and will be expanded over time. SUBTLEX-AR is freely accessible here: https://subtlexar.uaeu.ac.ae/.
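As a rough illustration of the sub-word measures mentioned in the abstract, the sketch below counts letter bigrams and trigrams over a small word list, weighting each n-gram by the token frequency of the word it occurs in. The abstract does not say how SUBTLEX-AR computes these values, so the weighting scheme, the function name `ngram_counts`, and the toy frequencies are all assumptions made for illustration, not the authors' method or data.

```python
from collections import Counter


def ngram_counts(word_freqs, n):
    """Count letter n-grams, weighting each by its word's token frequency.

    Hypothetical helper: token weighting is an assumption, not a
    documented SUBTLEX-AR procedure.
    """
    counts = Counter()
    for word, freq in word_freqs.items():
        # Slide a window of n letters across the word.
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += freq
    return counts


# Toy frequencies invented for this example (not taken from SUBTLEX-AR).
toy = {"كتب": 10, "كتاب": 5, "مكتب": 2}

bigrams = ngram_counts(toy, 2)
trigrams = ngram_counts(toy, 3)

for gram, count in bigrams.most_common(3):
    print(gram, count)
```

The sketch treats every Unicode code point as one letter. Arabic short-vowel diacritics are separate combining code points, so vocalized text would need to be stripped or normalized before counting.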

