Einea Omar, Elnagar Ashraf, Al Debsi Ridhwan
University of Sharjah, United Arab Emirates.
Data Brief. 2019 Jun 4;25:104076. doi: 10.1016/j.dib.2019.104076. eCollection 2019 Aug.
Text classification (also known as text categorization) is one of the most popular Natural Language Processing (NLP) tasks and has been an active research topic in recent years. However, the task has received much less attention for Arabic, owing to the lack of rich, representative resources for training an Arabic text classifier. We therefore introduce the Single-labeled Arabic News Articles Dataset (SANAD), a large collection of textual data gathered from three news portals. The dataset consists of almost 200k articles distributed across seven categories, and we offer it to the research community working on Arabic computational linguistics. We anticipate that this dataset will be a valuable aid for a variety of NLP tasks on Modern Standard Arabic (MSA) textual data, especially single-label text classification. We present the data in raw form. SANAD comprises three main datasets scraped from three news portals: AlKhaleej, AlArabiya, and Akhbarona. SANAD is public and freely available at https://data.mendeley.com/datasets/57zpx667y9.
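Because the data is released in raw form, a first step for single-label classification is to pair each article with its category. The following is a minimal sketch, not the authors' tooling. It assumes a hypothetical local layout of `<root>/<portal>/<category>/<article>.txt` with UTF-8 text files; the actual archive layout on Mendeley Data may differ, and the function names `load_sanad` and `label_counts` are illustrative only.

```python
from collections import Counter
from pathlib import Path


def load_sanad(root):
    """Collect (text, category, portal) triples.

    Assumption (not confirmed by the source): articles are stored as
    <root>/<portal>/<category>/<article>.txt, encoded as UTF-8.
    """
    samples = []
    # sorted() keeps the order deterministic across platforms.
    for path in sorted(Path(root).glob("*/*/*.txt")):
        category = path.parent.name        # folder holding the article
        portal = path.parent.parent.name   # e.g. AlKhaleej, AlArabiya, Akhbarona
        text = path.read_text(encoding="utf-8").strip()
        if text:                           # skip empty files
            samples.append((text, category, portal))
    return samples


def label_counts(samples):
    """Count articles per category to inspect class balance."""
    return Counter(category for _, category, _ in samples)
```

Calling `label_counts(load_sanad("path/to/SANAD"))` on an extracted copy would show how the articles are distributed over the categories, which is worth checking before training a classifier on them.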