SANAD: Single-label Arabic News Articles Dataset for automatic text categorization.

Authors

Einea Omar, Elnagar Ashraf, Al Debsi Ridhwan

Affiliation

University of Sharjah, United Arab Emirates.

Publication details

Data Brief. 2019 Jun 4;25:104076. doi: 10.1016/j.dib.2019.104076. eCollection 2019 Aug.

DOI:10.1016/j.dib.2019.104076

PMID:31440535

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC6700340/

Abstract

Text classification (also known as text categorization) is one of the most popular Natural Language Processing (NLP) tasks and has been an active research topic in recent years. However, much less attention has been directed towards this task in Arabic, owing to the lack of rich, representative resources for training an Arabic text classifier. We therefore introduce SANAD (Single-label Arabic News Articles Dataset), a large collection of textual data gathered from three news portals. The dataset consists of almost 200k articles distributed across seven categories, and we offer it to the research community working on Arabic computational linguistics. We anticipate that this dataset will be a valuable aid for a variety of NLP tasks on Modern Standard Arabic (MSA) text, especially single-label text classification. We present the data in raw form. SANAD is composed of three main datasets scraped from three news portals: AlKhaleej, AlArabiya, and Akhbarona. SANAD is public and freely available at https://data.mendeley.com/datasets/57zpx667y9.

Similar articles

SANAD: Single-label Arabic News Articles Dataset for automatic text categorization.

Data Brief. 2019 Jun 4;25:104076. doi: 10.1016/j.dib.2019.104076. eCollection 2019 Aug.

AHD: Arabic healthcare dataset.

Data Brief. 2024 Aug 22;56:110855. doi: 10.1016/j.dib.2024.110855. eCollection 2024 Oct.

AFND: Arabic fake news dataset for the detection and classification of articles credibility.

Data Brief. 2022 Apr 8;42:108141. doi: 10.1016/j.dib.2022.108141. eCollection 2022 Jun.

ANAD: Arabic news article dataset.

Data Brief. 2023 Jul 29;50:109460. doi: 10.1016/j.dib.2023.109460. eCollection 2023 Oct.

Arabic text classification: the need for multi-labeling systems.

Neural Comput Appl. 2022;34(2):1135-1159. doi: 10.1007/s00521-021-06390-z. Epub 2021 Sep 1.

Arabic Fake News Detection Based on Textual Analysis.

Arab J Sci Eng. 2022;47(8):10453-10469. doi: 10.1007/s13369-021-06449-y. Epub 2022 Feb 11.

Sanadset 650K: Data on Hadith narrators.

Data Brief. 2022 Aug 17;44:108540. doi: 10.1016/j.dib.2022.108540. eCollection 2022 Oct.

Cursive-Text: A Comprehensive Dataset for End-to-End Urdu Text Recognition in Natural Scene Images.

Data Brief. 2020 May 21;31:105749. doi: 10.1016/j.dib.2020.105749. eCollection 2020 Aug.

CLICK-ID: A novel dataset for Indonesian clickbait headlines.

Data Brief. 2020 Aug 27;32:106231. doi: 10.1016/j.dib.2020.106231. eCollection 2020 Oct.

An open-source dataset for Arabic fine-grained emotion recognition of online content amid COVID-19 pandemic.

Data Brief. 2023 Oct 31;51:109745. doi: 10.1016/j.dib.2023.109745. eCollection 2023 Dec.

Cited by

Open source Arabic research paper dataset for natural language processing.

Sci Rep. 2025 Aug 27;15(1):31631. doi: 10.1038/s41598-025-16647-5.

MuTCELM: An optimal multi-TextCNN-based ensemble learning for text classification.

Heliyon. 2024 Sep 30;10(19):e38515. doi: 10.1016/j.heliyon.2024.e38515. eCollection 2024 Oct 15.

Deep-GenMut: Automated genetic mutation classification in oncology: A deep learning comparative study.

Heliyon. 2024 May 31;10(11):e32279. doi: 10.1016/j.heliyon.2024.e32279. eCollection 2024 Jun 15.

Ar-DAD: Arabic diversified audio dataset.

Data Brief. 2020 Nov 7;33:106503. doi: 10.1016/j.dib.2020.106503. eCollection 2020 Dec.

Application of BERT to Enable Gene Classification Based on Clinical Evidence.

Biomed Res Int. 2020 Oct 7;2020:5491963. doi: 10.1155/2020/5491963. eCollection 2020.
