Badawi Soran, Saeed Ari M, Ahmed Sara A, Abdalla Peshraw Ahmed, Hassan Diyari A
Language Center, Charmo University, KRG, Chamchamal, Kurdistan, Iraq.
Computer Science Department, University of Halabja, KRG, Halabja, Kurdistan, Iraq.
Data Brief. 2023 Apr 13;48:109120. doi: 10.1016/j.dib.2023.109120. eCollection 2023 Jun.
The rapid growth of technology has massively increased the amount of available text data. This data can be mined and used for numerous natural language processing (NLP) tasks, particularly text classification. A core part of text classification is collecting the data needed to train a good model. This paper presents the Kurdish News Dataset Headlines (KNDH), collected for text classification. The dataset consists of 50,000 news headlines distributed equally among five classes (Social, Sport, Health, Economic, and Technology), with 10,000 headlines per class. The proportion of headlines contributed by each channel differs, while the number of samples is equal for every category. The headlines were collected from 34 distinct channels: 8 channels for economics, 14 for health, 18 for science, 15 for social, and 5 for sport. The dataset is processed with the Kurdish Language Processing Toolkit (KLPT) for tokenization, spell-checking, stemming, and preprocessing.
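The figures quoted in the abstract can be sanity-checked with a few lines of Python. The sketch below is illustrative only: it builds a synthetic label list instead of loading the real KNDH files, and the names `is_balanced`, `labels`, and `CHANNELS_PER_CLASS` are hypothetical. It also makes one detail explicit: the per-class channel counts sum to 60 while only 34 distinct channels are reported, so some channels presumably contribute headlines to more than one class.

```python
from collections import Counter

# Figures taken from the abstract.
TOTAL_HEADLINES = 50_000
PER_CLASS = 10_000
CLASSES = ["Social", "Sport", "Health", "Economic", "Technology"]
DISTINCT_CHANNELS = 34
# The abstract's channel list says "science" where the class list says "Technology".
CHANNELS_PER_CLASS = {"economics": 8, "health": 14, "science": 18, "social": 15, "sport": 5}


def is_balanced(labels, classes, per_class):
    """Return True if every class occurs exactly per_class times and no other label appears."""
    counts = Counter(labels)
    return set(counts) == set(classes) and all(counts[c] == per_class for c in classes)


# Synthetic stand-in for the dataset's label column (assumption: not the real data).
labels = [c for c in CLASSES for _ in range(PER_CLASS)]

balanced = is_balanced(labels, CLASSES, PER_CLASS)

# Sum of per-class channel counts versus the number of distinct channels.
channel_slots = sum(CHANNELS_PER_CLASS.values())
shared_slots = channel_slots - DISTINCT_CHANNELS

print(f"balanced: {balanced}, total headlines: {len(labels)}")
print(f"per-class channel counts sum to {channel_slots}; "
      f"{shared_slots} more than the {DISTINCT_CHANNELS} distinct channels")
```

With the real dataset, `labels` would be replaced by the label column read from the published files; the same balance check then applies unchanged.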