Rostam Payman Sabr, Nabi Rebwar Mala
Information Technology Department, Technical College of Informatics, Sulaimani Polytechnic University, Sulaimani, Kurdistan Region, Iraq.
Information Technology Department, Kurdistan Technical Institute, Sulaimani, Kurdistan Region, Iraq.
Data Brief. 2025 Jun 25;61:111839. doi: 10.1016/j.dib.2025.111839. eCollection 2025 Aug.
This research presents the first high-quality, automatically annotated stance detection dataset for Kurdish in the Sorani dialect, addressing the lack of annotated resources for Kurdish, a low-resource language in Natural Language Processing (NLP). The dataset consists of 2,174 Kurdish news articles (1,410 economic and 764 political) published in 2024 and 2025, so the content is recent and topically relevant. The texts were selected from well-known Kurdish news agencies to preserve content validity and linguistic purity, and the necessary preprocessing steps were applied. Annotation was carried out in two steps. First, a pattern-recognition method using 2,456 phrases and keywords determined whether the subject of each text fell into the economics or politics category. Next, the stance of each article was annotated using an extended lexicon of 4,243 adjectives and verbs, categorized as support, oppose, or neutral. Where direct matches were not possible, semantic similarity and zero-shot classification served as fallback measures. To verify the automatic annotation, a team of domain experts manually assessed a representative sample of the annotated texts, and a high inter-annotator agreement score confirmed the validity of the approach. The dataset is provided in XLSX (Excel) format for ease of use across a variety of NLP research tasks. As an annotated and organized corpus, it offers a solid starting point for researchers building Kurdish language processing models. The dataset is publicly released so that other researchers can build on it and advance NLP system performance on low-resource languages.
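The two-step annotation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The tiny English lexicons are hypothetical stand-ins for the 2,456 topic phrases and 4,243 Sorani stance terms, and the function names, the tie-handling rule, and the `fallback` hook (a placeholder for the semantic-similarity and zero-shot steps) are all assumptions made for illustration.

```python
# Illustrative sketch only: toy English lexicons stand in for the real
# Sorani resources, which are not reproduced here.
TOPIC_KEYWORDS = {
    "economic": {"budget", "market", "oil price", "salary"},
    "political": {"parliament", "election", "minister", "party"},
}
STANCE_LEXICON = {
    "support": {"welcomed", "praised", "approved"},
    "oppose": {"rejected", "criticized", "condemned"},
    "neutral": {"stated", "announced", "reported"},
}


def count_matches(text, phrases):
    """Count substring occurrences of each phrase in the text."""
    lowered = text.lower()
    return sum(lowered.count(phrase) for phrase in phrases)


def classify_topic(text):
    """Step 1: pick the topic whose keywords match most often, or None."""
    scores = {t: count_matches(text, kws) for t, kws in TOPIC_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def classify_stance(text, fallback=None):
    """Step 2: lexicon-based stance, with an optional fallback callable.

    Returns (label, source). The fallback is used when no lexicon term
    matches; sending ties to the fallback as well is an assumption of
    this sketch, not something stated in the source.
    """
    scores = {s: count_matches(text, terms) for s, terms in STANCE_LEXICON.items()}
    top = max(scores.values())
    winners = [label for label, score in scores.items() if score == top]
    if top > 0 and len(winners) == 1:
        return winners[0], "lexicon"
    if fallback is not None:
        return fallback(text), "fallback"
    return "neutral", "default"


def annotate(text, fallback=None):
    """Run both steps and return one annotation record."""
    stance, source = classify_stance(text, fallback)
    return {"topic": classify_topic(text), "stance": stance, "source": source}
```

Calling `annotate("The minister praised and welcomed the new budget")` in this toy setup labels the stance as `support` from the lexicon alone. A text with no lexicon hits is passed to `fallback` if one is supplied, which is where a similarity model or zero-shot classifier would plug in.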