Adyan: automated annotating named entity recognition dataset for Sorani Kurdish language.

Authors

Wahid Chovyan H, Nabi Rebwar M

Affiliations

Computer Department, College of Science, Charmo University, Sulaymaniyah, Iraq.

Information Technology Department, Kurdistan Technical Institute, Sulaymaniyah, Kurdistan Region, Iraq.

Publication information

Data Brief. 2025 Aug 21;62:111999. doi: 10.1016/j.dib.2025.111999. eCollection 2025 Oct.

Abstract

This paper introduces the first high-quality, automatically annotated Sorani Kurdish Named Entity Recognition (NER) dataset, addressing the lack of annotated resources for Kurdish, a low-resource language in Natural Language Processing (NLP). The corpus was collected from publicly available Kurdish news articles published in 2024 to ensure its relevance to contemporary language use. It spans a wide range of domains, including politics, economics, sports, health, culture, interviews, and technology, providing comprehensive coverage of named entities across various contexts. To ensure the trustworthiness of the content, the news articles were carefully selected from accredited Kurdish outlets. Annotation was performed using a lexicon-based approach, leveraging a pre-defined lexicon to maintain consistency and accuracy. The dataset was preprocessed as follows: entities were labeled using seed words from the pre-defined lexicon, and the BIO (Begin, Inside, Outside) tagging scheme was applied to ensure compatibility with widely used NER models. The dataset is available in TXT format (.txt), making it readily accessible and flexible for use in a variety of research applications. The Adyan dataset can be utilized for multiple NLP tasks, including NER, sentiment analysis, machine translation, and text classification. It is publicly released to support ongoing research and development in NLP for low-resource languages.
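
The lexicon-based labeling and BIO tagging described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the lexicon entries, the entity type names (ORG, LOC), the greedy longest-match strategy, the whitespace tokenization, and the one-token-per-line tab-separated TXT layout are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical seed lexicon: token tuples mapped to an entity type.
# The entries and the type names (ORG, LOC) are illustrative assumptions,
# not taken from the Adyan dataset.
LEXICON: Dict[Tuple[str, ...], str] = {
    ("Charmo", "University"): "ORG",
    ("Sulaymaniyah",): "LOC",
}
MAX_LEN = max(len(key) for key in LEXICON)


def bio_tag(tokens: List[str]) -> List[Tuple[str, str]]:
    """Tag tokens with BIO labels using greedy longest-match lexicon lookup."""
    tagged: List[Tuple[str, str]] = []
    i = 0
    while i < len(tokens):
        match_len, label = 0, None
        # Try the longest candidate span first so multi-word entries win.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in LEXICON:
                match_len, label = n, LEXICON[candidate]
                break
        if label is None:
            # No lexicon entry starts here: the token is Outside any entity.
            tagged.append((tokens[i], "O"))
            i += 1
        else:
            # First token of the match gets B-, the remaining tokens get I-.
            tagged.append((tokens[i], f"B-{label}"))
            for tok in tokens[i + 1:i + match_len]:
                tagged.append((tok, f"I-{label}"))
            i += match_len
    return tagged


def to_txt(tagged: List[Tuple[str, str]]) -> str:
    """Serialize as one 'token<TAB>tag' pair per line (an assumed layout)."""
    return "\n".join(f"{tok}\t{tag}" for tok, tag in tagged)


if __name__ == "__main__":
    sentence = "Charmo University is in Sulaymaniyah".split()
    print(to_txt(bio_tag(sentence)))
```

Longest-match lookup is used here so that a multi-word lexicon entry is labeled as a single entity rather than as separate single-token matches. A real pipeline for Sorani Kurdish would also need script-aware tokenization and normalization, which this sketch leaves out.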

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/239c/12433466/32c221fb8d83/gr1.jpg
