Developing named entity recognition algorithms for Uzbek: Dataset insights and implementation.

Authors

Mengliev Davlatyor, Barakhnin Vladimir, Abdurakhmonova Nilufar, Eshkulov Mukhriddin

Affiliations

Urgench branch of Tashkent University of Information Technologies named after Muhammad al-Khwarizmi, 110, al-Khwarizmi str., 220100, Urgench city, Uzbekistan.

Novosibirsk State University, 2, Pirogova str., Novosibirsk city, 630090, Russia.

Publication information

Data Brief. 2024 Apr 16;54:110413. doi: 10.1016/j.dib.2024.110413. eCollection 2024 Jun.

Abstract

This paper presents a dataset and approaches to named entity recognition (NER) for the Uzbek language, a low-resource language setting. Despite the growth of natural language processing (NLP) applications, Uzbek remains underrepresented, which underscores the importance of this work. The dataset comprises 1,160 sentences with nearly 19,000 word forms annotated for parts of speech and named entities, making it a valuable resource for linguistic research and machine learning applications in Uzbek. For practical application and experiments, we also developed two algorithms that, using this dataset as a dictionary, identify named entities in Uzbek-language texts. We describe the methodology for creating the dataset, the design of the algorithms, and their application to Uzbek. The study not only provides an important dataset for future NER tasks in Uzbek but also offers a methodological basis for vocabulary-based or machine-learning-based NER in other low-resource languages (e.g., Karakalpak). The dataset and algorithms can be used to build applications such as improved chatbot systems, text mining applications, and other analytical tools for the Uzbek language, supporting the development of these areas in the regions where such solutions are deployed.
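To make the idea of vocabulary-based NER concrete, the following is a minimal Python sketch of a longest-match dictionary lookup. It is not the authors' algorithm, which the abstract does not specify. The toy lexicon, the tag names (`PER`, `LOC`), the short list of case suffixes, and the function names are all assumptions introduced only for illustration; the real dataset's format and tag set may differ.

```python
import re

# Hypothetical toy lexicon: lowercased token tuples mapped to entity tags.
# The actual dataset's entries, format, and tag set may differ.
LEXICON = {
    ("toshkent",): "LOC",
    ("samarqand",): "LOC",
    ("alisher", "navoiy"): "PER",
}

# Illustrative subset of Uzbek case suffixes (an assumption, not exhaustive).
# Longer suffixes come first so that "dan" is tried before "da".
SUFFIXES = ("ning", "dan", "da", "ga", "ni")

MAX_LEN = max(len(key) for key in LEXICON)


def tokenize(text):
    """Split text into word tokens, keeping apostrophes inside words."""
    return re.findall(r"[^\W\d_]+(?:['ʻ’][^\W\d_]+)*", text)


def strip_suffix(token):
    """Remove one known case suffix, if the remaining stem is long enough."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def find_entities(text):
    """Return (surface form, tag) pairs using greedy longest-match lookup."""
    tokens = tokenize(text)
    lowered = [t.lower() for t in tokens]
    entities = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            window = lowered[i : i + n]
            stripped = window[:-1] + [strip_suffix(window[-1])]
            for candidate in (tuple(window), tuple(stripped)):
                if candidate in LEXICON:
                    match = (" ".join(tokens[i : i + n]), LEXICON[candidate])
                    break
            if match:
                break
        if match:
            entities.append(match)
            i += n  # skip past the matched span
        else:
            i += 1
    return entities


if __name__ == "__main__":
    print(find_entities("Alisher Navoiy haqida kitob Toshkentda chop etildi."))
```

Greedy longest-match lookup keeps multi-word names together, and the suffix step is a crude stand-in for the morphological analysis that an agglutinative language such as Uzbek would need in practice.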

