Mengliev Davlatyor, Barakhnin Vladimir, Abdurakhmonova Nilufar, Eshkulov Mukhriddin
Urgench branch of Tashkent University of Information Technologies named after Muhammad al-Khwarizmi, 110, al-Khwarizmi str., 220100, Urgench city, Uzbekistan.
Novosibirsk State University, 2, Pirogova str., Novosibirsk city, 630090, Russia.
Data Brief. 2024 Apr 16;54:110413. doi: 10.1016/j.dib.2024.110413. eCollection 2024 Jun.
This paper presents a dataset and approaches to named entity recognition (NER) in the Uzbek language, a low-resource language setting. Despite the growth of natural language processing (NLP) applications, Uzbek remains underrepresented, which underscores the importance of our work. Our dataset includes 1,160 sentences with nearly 19,000 word forms annotated for parts of speech and named entities, making it a valuable resource for linguistic research and machine learning applications in Uzbek. For practical application and experiments, the authors have developed two algorithms that use this dictionary to identify named entities in Uzbek-language texts. The authors also describe the methodology for creating the dataset, the design of the algorithms, and their application to the Uzbek language. This study not only provides an important dataset for future NER tasks in Uzbek, but also offers a methodological basis for using vocabulary-based or machine-learning-based NER in other low-resource languages (e.g., Karakalpak). The dataset and algorithms can be used to build applications such as improved chatbot systems, text mining applications, and other analytical tools for the Uzbek language, supporting the development of these areas in the regions where such solutions are deployed.
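To illustrate what vocabulary-based NER can look like in general, here is a minimal sketch of a longest-match dictionary lookup. It is not the authors' algorithm, which the abstract does not specify. The gazetteer entries, the labels `PER` and `LOC`, and the function names `tokenize` and `tag_entities` are all hypothetical and chosen only for illustration; none of them is taken from the published dataset.

```python
import re

# Hypothetical gazetteer: entries and labels are illustrative assumptions,
# not taken from the published dataset.
GAZETTEER = {
    ("toshkent",): "LOC",
    ("urganch",): "LOC",
    ("alisher", "navoiy"): "PER",
}
MAX_LEN = max(len(key) for key in GAZETTEER)


def tokenize(text):
    # Keep word characters plus the apostrophe variants used in Uzbek Latin script.
    return re.findall(r"[\w'’ʻ]+", text)


def tag_entities(text):
    """Return (surface form, label) pairs found by longest-match lookup."""
    tokens = tokenize(text)
    entities = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in GAZETTEER:
                entities.append((" ".join(tokens[i:i + n]), GAZETTEER[key]))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return entities


print(tag_entities("Alisher Navoiy, Toshkent va Urganch"))
```

This sketch matches exact word forms only. Uzbek is agglutinative, so a practical system would likely also need to handle suffixed forms, for example through stemming or by listing inflected word forms in the dictionary.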