Developing named entity recognition algorithms for Uzbek: Dataset insights and implementation.

Authors

Mengliev Davlatyor, Barakhnin Vladimir, Abdurakhmonova Nilufar, Eshkulov Mukhriddin

Affiliations

Urgench branch of Tashkent University of Information Technologies named after Muhammad al-Khwarizmi, 110, al-Khwarizmi str., 220100, Urgench city, Uzbekistan.

Novosibirsk State University, 2, Pirogova str., Novosibirsk city, 630090, Russia.

Publication information

Data Brief. 2024 Apr 16;54:110413. doi: 10.1016/j.dib.2024.110413. eCollection 2024 Jun.

DOI:10.1016/j.dib.2024.110413

PMID:38708296

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11067374/

Abstract

This paper presents a dataset and approaches to named entity recognition (NER) in the Uzbek language, a resource-constrained language environment. Despite the growth in NLP applications, Uzbek remains underrepresented, which underscores the importance of our work. Our dataset includes 1,160 sentences with nearly 19,000 word forms annotated for parts of speech and named entities, making it a valuable resource for linguistic research and machine learning applications in Uzbek. For practical application and experiments, the authors have also developed two algorithms that use this dictionary to identify named entities in Uzbek-language texts. The authors further describe the methodology for creating the dataset, the design of the algorithms, and their application to the Uzbek language. This study not only provides an important dataset for future NER tasks in Uzbek, but also offers a methodological basis for vocabulary-based or machine-learning NER in other low-resource languages (e.g. Karakalpak). The dataset and algorithms can be used to build applications such as improved chatbot systems, text mining applications, and other analytical tools for the Uzbek language, supporting the development of these areas in the regions the solutions are built for.
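To make the idea of vocabulary-based NER concrete, the following is a minimal sketch of a greedy longest-match dictionary lookup. It is not the authors' algorithm: the toy lexicon, the tag names (`PER`, `LOC`), and the function `tag_entities` are all assumptions introduced for illustration, and the real dataset's format and labels may differ.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon (assumption): lowercase token tuples mapped to an
# entity tag. The published dataset's entries and tag set may differ.
LEXICON: Dict[Tuple[str, ...], str] = {
    ("toshkent",): "LOC",
    ("o'zbekiston",): "LOC",
    ("alisher", "navoiy"): "PER",
}
MAX_LEN = max(len(key) for key in LEXICON)


def tag_entities(tokens: List[str]) -> List[Tuple[str, str]]:
    """Return (entity text, tag) pairs found by greedy longest-match lookup."""
    found: List[Tuple[str, str]] = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, then shrink it.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in LEXICON:
                found.append((" ".join(tokens[i:i + n]), LEXICON[key]))
                i += n
                break
        else:
            # No lexicon entry starts at this token; move on.
            i += 1
    return found


if __name__ == "__main__":
    print(tag_entities("Alisher Navoiy Toshkent shahrida yashagan".split()))
```

Because Uzbek is agglutinative, an exact-match lookup like this would miss suffixed forms such as "Toshkentda"; a practical system would need stemming or suffix handling, which this sketch deliberately leaves out.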


Similar articles

Developing named entity recognition algorithms for Uzbek: Dataset insights and implementation.

Data Brief. 2024 Apr 16;54:110413. doi: 10.1016/j.dib.2024.110413. eCollection 2024 Jun.

Development of Language Models for Continuous Uzbek Speech Recognition System.

Sensors (Basel). 2023 Jan 19;23(3):1145. doi: 10.3390/s23031145.

Parallel texts dataset for Uzbek-Kazakh machine translation.

Data Brief. 2024 Feb 15;53:110194. doi: 10.1016/j.dib.2024.110194. eCollection 2024 Apr.

Balinese story texts dataset for narrative text analyses.

Data Brief. 2024 Aug 8;56:110781. doi: 10.1016/j.dib.2024.110781. eCollection 2024 Oct.

Automatic Speech Recognition Method Based on Deep Learning Approaches for Uzbek Language.

Sensors (Basel). 2022 May 12;22(10):3683. doi: 10.3390/s22103683.

Assessment of disease named entity recognition on a corpus of annotated sentences.

BMC Bioinformatics. 2008 Apr 11;9 Suppl 3(Suppl 3):S3. doi: 10.1186/1471-2105-9-S3-S3.

Europe PMC annotated full-text corpus for gene/proteins, diseases and organisms.

Sci Data. 2023 Oct 19;10(1):722. doi: 10.1038/s41597-023-02617-x.

Analyzing transfer learning impact in biomedical cross-lingual named entity recognition and normalization.

BMC Bioinformatics. 2021 Dec 17;22(Suppl 1):601. doi: 10.1186/s12859-021-04247-9.

FoodBase corpus: a new resource of annotated food entities.

Database (Oxford). 2019 Jan 1;2019. doi: 10.1093/database/baz121.

DarNERcorp: An annotated named entity recognition dataset in the Moroccan dialect.

Data Brief. 2023 May 12;48:109234. doi: 10.1016/j.dib.2023.109234. eCollection 2023 Jun.

Cited by

An annotated morphological dataset for Uzbek word forms: Towards rule-based and machine learning approaches.

Data Brief. 2025 May 26;61:111702. doi: 10.1016/j.dib.2025.111702. eCollection 2025 Aug.

Dataset of Uzbek verbs with formation and suffixes.

Data Brief. 2025 May 30;61:111731. doi: 10.1016/j.dib.2025.111731. eCollection 2025 Aug.

A comprehensive dataset and neural network approach for named entity recognition in the Uzbek language.

Data Brief. 2024 Dec 19;58:111249. doi: 10.1016/j.dib.2024.111249. eCollection 2025 Feb.

References

Dataset of Karakalpak language stop words.

Data Brief. 2023 Apr 5;48:109111. doi: 10.1016/j.dib.2023.109111. eCollection 2023 Jun.

Dataset of stopwords extracted from Uzbek texts.

Data Brief. 2022 Jun 3;43:108351. doi: 10.1016/j.dib.2022.108351. eCollection 2022 Aug.
