A comprehensive dataset and neural network approach for named entity recognition in the Uzbek language.

Author information

Mengliev Davlatyor, Barakhnin Vladimir, Eshkulov Mukhriddin, Ibragimov Bahodir, Madirimov Shohrux

Affiliations

Urgench branch of Tashkent University of Information Technologies named after Muhammad al-Khwarizmi, 110, al-Khwarizmi str., 220100 Urgench city, Uzbekistan.

Novosibirsk State University, 2, Pirogova str., Novosibirsk city 630090, Russia.

Publication information

Data Brief. 2024 Dec 19;58:111249. doi: 10.1016/j.dib.2024.111249. eCollection 2025 Feb.

DOI:10.1016/j.dib.2024.111249

PMID:39811531

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11732609/

Abstract

This study presents a dataset for named entity recognition (NER) in the Uzbek language. The dataset comprises 2000 sentences and 25,865 words; its sources are legal documents and hand-crafted sentences, annotated using the BIOES scheme. To demonstrate an application of the dataset, the authors trained a language model with a CNN + LSTM architecture, which achieved an F1 score of 90.8 %, precision of 93.9 %, and recall of 88.0 % on the test set. The dataset and trained model contribute to the development of natural language processing for Uzbek. The authors also review existing work and provide a comparative analysis that helps identify the distinctive features and novelty of the proposed work. In conclusion, they outline possible directions for further development: scaling up the dataset and using other neural network architectures.
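The abstract names the BIOES scheme and reports precision, recall, and F1 without showing either in practice. The sketch below is only an illustration under stated assumptions, not the authors' code: the example sentence, the `PER` and `LOC` labels, and the helper names `spans_to_bioes` and `f1_score` are hypothetical and do not come from the dataset. It shows how entity spans map to BIOES tags (B = begin, I = inside, E = end, S = single-token entity, O = outside) and how F1 follows from precision and recall as their harmonic mean.

```python
from typing import List, Tuple

# (start index, end index exclusive, label); the labels are hypothetical.
Span = Tuple[int, int, str]


def spans_to_bioes(n_tokens: int, spans: List[Span]) -> List[str]:
    """Convert non-overlapping entity spans into one BIOES tag per token."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        if end - start == 1:
            # A single-token entity gets the S- prefix.
            tags[start] = f"S-{label}"
        else:
            # A multi-token entity: B- first, I- in between, E- last.
            tags[start] = f"B-{label}"
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"
            tags[end - 1] = f"E-{label}"
    return tags


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative sentence (not taken from the dataset):
# "Anvar Karimov works in the city of Urgench."
tokens = ["Anvar", "Karimov", "Urganch", "shahrida", "ishlaydi"]
spans = [(0, 2, "PER"), (2, 3, "LOC")]
tags = spans_to_bioes(len(tokens), spans)

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")

# Plugging in the precision and recall reported in the abstract.
print(round(f1_score(0.939, 0.880), 3))
```

With the rounded inputs 0.939 and 0.880 the formula gives roughly 0.909, which agrees with the reported 90.8 % to within the rounding of the published precision and recall.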

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/bc42/11732609/6aeff2780844/gr1.jpg

Similar articles

A comprehensive dataset and neural network approach for named entity recognition in the Uzbek language.

Data Brief. 2024 Dec 19;58:111249. doi: 10.1016/j.dib.2024.111249. eCollection 2025 Feb.

Developing named entity recognition algorithms for Uzbek: Dataset insights and implementation.

Data Brief. 2024 Apr 16;54:110413. doi: 10.1016/j.dib.2024.110413. eCollection 2024 Jun.

Development of Language Models for Continuous Uzbek Speech Recognition System.

Sensors (Basel). 2023 Jan 19;23(3):1145. doi: 10.3390/s23031145.

Evaluation of clinical named entity recognition methods for Serbian electronic health records.

Int J Med Inform. 2022 Aug;164:104805. doi: 10.1016/j.ijmedinf.2022.104805. Epub 2022 May 25.

Sample Size Considerations for Fine-Tuning Large Language Models for Named Entity Recognition Tasks: Methodological Study.

JMIR AI. 2024 May 16;3:e52095. doi: 10.2196/52095.

Automatic Speech Recognition Method Based on Deep Learning Approaches for Uzbek Language.

Sensors (Basel). 2022 May 12;22(10):3683. doi: 10.3390/s22103683.

A novel Data and Model Centric artificial intelligence based approach in developing high-performance Named Entity Recognition for Bengali Language.

PLoS One. 2023 Sep 22;18(9):e0287818. doi: 10.1371/journal.pone.0287818. eCollection 2023.

Using Synthetic Health Care Data to Leverage Large Language Models for Named Entity Recognition: Development and Validation Study.

J Med Internet Res. 2025 Mar 18;27:e66279. doi: 10.2196/66279.

Putting hands to rest: efficient deep CNN-RNN architecture for chemical named entity recognition with no hand-crafted rules.

J Cheminform. 2018 May 23;10(1):28. doi: 10.1186/s13321-018-0280-0.

Research on Chinese medical named entity recognition based on collaborative cooperation of multiple neural network models.

J Biomed Inform. 2020 Apr;104:103395. doi: 10.1016/j.jbi.2020.103395. Epub 2020 Feb 25.

Cited by

An annotated morphological dataset for Uzbek word forms: Towards rule-based and machine learning approaches.

Data Brief. 2025 May 26;61:111702. doi: 10.1016/j.dib.2025.111702. eCollection 2025 Aug.

References

Developing named entity recognition algorithms for Uzbek: Dataset insights and implementation.

Data Brief. 2024 Apr 16;54:110413. doi: 10.1016/j.dib.2024.110413. eCollection 2024 Jun.

Parallel texts dataset for Uzbek-Kazakh machine translation.

Data Brief. 2024 Feb 15;53:110194. doi: 10.1016/j.dib.2024.110194. eCollection 2024 Apr.

Dataset of Karakalpak language stop words.

Data Brief. 2023 Apr 5;48:109111. doi: 10.1016/j.dib.2023.109111. eCollection 2023 Jun.

Dataset of stopwords extracted from Uzbek texts.

Data Brief. 2022 Jun 3;43:108351. doi: 10.1016/j.dib.2022.108351. eCollection 2022 Aug.
