An annotated morphological dataset for Uzbek word forms: Towards rule-based and machine learning approaches.

Authors

Abdurakhmonova Nilufar, Shirinova Raima, Sayfullayeva Rano, Mengliev Davlatyor, Ibragimov Bahodir, Ernazarova Manzura

Affiliations

National University of Uzbekistan named after Mirzo Ulugbek, 4, University str., Tashkent city 100174, Uzbekistan.

Urgench branch of Tashkent University of Information Technologies named after Muhammad al-Khwarizmi, 110, al-Khwarizmi str., Urgench city 220100, Uzbekistan.

Publication information

Data Brief. 2025 May 26;61:111702. doi: 10.1016/j.dib.2025.111702. eCollection 2025 Aug.

DOI:10.1016/j.dib.2025.111702

PMID:40861398

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC12372940/

Abstract

This paper presents a morphologically annotated dataset for the Uzbek language, designed for morphological analysis algorithms. The dataset contains 3022 manually annotated word forms, each labeled with its root, affixes, and part of speech. Two morphological analysis approaches were implemented and compared: a user-defined rule-based stemming algorithm and a machine learning model based on conditional random fields (CRF). In addition, comprehensive genre testing was conducted on legal, political-economic, and educational texts to assess generalizability. The dataset is publicly available in Excel format and is intended as a base resource for further research on Uzbek natural language processing, including applications in text generation, semantic analysis, and grammar correction.
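To make the rule-based side of the comparison concrete, the following is a minimal sketch of right-to-left suffix stripping, the general technique behind rule-based stemmers for agglutinative languages such as Uzbek. It is not the authors' algorithm: the suffix inventory, the function name `strip_suffixes`, and the `min_root_len` guard are all assumptions made for illustration, and the list covers only a handful of common noun suffixes (plural, possessive, and case).

```python
# Hypothetical, deliberately tiny suffix inventory (plural, possessive, case).
# This is an illustrative assumption, not the rule set used in the paper.
SUFFIXES = ["lar", "imiz", "ingiz", "ning", "dan", "ni", "ga", "da", "im", "ing"]


def strip_suffixes(word, min_root_len=3):
    """Peel known suffixes off the right edge of a word, longest match first.

    Returns (root, affixes), with affixes listed in surface order.
    min_root_len is an assumed guard against stripping a root down to nothing.
    """
    root = word.lower()
    affixes = []
    changed = True
    while changed:
        changed = False
        # Try longer suffixes first so "ning" is preferred over "ing".
        for suffix in sorted(SUFFIXES, key=len, reverse=True):
            if root.endswith(suffix) and len(root) - len(suffix) >= min_root_len:
                root = root[: -len(suffix)]
                affixes.insert(0, suffix)  # keep left-to-right surface order
                changed = True
                break
    return root, affixes


if __name__ == "__main__":
    # "kitoblarimizdan" = kitob + lar + imiz + dan ("from our books")
    print(strip_suffixes("kitoblarimizdan"))
    print(strip_suffixes("maktabda"))
```

A purely greedy stripper like this will over-stem any root that happens to end in a listed suffix, which is one reason to evaluate such rules against manually annotated root and affix labels, and to compare them with a sequence model such as a CRF.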

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3517/12372940/6785cceb4d5f/gr1.jpg
