Abdurakhmonova Nilufar, Shirinova Raima, Sayfullayeva Rano, Mengliev Davlatyor, Ibragimov Bahodir, Ernazarova Manzura
National University of Uzbekistan named after Mirzo Ulugbek, 4, University str., Tashkent city 100174, Uzbekistan.
Urgench branch of Tashkent University of Information Technologies named after Muhammad al-Khwarizmi, 110, al-Khwarizmi str., Urgench city 220100, Uzbekistan.
Data Brief. 2025 May 26;61:111702. doi: 10.1016/j.dib.2025.111702. eCollection 2025 Aug.
This research paper presents a morphologically annotated dataset for the Uzbek language, designed for use with morphological analysis algorithms. The dataset contains 3022 manually annotated word forms, each labeled with its root, affixes, and part of speech. Two morphological analysis approaches were implemented and compared: a custom rule-based stemming algorithm and a machine learning model based on conditional random fields (CRF). In addition, genre testing was conducted on legal, political-economic, and educational texts to assess generalizability. The dataset is publicly available in Excel format and is intended as a base resource for further research in Uzbek natural language processing, including applications in text generation, semantic analysis, and grammar correction.
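To illustrate the kind of root and affix segmentation a rule-based stemmer produces, the following is a minimal sketch of greedy suffix stripping. It is not the algorithm from the paper: the suffix list (a handful of common Uzbek plural and case endings), the minimum root length, and the function name `split_word` are all assumptions chosen for illustration.

```python
# Illustrative suffix inventory (assumption): a few common Uzbek plural
# and case endings. This is NOT the rule set used in the paper.
SUFFIXES = ["ning", "lar", "dan", "ni", "ga", "da"]

# Assumed guard against over-stripping very short words.
MIN_ROOT_LEN = 3


def split_word(word):
    """Split a word form into (root, affixes) by greedy suffix stripping.

    Suffixes are removed from the end of the word one at a time, longest
    candidate first, until no listed suffix matches or the remaining root
    would become shorter than MIN_ROOT_LEN.
    """
    root = word.lower()
    affixes = []
    stripped = True
    while stripped:
        stripped = False
        for suffix in sorted(SUFFIXES, key=len, reverse=True):
            if root.endswith(suffix) and len(root) - len(suffix) >= MIN_ROOT_LEN:
                root = root[: -len(suffix)]
                # Suffixes are found right to left, so prepend to keep
                # them in surface order.
                affixes.insert(0, suffix)
                stripped = True
                break
    return root, affixes


if __name__ == "__main__":
    for form in ["kitoblarni", "maktabda"]:
        print(form, split_word(form))
```

Under these assumed rules, `kitoblarni` is split into the root `kitob` with the affixes `lar` and `ni`. A purely list-driven stripper like this cannot tell a true suffix from a root that happens to end in the same letters, which is one reason to compare it against a learned sequence model such as a CRF.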