An annotated morphological dataset for Uzbek word forms: Towards rule-based and machine learning approaches.

Authors

Abdurakhmonova Nilufar, Shirinova Raima, Sayfullayeva Rano, Mengliev Davlatyor, Ibragimov Bahodir, Ernazarova Manzura

Affiliations

National University of Uzbekistan named after Mirzo Ulugbek, 4, University str., Tashkent city 100174, Uzbekistan.

Urgench branch of Tashkent University of Information Technologies named after Muhammad al-Khwarizmi, 110, al-Khwarizmi str., Urgench city 220100, Uzbekistan.

Publication information

Data Brief. 2025 May 26;61:111702. doi: 10.1016/j.dib.2025.111702. eCollection 2025 Aug.

Abstract

This research paper presents a morphologically annotated dataset for the Uzbek language, specifically designed for morphological analysis algorithms. The dataset contains 3022 manually annotated word forms, each annotated with root, affix, and part-of-speech information. Two morphological analysis approaches were implemented and compared: a user-defined rule-based stemming algorithm and a conditional random fields (CRF)-based machine learning model. Additionally, comprehensive genre testing was conducted on legal, political-economic, and educational texts to assess generalizability. The dataset is publicly available in Excel format and is intended as a base resource for further research in the field of natural language processing in Uzbek, including applications in text generation, semantic analysis, and grammar correction.
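
To make the idea of rule-based affix stripping concrete, here is a minimal Python sketch. It is not the authors' algorithm: the affix inventory, the `segment` function, and the `MIN_ROOT` threshold are hypothetical choices made for illustration, and the sketch covers only a handful of common Uzbek noun suffixes (plural, possessive, case). It relies on the fact that these suffixes attach in the order plural, possessive, case, so they can be peeled off from the end of the word in reverse.

```python
# Illustrative only: a tiny, hand-picked affix inventory, not the paper's rules.
# Within each slot, suffixes are listed longest first.
CASE = ["ning", "dan", "ni", "ga", "da"]        # genitive, ablative, accusative, dative, locative
POSSESSIVE = ["ingiz", "imiz", "ing", "im", "i"]
PLURAL = ["lar"]

# Noun suffixes attach as root + plural + possessive + case,
# so we strip from the right in the opposite order.
SLOTS = [CASE, POSSESSIVE, PLURAL]

# Assumed guard against over-stripping very short roots.
MIN_ROOT = 3


def segment(word):
    """Split a word form into (root, [affixes in surface order])."""
    root = word.lower()
    affixes = []
    for slot in SLOTS:
        for suffix in slot:
            if root.endswith(suffix) and len(root) - len(suffix) >= MIN_ROOT:
                root = root[: -len(suffix)]
                affixes.insert(0, suffix)  # keep left-to-right surface order
                break  # at most one suffix per slot
    return root, affixes


if __name__ == "__main__":
    for form in ["kitoblarimizdan", "maktabda", "kitob"]:
        print(form, segment(form))
```

For example, `segment("kitoblarimizdan")` ("from our books") yields the root `kitob` with affixes `lar`, `imiz`, `dan`. A sketch this small will mis-segment roots that happen to end in a listed suffix and ignores verbs, derivation, and phonological alternations; handling such cases is where a larger rule set or a learned model such as a CRF comes in.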
