Mengliev Davlatyor, Barakhnin Vladimir, Eshkulov Mukhriddin, Ibragimov Bahodir, Madirimov Shohrux
Urgench branch of Tashkent University of Information Technologies named after Muhammad al-Khwarizmi, 110, al-Khwarizmi str., 220100 Urgench city, Uzbekistan.
Novosibirsk State University, 2, Pirogova str., Novosibirsk city 630090, Russia.
Data Brief. 2024 Dec 19;58:111249. doi: 10.1016/j.dib.2024.111249. eCollection 2025 Feb.
In this study, the authors present a dataset for named entity recognition (NER) in the Uzbek language. The dataset contains 2000 sentences and 25,865 words drawn from legal documents and hand-crafted sentences, annotated using the BIOES scheme. To demonstrate its use, the authors trained a model with the CNN + LSTM architecture on the dataset; it achieved an F1 score of 90.8 %, precision of 93.9 %, and recall of 88.0 % on the test set. The dataset and trained model contribute to the development of natural language processing for Uzbek. The authors also review existing work and provide a comparative analysis that identifies the distinctive features and novelty of the proposed dataset. In conclusion, they outline directions for future work: further scaling of the dataset and the use of other neural network architectures.
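For readers unfamiliar with the BIOES scheme mentioned in the abstract, the following is a minimal sketch of how entity spans map to BIOES tags (B = begin, I = inside, E = end, S = single-token entity, O = outside), together with a check of how the reported precision and recall relate to F1. The example sentence, the span indices, and the label names `ORG` and `LOC` are illustrative assumptions; they are not taken from the dataset, whose label set is not given in the abstract. The helper names `spans_to_bioes` and `f1_score` are hypothetical.

```python
def spans_to_bioes(tokens, spans):
    """Convert entity spans to BIOES tags.

    spans: list of (start, end_exclusive, label) over token indices.
    Assumes spans do not overlap.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        if end - start == 1:
            # A one-token entity gets the S (single) tag.
            tags[start] = f"S-{label}"
        else:
            tags[start] = f"B-{label}"
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"
            tags[end - 1] = f"E-{label}"
    return tags


def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Illustrative sentence only (not from the dataset); labels are assumed.
tokens = ["Toshkent", "shahar", "hokimligi", "Urganch", "bilan", "shartnoma", "tuzdi"]
spans = [(0, 3, "ORG"), (3, 4, "LOC")]
tags = spans_to_bioes(tokens, spans)

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")

# Precision and recall as reported in the abstract.
reported_f1 = f1_score(0.939, 0.880)
print(f"F1 from reported precision and recall: {reported_f1:.4f}")
```

The harmonic mean of the rounded precision and recall values comes to about 0.9085, which is in line with the reported F1 of 90.8 % once rounding of the inputs is taken into account.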