Allaberdiev Bobur, Matlatipov Gayrat, Kuriyozov Elmurod, Rakhmonov Zafar
National University of Uzbekistan named after Mirzo Ulugbek, Universitet Street, 4, Olmazor district, 100174, Tashkent city, Uzbekistan.
Urgench State University, Khamid Alimdjan, 14, 220100, Urgench City, Uzbekistan.
Data Brief. 2024 Feb 15;53:110194. doi: 10.1016/j.dib.2024.110194. eCollection 2024 Apr.
This paper presents a parallel corpus of raw Uzbek and Kazakh texts as a dataset for machine translation applications, covering the data collection process, the dataset description, and its potential for reuse. The dataset was built in three stages. The first stage used a small amount of parallel data that was already available. The second stage added data compiled from openly available resources, such as literary books and web news texts covering a wide range of topics and genres, which were aligned at the sentence level. In the third stage, which supplied the majority of the dataset, a raw Uzbek text corpus was manually translated into Kazakh by a group of experts fluent in both languages. The resulting parallel corpus is a valuable resource for researchers and practitioners working on Kazakh and Uzbek language processing, particularly machine translation: the data can be used to test rule-based machine translation models and to train statistical and neural machine translation models. The dataset is available on the Hugging Face platform, a widely used repository that supports collaborative work on Natural Language Processing (NLP) applications. This combination of methods for obtaining a parallel corpus can also serve as a reference point for building similar resources for other low-resource Turkic languages.
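As an illustration of the reuse scenarios mentioned above (a held-out set for testing, the rest for training), the following is a minimal sketch rather than the authors' tooling. It assumes the corpus has been exported as tab-separated lines with the Uzbek sentence first and the Kazakh sentence second; that layout, the function names `read_pairs` and `split_pairs`, and the placeholder strings in `SAMPLE` are all hypothetical and are not taken from the published dataset.

```python
import csv
import io
import random

# Placeholder rows standing in for real sentence pairs (assumed layout: uz<TAB>kk).
SAMPLE = (
    "uz_sentence_1\tkk_sentence_1\n"
    "uz_sentence_2\tkk_sentence_2\n"
    "line without a tab separator\n"
    "uz_sentence_3\tkk_sentence_3\n"
    "uz_sentence_4\tkk_sentence_4\n"
)


def read_pairs(tsv_text):
    """Parse tab-separated text into (uzbek, kazakh) tuples, skipping malformed rows."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    pairs = []
    for row in reader:
        if len(row) != 2:
            continue  # not exactly one source and one target field
        uz, kk = row[0].strip(), row[1].strip()
        if uz and kk:
            pairs.append((uz, kk))
    return pairs


def split_pairs(pairs, test_fraction=0.1, seed=0):
    """Shuffle reproducibly and return (train, test) lists of sentence pairs."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


pairs = read_pairs(SAMPLE)
train, test = split_pairs(pairs, test_fraction=0.25)
print(len(pairs), len(train), len(test))
```

A fixed seed keeps the split reproducible, so the same held-out pairs can be reused when comparing rule-based, statistical, and neural systems.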