Karyukin Vladislav, Rakhimova Diana, Karibayeva Aidana, Turganbayeva Aliya, Turarbek Asem
Department of Information Systems, Al-Farabi Kazakh National University, Almaty, Kazakhstan.
Institute of Information and Computational Technologies, Almaty, Kazakhstan.
PeerJ Comput Sci. 2023 Feb 8;9:e1224. doi: 10.7717/peerj-cs.1224. eCollection 2023.
The development of the machine translation field has been driven by people's need to communicate globally by automatically translating words, sentences, and texts from one language into another. Neural machine translation has become one of the most significant approaches in recent years. It requires large parallel corpora, which are not available for low-resource languages such as Kazakh, making it difficult to achieve high performance with neural machine translation models. This article explores existing methods for dealing with low-resource languages by artificially increasing the size of the corpora and thereby improving the performance of Kazakh-English machine translation models: forward translation, backward translation, and transfer learning. The Sequence-to-Sequence (recurrent neural network and bidirectional recurrent neural network) and Transformer neural machine translation architectures, with their features and specifications, are then examined for the experiments in training models on parallel corpora. The experimental part focuses on building models for the high-quality translation of formal social, political, and scientific texts by generating synthetic parallel sentences from existing monolingual Kazakh data with the forward translation approach and combining them with parallel corpora parsed from official government websites. The resulting corpus of 380,000 parallel Kazakh-English sentences is used to train the recurrent neural network (RNN), bidirectional recurrent neural network (BRNN), and Transformer models of the OpenNMT framework. The quality of the trained models is evaluated with the BLEU, WER, and TER metrics, and sample translations are analyzed. The RNN and BRNN models produced more accurate translations than the Transformer model, and the Byte-Pair Encoding tokenization technique yielded better metric scores and translations than word-level tokenization. The bidirectional recurrent neural network with Byte-Pair Encoding showed the best performance, with 0.49 BLEU, 0.51 WER, and 0.45 TER.
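Forward translation, the augmentation method used here, generates synthetic parallel pairs by running an existing Kazakh-to-English model over monolingual Kazakh text. A minimal sketch in Python follows, assuming a hypothetical `translate_batch` callable that wraps whatever trained model is available; the name, signature, and batching scheme are illustrative, not taken from the paper:

```python
from typing import Callable, List, Tuple

def forward_translate(
    monolingual_kk: List[str],
    translate_batch: Callable[[List[str]], List[str]],  # hypothetical model wrapper
    batch_size: int = 64,
) -> List[Tuple[str, str]]:
    """Build synthetic (Kazakh, English) pairs from monolingual Kazakh sentences.

    The source side is authentic Kazakh text; the target side is machine
    output, so synthetic pairs are noisier than human translations and are
    typically mixed with an authentic parallel corpus, as in this article.
    """
    pairs = []
    for i in range(0, len(monolingual_kk), batch_size):
        batch = monolingual_kk[i:i + batch_size]
        pairs.extend(zip(batch, translate_batch(batch)))
    return pairs
```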
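The Byte-Pair Encoding tokenization that performed best can be reproduced with a subword library such as SentencePiece. The paper does not specify its exact tooling, file names, or vocabulary size, so the `train.kk` path and the 32,000-token vocabulary below are assumptions:

```python
import sentencepiece as spm

# Learn a BPE model from raw Kazakh training text (one sentence per line).
# "train.kk" and vocab_size=32000 are illustrative placeholders.
spm.SentencePieceTrainer.train(
    input="train.kk",
    model_prefix="bpe_kk",
    vocab_size=32000,
    model_type="bpe",
)

# Segment a sentence into subword pieces with the learned model.
sp = spm.SentencePieceProcessor(model_file="bpe_kk.model")
pieces = sp.encode("Мысал сөйлем.", out_type=str)
print(pieces)
```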
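The reported BLEU, WER, and TER scores can be computed with standard toolkits; the sketch below uses sacrebleu for BLEU/TER and jiwer for WER. The library choice is an assumption, since the paper does not name its evaluation tools, and note that sacrebleu reports scores on a 0-100 scale while the abstract uses 0-1:

```python
from sacrebleu.metrics import BLEU, TER
from jiwer import wer

hypotheses = ["the parliament adopted the law"]       # model output (toy example)
references = ["the parliament has adopted the law"]   # human reference

bleu = BLEU().corpus_score(hypotheses, [references])  # sacrebleu: 0-100 scale
ter = TER().corpus_score(hypotheses, [references])
word_error_rate = wer(references, hypotheses)         # jiwer: 0-1, lower is better

print(f"BLEU: {bleu.score / 100:.2f}, "
      f"WER: {word_error_rate:.2f}, "
      f"TER: {ter.score / 100:.2f}")
```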