Nguyen Thien, Nguyen Huu, Tran Phuoc
Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City, Vietnam.
Faculty of Information Technology, Ho Chi Minh City University of Food Industry, Ho Chi Minh City, Vietnam.
Comput Intell Neurosci. 2020 Nov 29;2020:8859452. doi: 10.1155/2020/8859452. eCollection 2020.
While building the first Russian-Vietnamese neural machine translation system, we faced the problem of choosing the translation unit system on which the source and target embeddings are based. Existing homogeneous translation unit systems, which use the same translation unit on the source and target sides, do not suit the investigated language pair well. To solve this problem, we propose a novel heterogeneous translation unit system that takes into account the linguistic characteristics of Russian, a synthetic language, and Vietnamese, an analytic language. Specifically, we lower the embedding level on the source side by splitting tokens into subtokens and raise the embedding level on the target side by merging neighboring tokens into supertokens. Experimental results show that the proposed heterogeneous system improves over the best existing homogeneous Russian-Vietnamese translation system by 1.17 BLEU. Our approach could be applied to building translation bots for other language pairs with differing linguistic characteristics.
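The heterogeneous unit idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how subtokens or supertokens are obtained, so the suffix list, the word lexicon, the `@@` continuation marker, the `_` joiner, and both function names are hypothetical choices made only for illustration.

```python
from typing import Iterable, List, Set, Tuple


def split_into_subtokens(token: str, suffixes: Iterable[str],
                         marker: str = "@@") -> List[str]:
    """Source side: split one token into stem + suffix (toy rule, assumed).

    The longest matching suffix wins; tokens with no known suffix are kept whole.
    """
    for suffix in sorted(suffixes, key=len, reverse=True):
        if token.endswith(suffix) and len(token) > len(suffix):
            return [token[: -len(suffix)] + marker, suffix]
    return [token]


def merge_into_supertokens(tokens: List[str],
                           lexicon: Set[Tuple[str, ...]],
                           max_len: int = 3,
                           joiner: str = "_") -> List[str]:
    """Target side: greedily merge neighboring tokens found in a lexicon.

    At each position the longest span (up to max_len tokens) that appears
    in the lexicon is joined into a single supertoken.
    """
    merged: List[str] = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in lexicon:
                merged.append(joiner.join(tokens[i:i + n]))
                i += n
                break
        else:
            # No multi-token entry starts here; keep the token as it is.
            merged.append(tokens[i])
            i += 1
    return merged


if __name__ == "__main__":
    # Hypothetical Russian suffixes and Vietnamese multi-syllable words.
    russian_suffixes = {"ами", "ов"}
    vietnamese_lexicon = {("tiếng", "Việt"), ("Hà", "Nội")}

    source = "студенты работают с книгами".split()
    target = "tôi học tiếng Việt ở Hà Nội".split()

    source_units = [s for tok in source
                    for s in split_into_subtokens(tok, russian_suffixes)]
    target_units = merge_into_supertokens(target, vietnamese_lexicon)

    print(source_units)
    print(target_units)
```

In this sketch the source sentence ends up with more, smaller units and the target sentence with fewer, larger ones, which is the asymmetry the abstract describes. A real system would presumably replace the toy suffix rule and lexicon with a trained subword model and a Vietnamese word segmenter.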