College of Software Engineering, Zhengzhou University of Light Industry, Zhengzhou 450000, China.
College of Mathematics and Information Science, Zhengzhou University of Light Industry, Zhengzhou 450000, China.
Comput Intell Neurosci. 2022 Aug 3;2022:5296946. doi: 10.1155/2022/5296946. eCollection 2022.
Machine translation relies on parallel sentences, and the number of available parallel sentences is an important factor in the performance of machine translation systems, especially for low-resource languages. Recent advances in learning cross-lingual word representations from nonparallel data open up a new possibility for obtaining bilingual sentences with minimal supervision in low-resource languages. In this paper, we introduce a novel methodology for obtaining parallel sentences using only a small bilingual seed lexicon of a few hundred entries. We first obtain bilingual semantic representations by establishing a cross-lingual mapping between the monolingual representations of the two languages via the seed lexicon. Then, we construct a deep learning classifier to extract bilingual parallel sentences. We demonstrate the effectiveness of our methodology by harvesting Uyghur-Chinese parallel sentences and building a machine translation system from them. The experiments indicate that our method can obtain a large number of high-accuracy parallel sentences for low-resource language pairs.
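The first step described in the abstract, a cross-lingual mapping learned from a seed lexicon, can be illustrated with a minimal sketch. The abstract does not say which mapping technique the authors use, so this example assumes orthogonal Procrustes, a common choice for aligning two monolingual embedding spaces. The function names, the synthetic vectors, and the cosine-based `sentence_similarity` score are all hypothetical; the score is only a simple stand-in for the paper's deep learning classifier, not a reproduction of it.

```python
import numpy as np


def learn_orthogonal_mapping(src_vecs: np.ndarray, tgt_vecs: np.ndarray) -> np.ndarray:
    """Find the orthogonal W that minimises ||src_vecs @ W - tgt_vecs||_F.

    Row i of src_vecs and row i of tgt_vecs are the embeddings of the
    i-th seed-lexicon pair (assumed to be given).
    """
    # Orthogonal Procrustes: W = U V^T, where U S V^T is the SVD of src^T tgt.
    u, _, vt = np.linalg.svd(src_vecs.T @ tgt_vecs)
    return u @ vt


def sentence_similarity(src_words: np.ndarray, tgt_words: np.ndarray,
                        mapping: np.ndarray) -> float:
    """Cosine similarity of averaged word vectors after mapping the source side."""
    src_sent = (src_words @ mapping).mean(axis=0)
    tgt_sent = tgt_words.mean(axis=0)
    denom = np.linalg.norm(src_sent) * np.linalg.norm(tgt_sent)
    return float(src_sent @ tgt_sent / denom)


# Synthetic data (an assumption for illustration): the target space is an
# exact rotation of the source space, and the seed lexicon has 300 entries.
rng = np.random.default_rng(0)
dim, lexicon_size = 16, 300
true_rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
seed_src = rng.normal(size=(lexicon_size, dim))
seed_tgt = seed_src @ true_rotation

mapping = learn_orthogonal_mapping(seed_src, seed_tgt)

# A "parallel" sentence pair shares content; an unrelated pair does not.
src_sentence = rng.normal(size=(5, dim))
parallel_tgt = src_sentence @ true_rotation
unrelated_tgt = rng.normal(size=(5, dim))

parallel_score = sentence_similarity(src_sentence, parallel_tgt, mapping)
unrelated_score = sentence_similarity(src_sentence, unrelated_tgt, mapping)
```

On real data the two embedding spaces are only approximately related by a rotation, so the similarity scores would be noisier than in this synthetic setting, which is one reason a trained classifier is a plausible choice for the extraction step.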