Improving neural machine translation with POS-tag features for low-resource language pairs.

Author information

Hlaing Zar Zar, Thu Ye Kyaw, Supnithi Thepchai, Netisopakul Ponrudee

Affiliations

Faculty of Information Technology, King Mongkut's Institute of Technology Ladkrabang, Bangkok, 10520, Thailand.

Language and Semantic Technology Research Team, NECTEC, Pathum Thani, 12120, Thailand.

Publication information

Heliyon. 2022 Aug 22;8(8):e10375. doi: 10.1016/j.heliyon.2022.e10375. eCollection 2022 Aug.

Abstract

Integrating linguistic features has been widely utilized in statistical machine translation (SMT) systems, resulting in improved translation quality. However, for low-resource languages such as Thai and Myanmar, the integration of linguistic features into neural machine translation (NMT) systems has yet to be implemented. In this study, we propose transformer-based NMT models (transformer, multi-source transformer, and shared-multi-source transformer models) that use linguistic features for two-way translation of Thai-to-Myanmar, Myanmar-to-English, and Thai-to-English. Linguistic features such as part-of-speech (POS) tags or universal part-of-speech (UPOS) tags are added to each word on the source side, the target side, or both, and experiments are conducted with the proposed models. The multi-source transformer and shared-multi-source transformer models take two inputs (i.e., string data and string data with POS tags) and produce string data or string data with POS tags. A transformer model that uses only word vectors served as the first baseline model, and an Edit-Based Transformer with Repositioning (EDITOR) model served as the second. The experimental findings show that adding linguistic features to the transformer-based models enhances the performance of neural machine translation for low-resource language pairs. Moreover, the shared-multi-source transformer models with linguistic features yielded the best translation results, achieving significantly higher Bilingual Evaluation Understudy (BLEU) scores and character n-gram F-scores (chrF) than the baseline transformer and EDITOR models.
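To make the two input streams concrete, the sketch below shows one plausible way to pair tokens with their POS/UPOS tags and build the plain and tag-augmented inputs the multi-source models consume. This is a minimal illustration, not the authors' preprocessing code; the `|` separator and the `attach_tags` helper are assumptions for demonstration only, since the abstract does not specify the exact joining scheme.

```python
# Minimal sketch (not the paper's code) of the two input formats described
# in the abstract: plain string data, and string data with POS tags attached.

def attach_tags(tokens, tags, sep="|"):
    """Pair each token with its POS tag, e.g. 'cat' + 'NOUN' -> 'cat|NOUN'.

    The '|' separator is an illustrative convention assumed here;
    the paper does not state the exact tag-attachment format.
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must align one-to-one")
    return [f"{tok}{sep}{tag}" for tok, tag in zip(tokens, tags)]


# Toy English example with Universal POS (UPOS) tags.
tokens = ["the", "cat", "sat"]
upos = ["DET", "NOUN", "VERB"]

plain_source = " ".join(tokens)                      # input 1: string data
tagged_source = " ".join(attach_tags(tokens, upos))  # input 2: string data with POS tags

print(plain_source)   # the cat sat
print(tagged_source)  # the|DET cat|NOUN sat|VERB
```

In the multi-source setting described in the abstract, both streams would be fed to the encoder side together, while the decoder emits either plain strings or tagged strings depending on the configuration.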

Graphical abstract: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6a65/9404341/b28ed7a76b06/gr001.jpg
