Natural Language Processing and Knowledge Discovery Laboratory, Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City, Vietnam.
Faculty of Information Technology, University of Science, Ho Chi Minh City, Vietnam.
Comput Intell Neurosci. 2022 Jun 3;2022:6856567. doi: 10.1155/2022/6856567. eCollection 2022.
Transformer neural models with multihead attention outperform all existing translation models. Nevertheless, some features of traditional statistical models, such as prior alignment between source and target words, prove useful in training state-of-the-art Transformer models. It has been reported that a lightweight prior alignment can effectively guide the head of the multihead cross-attention sublayer responsible for alignment in Transformer models. In this work, we go a step further by applying heavyweight prior alignments to guide all heads. Specifically, we use a weight of 0.5 for the alignment cost, which is added to the token cost to form the overall training cost of a Transformer model; the alignment cost is defined as the deviation of the attention probability from the prior alignments. Moreover, we increase the role of the prior alignment by computing the attention probability as the average over all heads of the multihead attention sublayer in the penultimate layer of the Transformer model. Experimental results on an English-Vietnamese translation task show that the proposed approach helps train superior Transformer-based translation models. Our Transformer model (25.71 BLEU) outperforms the baseline model (21.34 BLEU) by a large margin of 4.37 BLEU points. Case studies by native speakers on selected translation results confirm the automatic evaluation. These results encourage the use of heavyweight prior alignments to improve Transformer-based translation models. This work contributes to the literature on machine translation, especially for less common language pairs. Since the proposal is language-independent, it can be applied to other language pairs, including Slavic languages.
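To make the combined training objective concrete, the following is a minimal sketch, assuming PyTorch-style tensors: the attention probability is averaged over all heads of the penultimate layer's cross-attention, the alignment cost is one possible deviation measure (a cross-entropy against the prior alignment), and it is added to the token cost with weight 0.5. Names such as total_loss, ALIGN_WEIGHT, and prior_align are illustrative, not taken from the paper, and the exact deviation measure used by the authors may differ.

```python
# Hedged sketch of a heavyweight-prior-alignment training cost, assuming PyTorch.
import torch

ALIGN_WEIGHT = 0.5  # weight of the alignment cost, as stated in the abstract

def total_loss(token_loss, cross_attn_heads, prior_align, eps=1e-9):
    """
    token_loss       : scalar tensor, standard cross-entropy over target tokens
    cross_attn_heads : (n_heads, tgt_len, src_len) attention probabilities from
                       the cross-attention sublayer of the penultimate layer
    prior_align      : (tgt_len, src_len) prior alignment distribution
                       (rows sum to 1, e.g., produced by a statistical aligner)
    """
    # Average the attention probability over all heads (not just one guided head).
    avg_attn = cross_attn_heads.mean(dim=0)                       # (tgt_len, src_len)
    # Alignment cost: deviation of the averaged attention from the prior alignment,
    # measured here (as an assumption) by a per-target-token cross-entropy.
    align_loss = -(prior_align * torch.log(avg_attn + eps)).sum(dim=-1).mean()
    # Overall training cost: token cost plus weighted alignment cost.
    return token_loss + ALIGN_WEIGHT * align_loss

# Usage example with random values (4 heads, 7 target tokens, 9 source tokens):
if __name__ == "__main__":
    heads = torch.softmax(torch.randn(4, 7, 9), dim=-1)
    prior = torch.softmax(torch.randn(7, 9), dim=-1)
    print(total_loss(torch.tensor(2.3), heads, prior))
```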