Guo Junjun, Su Rui, Ye Junjie
Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, Yunnan, 650500, China; Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming, Yunnan, 650500, China.
Neural Netw. 2024 Oct;178:106403. doi: 10.1016/j.neunet.2024.106403. Epub 2024 May 23.
The goal of multi-modal neural machine translation (MNMT) is to incorporate language-agnostic visual information into text to enhance the performance of machine translation. However, due to the inherent differences between image and text, the two modalities inevitably suffer from semantic mismatch. To tackle this issue, this paper adopts a multi-grained visual pivot-guided multi-modal fusion strategy with cross-modal contrastive disentangling to eliminate the linguistic gaps between different languages. By using the disentangled multi-grained visual information as a cross-lingual pivot, we can strengthen the alignment between languages and improve the performance of MNMT. We first introduce text-guided stacked cross-modal disentangling modules that progressively disentangle the image into two types of visual information: MT-related visual information and background information. We then effectively integrate these two kinds of multi-grained visual elements to assist target-sentence generation. Extensive experiments on four benchmark MNMT datasets show that the proposed approach achieves significant improvements over other state-of-the-art (SOTA) approaches on all test sets. In-depth analysis highlights the benefits of text-guided cross-modal disentangling and visual pivot-based multi-modal fusion in MNMT. We release the code at https://github.com/nlp-mnmt/ConVisPiv-MNMT.
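The two core ideas in the abstract, splitting image features into an MT-related part and a background part under text guidance, and pulling matched text/visual representations together with a contrastive objective, can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual architecture; the attention-based split, the function names, and all dimensions are assumptions chosen for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def text_guided_disentangle(text, image):
    """Split image region features into MT-related and background parts.

    text:  (Lt, D) text token features
    image: (Lv, D) image region features
    Text tokens attend over image regions; regions receiving high average
    attention are treated as MT-related, the residual as background.
    """
    attn = softmax(text @ image.T / np.sqrt(text.shape[1]), axis=-1)  # (Lt, Lv)
    relevance = attn.mean(axis=0)                 # per-region relevance, (Lv,)
    related = image * relevance[:, None]          # MT-related visual component
    background = image - related                  # residual background component
    return related, background

def info_nce(text_vec, vis_vec, temperature=0.07):
    """InfoNCE contrastive loss over a batch of matched (text, visual) pairs.

    text_vec, vis_vec: (B, D) pooled features; row i of each is a matched pair.
    """
    t = text_vec / np.linalg.norm(text_vec, axis=1, keepdims=True)
    v = vis_vec / np.linalg.norm(vis_vec, axis=1, keepdims=True)
    logits = t @ v.T / temperature                # (B, B) similarity matrix
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))            # matched pairs lie on the diagonal
```

By construction the two components sum back to the original image features, so the split is lossless; the contrastive term would then encourage the MT-related component (not the background) to align with the text, which is the role of the visual pivot described above.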