IEEE Trans Pattern Anal Mach Intell. 2023 Mar;45(3):3918-3932. doi: 10.1109/TPAMI.2022.3181116. Epub 2023 Feb 3.
The main challenge in unsupervised machine translation (UMT) is to associate source and target sentences in the latent space. Because people who speak different languages share biologically similar visual systems, various unsupervised multi-modal machine translation (UMMT) models have been proposed to improve UMT performance by employing the visual content of natural images to facilitate alignment. Relational information commonly carries important semantics in a sentence, and compared with images, videos can better present the interactions between objects and the ways in which objects change over time. However, current state-of-the-art methods explore only scene-level or object-level information from images without explicitly modeling object relations; they are therefore sensitive to spurious correlations, which poses a new challenge for UMMT models. In this paper, we employ a spatial-temporal graph obtained from videos to exploit object interactions in space and time for disambiguation and to promote latent-space alignment in UMMT. Our model employs multi-modal back-translation and features pseudo-visual pivoting, in which we learn a shared multilingual visual-semantic embedding space and incorporate visually pivoted captioning as additional weak supervision. Experimental results on the VATEX Translation 2020 and HowToWorld datasets validate the translation capability of our model at both the sentence and word levels and show that it generalizes well when videos are not available at test time.
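To make the shared visual-semantic embedding idea concrete, below is a minimal PyTorch sketch: pooled video features and sentence features are projected into one joint space and aligned with a hinge-based triplet ranking loss over in-batch negatives, which is one common way to implement visual pivoting. The module names, feature dimensions, margin, and loss formulation are illustrative assumptions, not the authors' actual implementation.

```python
# A minimal sketch (not the authors' code) of a shared visual-semantic
# embedding trained with a max-margin ranking loss. All names, dimensions,
# and the loss formulation are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedEmbedding(nn.Module):
    """Projects video features and sentence features into one joint space."""
    def __init__(self, video_dim=1024, text_dim=512, joint_dim=256):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, joint_dim)
        self.text_proj = nn.Linear(text_dim, joint_dim)

    def forward(self, video_feat, text_feat):
        v = F.normalize(self.video_proj(video_feat), dim=-1)
        t = F.normalize(self.text_proj(text_feat), dim=-1)
        return v, t

def max_margin_loss(v, t, margin=0.2):
    """Hinge-based triplet ranking loss over in-batch negatives,
    pulling each video toward its paired sentence and vice versa."""
    scores = v @ t.t()                                   # cosine similarities (B x B)
    pos = scores.diag().unsqueeze(1)                     # matched video-sentence pairs
    cost_t = (margin + scores - pos).clamp(min=0)        # video -> wrong sentence
    cost_v = (margin + scores - pos.t()).clamp(min=0)    # sentence -> wrong video
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    cost_t = cost_t.masked_fill(mask, 0)
    cost_v = cost_v.masked_fill(mask, 0)
    return cost_t.mean() + cost_v.mean()

# Toy usage with random tensors standing in for video / sentence encoders.
model = SharedEmbedding()
video_feat = torch.randn(8, 1024)   # e.g. pooled spatial-temporal graph features
text_feat = torch.randn(8, 512)     # e.g. pooled sentence encoder states
v, t = model(video_feat, text_feat)
loss = max_margin_loss(v, t)
loss.backward()
```

In such a setup, sentences from both languages can be mapped through the same text projection so that the video acts as a pivot that pulls translation pairs close together in the joint space.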