Calixto Iacer, Liu Qun
University of Amsterdam, ILLC, Science Park, Amsterdam, Netherlands.
Huawei Noah's Ark Lab, Hong Kong, Hong Kong.
Mach Transl. 2019;33(1):155-177. doi: 10.1007/s10590-019-09226-9. Epub 2019 Apr 8.
In this article, we conduct an extensive quantitative error analysis of different multi-modal neural machine translation (MNMT) models which integrate visual features into different parts of both the encoder and the decoder. We investigate the scenario where models are trained on an in-domain training data set of parallel sentence pairs with images. We analyse two different types of MNMT models, which use local and global image features: the latter encode an image globally, i.e. there is one feature vector representing an entire image, whereas the former encode spatial information, i.e. there are multiple feature vectors, each encoding a different portion of the image. We conduct an error analysis of translations generated by different MNMT models as well as text-only baselines, where we study how multi-modal models compare when translating both visual and non-visual terms. In general, we find that the additional multi-modal signals consistently improve translations, even more so when using simpler MNMT models with global visual features. We also find that not only are translations of terms with a strong visual connotation improved, but almost all kinds of errors decrease when using multi-modal models.
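The distinction between the two feature types can be illustrated with a minimal sketch. Assuming a hypothetical CNN image encoder whose convolutional layer outputs an 8x8 grid of 512-dimensional activations (the exact grid size and dimensionality are illustrative, not taken from the paper), spatial features keep one vector per image region, while a global feature pools the grid into a single vector for the whole image:

```python
import numpy as np

# Hypothetical CNN output for one image: an 8x8 grid of 512-d
# activation vectors (e.g. from a convolutional layer of an
# image encoder; shapes are illustrative assumptions).
feature_map = np.random.rand(8, 8, 512)

# Spatial features: one vector per image region,
# i.e. 64 feature vectors of 512 dimensions each.
spatial_features = feature_map.reshape(-1, 512)

# Global feature: average-pool the spatial grid into a single
# 512-d vector representing the entire image.
global_feature = feature_map.mean(axis=(0, 1))

print(spatial_features.shape)  # (64, 512)
print(global_feature.shape)    # (512,)
```

A model using spatial features can attend over the 64 region vectors during decoding, whereas a model using the global feature conditions on one summary vector, e.g. by initialising the encoder or decoder state with it.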