Lan Zhibin, Yu Jiawei, Liu Shiyu, Yao Junfeng, Huang Degen, Su Jinsong
School of Informatics, Xiamen University, Xiamen, 361005, Fujian, China; Key Laboratory of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian and Taiwan, Ministry of Culture and Tourism, Xiamen, 361005, Fujian, China.
School of Film, Xiamen University, Xiamen, 361005, Fujian, China.
Neural Netw. 2025 Sep;189:107599. doi: 10.1016/j.neunet.2025.107599. Epub 2025 May 23.
Text image machine translation (TIMT) is a widely used machine translation task that aims to translate source text embedded in an image into the target language. However, research in this area faces two challenges: (1) dominant models are built in a cascaded manner and therefore suffer from error propagation from optical character recognition (OCR), and (2) publicly available large-scale datasets are lacking. To address these issues, we propose a multimodal codebook-based TIMT model. In addition to a text encoder, an image encoder, and a text decoder, our model is equipped with a multimodal codebook that associates images with relevant texts, thus providing useful supplementary information for translation. In particular, we present a multi-stage training framework that exploits various datasets to train our model effectively. Concretely, we first conduct preliminary training of the text encoder and text decoder on bilingual texts. Subsequently, via an additional code-conditioned mask translation task, we continue to train the text encoder, multimodal codebook, and text decoder on the bilingual texts. Afterwards, by further introducing an image-text alignment task and adversarial training, we train the whole model except for the text decoder on the OCR dataset. Finally, using the above training tasks except for text translation, we fine-tune the whole model on a TIMT dataset. In addition, we manually annotate a Chinese-English TIMT dataset, named OCRMT30K, and extend it to a Chinese-German TIMT dataset with an automatic translation tool. To the best of our knowledge, it is the first public manually annotated TIMT dataset, which should facilitate future studies on this task. To investigate the effectiveness of our model, we conduct extensive experiments on Chinese-English and Chinese-German TIMT tasks. Experimental results and in-depth analyses demonstrate the effectiveness of our model.
We release our code and dataset at https://github.com/DeepLearnXMU/mc_tit.
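The four-stage training schedule described in the abstract can be summarized as a small table in code. The following is only an illustrative sketch, not the authors' implementation: the module names, the `Stage` class, its fields, and the `frozen_modules` helper are all hypothetical labels chosen here, and only the information stated in the abstract (data source, trained modules, and newly introduced tasks per stage) is encoded.

```python
from dataclasses import dataclass

# Hypothetical labels for the four components named in the abstract.
MODULES = frozenset(
    {"text_encoder", "image_encoder", "multimodal_codebook", "text_decoder"}
)


@dataclass(frozen=True)
class Stage:
    """One training stage as described in the abstract (illustrative only)."""
    name: str
    data: str                 # dataset type used in this stage
    trainable: frozenset      # modules updated in this stage
    introduced_tasks: tuple   # tasks first introduced in this stage


STAGES = (
    Stage("stage 1", "bilingual texts",
          frozenset({"text_encoder", "text_decoder"}),
          ("text_translation",)),
    Stage("stage 2", "bilingual texts",
          frozenset({"text_encoder", "multimodal_codebook", "text_decoder"}),
          ("code_conditioned_mask_translation",)),
    Stage("stage 3", "OCR dataset",
          MODULES - {"text_decoder"},
          ("image_text_alignment", "adversarial_training")),
    # Stage 4 introduces no new task: per the abstract, it fine-tunes with
    # the earlier tasks except text translation.
    Stage("stage 4", "TIMT dataset", MODULES, ()),
)


def frozen_modules(stage: Stage) -> list:
    """Return the modules that are NOT updated in the given stage, sorted."""
    return sorted(MODULES - stage.trainable)


if __name__ == "__main__":
    for stage in STAGES:
        print(f"{stage.name}: data={stage.data}, "
              f"trains={sorted(stage.trainable)}, "
              f"frozen={frozen_modules(stage)}")
```

In an actual training loop, a table like this would typically drive which parameters receive gradient updates in each stage; how the released code does this is not stated in the abstract.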