
Towards better text image machine translation with multimodal codebook and multi-stage training.

Author Information

Lan Zhibin, Yu Jiawei, Liu Shiyu, Yao Junfeng, Huang Degen, Su Jinsong

Affiliations

School of Informatics, Xiamen University, Xiamen, 361005, Fujian, China; Key Laboratory of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian and Taiwan, Ministry of Culture and Tourism, Xiamen, 361005, Fujian, China.

School of Film, Xiamen University, Xiamen, 361005, Fujian, China.

Publication Information

Neural Netw. 2025 Sep;189:107599. doi: 10.1016/j.neunet.2025.107599. Epub 2025 May 23.

DOI: 10.1016/j.neunet.2025.107599
PMID: 40435555
Abstract

As a widely used machine translation task, text image machine translation (TIMT) aims to translate the source texts embedded in an image into target translations. However, studies in this area face two challenges: (1) dominant models are constructed in a cascaded manner and thus suffer from the error propagation of optical character recognition (OCR), and (2) the field lacks publicly available large-scale datasets. To deal with these issues, we propose a multimodal codebook based TIMT model. In addition to a text encoder, an image encoder, and a text decoder, our model is equipped with a multimodal codebook that effectively associates images with relevant texts, thus providing useful supplementary information for translation. In particular, we present a multi-stage training framework to fully exploit various datasets to effectively train our model. Concretely, we first conduct preliminary training on the text encoder and decoder using bilingual texts. Subsequently, via an additional code-conditioned mask translation task, we use the bilingual texts to continuously train the text encoder, multimodal codebook, and decoder. Afterwards, by further introducing an image-text alignment task and adversarial training, we train the whole model except for the text decoder on the OCR dataset. Finally, through the above training tasks except for text translation, we adopt a TIMT dataset to fine-tune the whole model. Besides, we manually annotate a Chinese-English TIMT dataset, named OCRMT30K, and extend it to a Chinese-German TIMT dataset through an automatic translation tool. To the best of our knowledge, it is the first public manually annotated TIMT dataset, which facilitates future studies on this task. To investigate the effectiveness of our model, we conduct extensive experiments on Chinese-English and Chinese-German TIMT tasks. Experimental results and in-depth analyses strongly demonstrate the effectiveness of our model. We release our code and dataset on https://github.com/DeepLearnXMU/mc_tit.
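The abstract names four components (text encoder, image encoder, text decoder, and a multimodal codebook) but does not spell out how the codebook associates images with texts. Below is a minimal sketch of one plausible reading, assuming a learned embedding table that image patch features query via cross-attention; the class, its parameters, and the patch-projection backbone are all hypothetical, not the authors' released implementation (see their repository for the real code).

```python
# Hypothetical PyTorch sketch of the architecture described above. The
# codebook mechanism is an assumption (a learned embedding table queried by
# image patch features via cross-attention); all names here are illustrative.
import torch
import torch.nn as nn

class MultimodalCodebookTIMT(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, num_codes=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=6)
        # Stand-in vision backbone: a 16x16 patch projection, not the paper's choice.
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, d_model, kernel_size=16, stride=16), nn.Flatten(2))
        self.codebook = nn.Embedding(num_codes, d_model)  # shared latent codes
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=8,
                                                batch_first=True)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=6)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, src_ids, image, tgt_ids):
        text_h = self.text_encoder(self.embed(src_ids))       # (B, S, D)
        img_h = self.image_encoder(image).transpose(1, 2)     # (B, P, D)
        codes = self.codebook.weight.unsqueeze(0).expand(img_h.size(0), -1, -1)
        # Image features attend over the codebook: the "association" step that
        # yields text-related supplementary memory for the decoder.
        code_mem, _ = self.cross_attn(img_h, codes, codes)    # (B, P, D)
        memory = torch.cat([text_h, code_mem], dim=1)         # (B, S+P, D)
        # Causal target mask omitted for brevity.
        return self.out(self.decoder(self.embed(tgt_ids), memory))  # (B, T, V)
```

With dummy inputs (`src_ids` of shape `(B, S)`, an `image` batch of shape `(B, 3, 256, 256)`, and `tgt_ids` of shape `(B, T)`), the forward pass returns target-vocabulary logits.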

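The abstract is also specific about which components each of the four training stages updates (notably, the third stage trains everything except the text decoder). The helper below is a hedged sketch of that stage-wise freezing, keyed to the child-module names of the sketch above; how the embedding and output layers are assigned to stages is my assumption, since the abstract only pins down the encoders, codebook, and decoder.

```python
import torch.nn as nn

# Stage-wise freezing for the four-stage schedule in the abstract. Module
# names match the earlier sketch; the embedding/output-layer assignments
# are assumptions, not stated in the abstract.
STAGE_TRAINABLE = {
    1: {"embed", "text_encoder", "decoder", "out"},              # bilingual MT warm-up
    2: {"embed", "text_encoder", "codebook", "decoder", "out"},  # code-conditioned mask translation
    3: {"embed", "text_encoder", "image_encoder", "codebook",
        "cross_attn"},                                           # OCR stage: text decoder frozen
    4: {"embed", "text_encoder", "image_encoder", "codebook",
        "cross_attn", "decoder", "out"},                         # full fine-tuning on TIMT
}

def set_stage(model: nn.Module, stage: int) -> None:
    """Freeze every top-level child that the given stage does not train."""
    trainable = STAGE_TRAINABLE[stage]
    for name, child in model.named_children():
        for p in child.parameters():
            p.requires_grad = name in trainable
```

After each call, rebuild the optimizer over `[p for p in model.parameters() if p.requires_grad]` so the frozen parameters stay out of the update.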

Similar Articles

1. Towards better text image machine translation with multimodal codebook and multi-stage training. Neural Netw. 2025 Sep;189:107599. doi: 10.1016/j.neunet.2025.107599. Epub 2025 May 23.
2. Small training dataset convolutional neural networks for application-specific super-resolution microscopy. J Biomed Opt. 2023 Mar;28(3):036501. doi: 10.1117/1.JBO.28.3.036501. Epub 2023 Mar 14.
3. LMISA: A lightweight multi-modality image segmentation network via domain adaptation using gradient magnitude and shape constraint. Med Image Anal. 2022 Oct;81:102536. doi: 10.1016/j.media.2022.102536. Epub 2022 Jul 13.
4. Structured Multimodal Attentions for TextVQA. IEEE Trans Pattern Anal Mach Intell. 2022 Dec;44(12):9603-9614. doi: 10.1109/TPAMI.2021.3132034. Epub 2022 Nov 7.
5. Text-guided Image Restoration and Semantic Enhancement for Text-to-Image Person Retrieval. Neural Netw. 2025 Apr;184:107028. doi: 10.1016/j.neunet.2024.107028. Epub 2024 Dec 16.
6. Decoding Digital Discourse Through Multimodal Text and Image Machine Learning Models to Classify Sentiment and Detect Hate Speech in Race- and Lesbian, Gay, Bisexual, Transgender, Queer, Intersex, and Asexual Community-Related Posts on Social Media: Quantitative Study. J Med Internet Res. 2025 May 12;27:e72822. doi: 10.2196/72822.
7. Towards annotation-efficient segmentation via image-to-image translation. Med Image Anal. 2022 Nov;82:102624. doi: 10.1016/j.media.2022.102624. Epub 2022 Sep 21.
8. SAMCT: Segment Any CT Allowing Labor-Free Task-Indicator Prompts. IEEE Trans Med Imaging. 2025 Mar;44(3):1386-1399. doi: 10.1109/TMI.2024.3493456. Epub 2025 Mar 17.
9. Is image-to-image translation the panacea for multimodal image registration? A comparative study. PLoS One. 2022 Nov 28;17(11):e0276196. doi: 10.1371/journal.pone.0276196. eCollection 2022.
10. C MAL: cascaded network-guided class-balanced multi-prototype auxiliary learning for source-free domain adaptive medical image segmentation. Med Biol Eng Comput. 2025 May;63(5):1551-1570. doi: 10.1007/s11517-025-03287-0. Epub 2025 Jan 20.