
Unpaired Image-Text Matching via Multimodal Aligned Conceptual Knowledge.

Authors

Huang Yan, Wang Yuming, Zeng Yunan, Huang Junshi, Chai Zhenhua, Wang Liang

Publication

IEEE Trans Pattern Anal Mach Intell. 2025 Jul;47(7):5160-5176. doi: 10.1109/TPAMI.2024.3432552.

Abstract

Recently, the accuracy of image-text matching has been greatly improved by multimodal pretrained models, all of which use millions or billions of paired images and texts for supervised model learning. In contrast, the human brain can match images with texts well using its stored multimodal knowledge. Inspired by this, this paper studies a new scenario, unpaired image-text matching, in which paired images and texts are assumed to be unavailable during model learning. To deal with it, we propose a simple yet effective method named Multimodal Aligned Conceptual Knowledge (MACK). First, we collect a set of words and their related image regions from publicly available datasets, and compute prototypical region representations to obtain pretrained general knowledge. To make the obtained knowledge better suit certain datasets, we refine it using unpaired images and texts in a self-supervised manner to obtain fine-tuned domain knowledge. Then, to match given images with texts based on this knowledge, we represent parsed words in the texts by their prototypical region representations and compute region-word similarity scores. Finally, the scores are aggregated via bidirectional similarity pooling into an image-text similarity score, which can be directly used for unpaired image-text matching. The proposed MACK is complementary to existing models and can be easily extended as a re-ranking method to substantially improve their performance on zero-shot and cross-dataset image-text matching.
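The scoring pipeline described in the abstract (region-word similarities aggregated by bidirectional pooling) can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes cosine similarity between region features and word prototype vectors, and a max-then-mean pooling in each direction; the paper's exact pooling scheme and feature extractors may differ.

```python
import numpy as np

def cosine_sim_matrix(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def image_text_score(region_feats, word_protos):
    """Aggregate region-word similarities into one image-text score.

    region_feats: (R, d) features of detected image regions.
    word_protos:  (W, d) prototypical region representations of parsed words.
    Pools in both directions (best region per word, best word per region)
    and averages the two, giving a single similarity score.
    """
    s = cosine_sim_matrix(region_feats, word_protos)  # (R, W)
    text_to_image = s.max(axis=0).mean()  # best-matching region per word
    image_to_text = s.max(axis=1).mean()  # best-matching word per region
    return 0.5 * (text_to_image + image_to_text)

# Toy usage with random features (36 regions, 8 words, 256-d space).
rng = np.random.default_rng(0)
regions = rng.normal(size=(36, 256))
words = rng.normal(size=(8, 256))
score = image_text_score(regions, words)
```

For retrieval, such a score would be computed for every candidate pair and ranked; as a re-ranking method, it would be combined with an existing model's scores rather than used alone.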
