
Unpaired Image-Text Matching via Multimodal Aligned Conceptual Knowledge.

Authors

Huang Yan, Wang Yuming, Zeng Yunan, Huang Junshi, Chai Zhenhua, Wang Liang

Publication

IEEE Trans Pattern Anal Mach Intell. 2025 Jul;47(7):5160-5176. doi: 10.1109/TPAMI.2024.3432552.

Abstract

Recently, the accuracy of image-text matching has been greatly improved by multimodal pretrained models, all of which use millions or billions of paired images and texts for supervised model learning. In contrast, the human brain can match images with texts well using its stored multimodal knowledge. Inspired by this, this paper studies a new scenario, unpaired image-text matching, in which paired images and texts are assumed to be unavailable during model learning. To deal with it, we propose a simple yet effective method named Multimodal Aligned Conceptual Knowledge (MACK). First, we collect a set of words and their related image regions from publicly available datasets, and compute prototypical region representations to obtain pretrained general knowledge. To make the obtained knowledge better suit certain datasets, we refine it using unpaired images and texts in a self-supervised manner to obtain fine-tuned domain knowledge. Then, to match given images with texts based on this knowledge, we represent parsed words in the texts by their prototypical region representations and compute region-word similarity scores. Finally, the scores are aggregated via bidirectional similarity pooling into an image-text similarity score, which can be directly used for unpaired image-text matching. The proposed MACK is complementary to existing models and can be easily extended as a re-ranking method to substantially improve their performance on zero-shot and cross-dataset image-text matching.
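The scoring pipeline described in the abstract (region-word similarities aggregated by bidirectional pooling) can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes cosine similarity between region features and word prototype vectors, and a max-then-mean pooling in each direction; the paper's exact pooling scheme and feature extractors may differ.

```python
import numpy as np

def cosine_sim_matrix(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def image_text_score(region_feats, word_protos):
    """Aggregate region-word similarities into one image-text score.

    region_feats: (R, d) features of detected image regions.
    word_protos:  (W, d) prototypical region representations of parsed words.
    Pools in both directions (best region per word, best word per region)
    and averages the two, giving a single similarity score.
    """
    s = cosine_sim_matrix(region_feats, word_protos)  # (R, W)
    text_to_image = s.max(axis=0).mean()  # best-matching region per word
    image_to_text = s.max(axis=1).mean()  # best-matching word per region
    return 0.5 * (text_to_image + image_to_text)

# Toy usage with random features (36 regions, 8 words, 256-d space).
rng = np.random.default_rng(0)
regions = rng.normal(size=(36, 256))
words = rng.normal(size=(8, 256))
score = image_text_score(regions, words)
```

For retrieval, such a score would be computed for every candidate pair and ranked; as a re-ranking method, it would be combined with an existing model's scores rather than used alone.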
