

Unpaired Image-Text Matching via Multimodal Aligned Conceptual Knowledge

Author Information

Huang Yan, Wang Yuming, Zeng Yunan, Huang Junshi, Chai Zhenhua, Wang Liang

Publication Information

IEEE Trans Pattern Anal Mach Intell. 2025 Jul;47(7):5160-5176. doi: 10.1109/TPAMI.2024.3432552.

DOI: 10.1109/TPAMI.2024.3432552
PMID: 39042537
Abstract

Recently, the accuracy of image-text matching has been greatly improved by multimodal pretrained models, all of which use millions or billions of paired images and texts for supervised model learning. Unlike them, human brains can match images with texts well using their stored multimodal knowledge. Inspired by that, this paper studies a new scenario, unpaired image-text matching, in which paired images and texts are assumed to be unavailable during model learning. To deal with it, we propose a simple yet effective method named Multimodal Aligned Conceptual Knowledge (MACK). First, we collect a set of words and their related image regions from publicly available datasets, and compute prototypical region representations to obtain pretrained general knowledge. To make the obtained knowledge better suit certain datasets, we refine it using unpaired images and texts in a self-supervised learning manner to obtain fine-tuned domain knowledge. Then, to match given images with texts based on the knowledge, we represent parsed words in the texts by prototypical region representations and compute region-word similarity scores. Finally, the scores are aggregated by bidirectional similarity pooling into an image-text similarity score, which can be directly used for unpaired image-text matching. The proposed MACK is complementary to existing models and can be easily extended as a re-ranking method to substantially improve their performance on zero-shot and cross-dataset image-text matching.
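The matching step described in the abstract (region-word similarity scores aggregated by bidirectional similarity pooling into a single image-text score) can be sketched roughly as below. The function name, feature shapes, the use of cosine similarity, and the max-then-mean pooling in each direction are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def image_text_similarity(region_feats, word_feats):
    """Hypothetical MACK-style scoring sketch.

    region_feats: (n_regions, d) image region embeddings
    word_feats:   (n_words, d) prototypical region representations
                  of the parsed words in the text
    Returns a scalar image-text similarity score.
    """
    # L2-normalize so dot products become cosine similarities
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    w = word_feats / np.linalg.norm(word_feats, axis=1, keepdims=True)
    sim = r @ w.T  # (n_regions, n_words) region-word similarity scores

    # Bidirectional similarity pooling: best-matching word per region,
    # best-matching region per word; average the two directions.
    img_to_txt = sim.max(axis=1).mean()  # image -> text direction
    txt_to_img = sim.max(axis=0).mean()  # text -> image direction
    return 0.5 * (img_to_txt + txt_to_img)
```

Because the score needs no paired supervision to compute, it can be used directly for unpaired matching, or added to an existing model's score as a re-ranking signal.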


Similar Articles

1. Unpaired Image-Text Matching via Multimodal Aligned Conceptual Knowledge.
   IEEE Trans Pattern Anal Mach Intell. 2025 Jul;47(7):5160-5176. doi: 10.1109/TPAMI.2024.3432552.
2. MedCLIP: Contrastive Learning from Unpaired Medical Images and Text.
   Proc Conf Empir Methods Nat Lang Process. 2022 Dec;2022:3876-3887. doi: 10.18653/v1/2022.emnlp-main.256.
3. Few-Shot Image and Sentence Matching via Aligned Cross-Modal Memory.
   IEEE Trans Pattern Anal Mach Intell. 2022 Jun;44(6):2968-2983. doi: 10.1109/TPAMI.2021.3052490. Epub 2022 May 5.
4. Novel cross-dimensional coarse-fine-grained complementary network for image-text matching.
   PeerJ Comput Sci. 2025 Mar 3;11:e2725. doi: 10.7717/peerj-cs.2725. eCollection 2025.
5. Learning Aligned Image-Text Representations Using Graph Attentive Relational Network.
   IEEE Trans Image Process. 2021;30:1840-1852. doi: 10.1109/TIP.2020.3048627. Epub 2021 Jan 18.
6. Towards better text image machine translation with multimodal codebook and multi-stage training.
   Neural Netw. 2025 Sep;189:107599. doi: 10.1016/j.neunet.2025.107599. Epub 2025 May 23.
7. A self-supervised guided knowledge distillation framework for unpaired low-dose CT image denoising.
   Comput Med Imaging Graph. 2023 Jul;107:102237. doi: 10.1016/j.compmedimag.2023.102237. Epub 2023 Apr 23.
8. A modality-collaborative convolution and transformer hybrid network for unpaired multi-modal medical image segmentation with limited annotations.
   Med Phys. 2023 Sep;50(9):5460-5478. doi: 10.1002/mp.16338. Epub 2023 Mar 15.
9. A multimodal similarity-aware and knowledge-driven pre-training approach for reliable pneumoconiosis diagnosis.
   J Xray Sci Technol. 2025 Jan;33(1):229-248. doi: 10.1177/08953996241296400. Epub 2025 Jan 13.
10. Quasi-supervised MR-CT image conversion based on unpaired data.
   Phys Med Biol. 2025 Jun 17;70(12). doi: 10.1088/1361-6560/ade220.