Wang Zifeng, Wu Zhenbang, Agarwal Dinesh, Sun Jimeng
Department of Computer Science, University of Illinois Urbana-Champaign.
Adobe.
Proc Conf Empir Methods Nat Lang Process. 2022 Dec;2022:3876-3887. doi: 10.18653/v1/2022.emnlp-main.256.
Existing vision-text contrastive learning methods such as CLIP (Radford et al., 2021) aim to match paired image and caption embeddings while pushing unpaired ones apart, which improves representation transferability and supports zero-shot prediction. However, medical image-text datasets are orders of magnitude smaller than the general image-caption collections available from the internet. Moreover, previous methods encounter many false negatives: images and reports from separate patients may carry the same semantics but are wrongly treated as negatives. In this paper, we decouple images and texts for multimodal contrastive learning, which scales the usable training data combinatorially at low cost. We also propose replacing the InfoNCE loss with a semantic matching loss based on medical knowledge to eliminate false negatives in contrastive learning. We show that MedCLIP is a simple yet effective framework: it outperforms state-of-the-art methods on zero-shot prediction, supervised classification, and image-text retrieval. Surprisingly, we observe that with only 20K pre-training data, MedCLIP outperforms the state-of-the-art method (which uses ≈200K data).
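The idea of replacing InfoNCE's one-hot pairing targets with knowledge-derived soft targets can be illustrated with a minimal NumPy sketch. The abstract does not give the formulation, so everything here is an assumption made for illustration: each image and each text is assumed to come with a multi-hot vector of medical findings, the soft target is assumed to be a softmax over the cosine similarity of those label vectors, and the function name `semantic_matching_loss`, its parameters, and the temperature value are hypothetical rather than taken from the paper. Because the targets depend only on labels, the images and texts need not be paired or even equal in number.

```python
import numpy as np


def _normalize(x, eps=1e-8):
    # Scale each row to unit length; eps guards against all-zero rows.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def _log_softmax(z):
    # Numerically stable row-wise log-softmax.
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))


def semantic_matching_loss(img_emb, txt_emb, img_labels, txt_labels,
                           temperature=0.07):
    """Soft-target contrastive loss (illustrative assumption, not the paper's code).

    img_emb:    (N, D) image embeddings
    txt_emb:    (M, D) text embeddings; M may differ from N (decoupled data)
    img_labels: (N, K) multi-hot finding labels for images (assumed available)
    txt_labels: (M, K) multi-hot finding labels for texts (assumed available)
    """
    # Semantic similarity between every image and every text, from labels only.
    label_sim = _normalize(img_labels.astype(float)) @ _normalize(
        txt_labels.astype(float)).T
    # Predicted similarity from the embeddings, sharpened by the temperature.
    logits = _normalize(img_emb) @ _normalize(txt_emb).T / temperature

    # Image-to-text direction: cross-entropy against soft targets over texts.
    target_i2t = np.exp(_log_softmax(label_sim))
    loss_i2t = -(target_i2t * _log_softmax(logits)).sum(axis=1).mean()

    # Text-to-image direction: same computation on the transposed matrices.
    target_t2i = np.exp(_log_softmax(label_sim.T))
    loss_t2i = -(target_t2i * _log_softmax(logits.T)).sum(axis=1).mean()

    return 0.5 * (loss_i2t + loss_t2i)
```

Under these assumptions, an image and a report from different patients that share the same findings receive a high target weight instead of being pushed apart as negatives, which is the false-negative problem described above.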