Du Yuexi, Chang Brian, Dvornek Nicha C
Department of Biomedical Engineering, Yale University, New Haven, CT, USA.
Department of Radiology & Biomedical Imaging, Yale University, New Haven, CT, USA.
Med Image Comput Comput Assist Interv. 2024 Oct;15012:465-475. doi: 10.1007/978-3-031-72390-2_44. Epub 2024 Oct 23.
Recent advancements in Contrastive Language-Image Pre-training (CLIP) [21] have demonstrated notable success in self-supervised representation learning across various tasks. However, existing CLIP-like approaches often demand extensive GPU resources and prolonged training times owing to the considerable size of the model and dataset, making them ill-suited to medical applications, where large datasets are not always available. Moreover, language model prompts are typically derived manually from the labels tied to images, potentially overlooking the richness of information within the training samples. We introduce a novel language-image Contrastive Learning method with an Efficient large language model and prompt Fine-Tuning (CLEFT) that harnesses the strengths of extensively pre-trained language and visual models. Furthermore, we present an efficient strategy for learning context-based prompts that mitigates the gap between informative clinical diagnostic data and simple class labels. Our method demonstrates state-of-the-art performance on multiple chest X-ray and mammography datasets compared with various baselines. The proposed parameter-efficient framework reduces the total trainable model size by 39% and shrinks the trainable language model to only 4% of the size of the current BERT encoder.
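The abstract combines two ingredients: a CLIP-style image-text contrastive objective and parameter-efficient tuning of a pre-trained language model through learnable context prompts. The sketch below is a minimal, hypothetical PyTorch illustration of that combination, not the authors' implementation: a small TransformerEncoder stands in for the frozen pre-trained LLM, and all names (PromptedTextEncoder, contrastive_loss), shapes, and hyperparameters are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptedTextEncoder(nn.Module):
    """Frozen text backbone plus a small set of learnable context-prompt tokens.
    Only self.ctx and self.proj receive gradients, mimicking parameter-efficient tuning."""
    def __init__(self, backbone, embed_dim=512, n_ctx=8, proj_dim=256):
        super().__init__()
        self.backbone = backbone                       # stand-in for a frozen pre-trained language model
        for p in self.backbone.parameters():
            p.requires_grad = False
        self.ctx = nn.Parameter(0.02 * torch.randn(n_ctx, embed_dim))   # learnable context prompts
        self.proj = nn.Linear(embed_dim, proj_dim)

    def forward(self, label_embeds):
        # label_embeds: (batch, n_label_tokens, embed_dim) — embedded class-label tokens
        ctx = self.ctx.unsqueeze(0).expand(label_embeds.size(0), -1, -1)
        tokens = torch.cat([ctx, label_embeds], dim=1)  # [learned context | label tokens]
        hidden = self.backbone(tokens)                  # (batch, seq, embed_dim)
        return self.proj(hidden[:, -1])                 # last-token pooled text feature

def contrastive_loss(img_feat, txt_feat, temperature=0.07):
    """Symmetric InfoNCE (CLIP-style) loss over matched image/text pairs."""
    img_feat = F.normalize(img_feat, dim=-1)
    txt_feat = F.normalize(txt_feat, dim=-1)
    logits = img_feat @ txt_feat.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Toy usage with random tensors standing in for real encoders and data.
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True), num_layers=2)
text_enc = PromptedTextEncoder(backbone)
img_feat = torch.randn(4, 256)                          # features from a (frozen) image encoder
label_embeds = torch.randn(4, 6, 512)                   # embedded label tokens for the same 4 samples
loss = contrastive_loss(img_feat, text_enc(label_embeds))
loss.backward()                                          # gradients flow only into ctx and proj
```

Because only the prompt embeddings and the projection head are updated while the language backbone stays frozen, the trainable text-side parameter count is a small fraction of the full encoder, which is the kind of reduction the abstract quantifies.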