Liu Che, Cheng Sibo, Shi Miaojing, Shah Anand, Bai Wenjia, Arcucci Rossella
IEEE Trans Med Imaging. 2025 Jan;44(1):519-529. doi: 10.1109/TMI.2024.3449690. Epub 2025 Jan 2.
In medical Vision-Language Pre-training (VLP), significant work focuses on extracting text and image features from clinical reports and medical images. Yet, existing methods may overlook the natural hierarchical structure of clinical reports, which are typically divided into 'findings' (descriptive content) and 'impression' (conclusive summary). Current VLP approaches tend to oversimplify these reports into a single entity or fragmented tokens, ignoring this structured format. In this work, we propose a novel clinical-prior-guided VLP framework named IMITATE that learns structural information from medical reports through hierarchical vision-language alignment. The framework derives multi-level visual features from chest X-ray (CXR) images and separately aligns these features with the descriptive and conclusive text encoded in the hierarchical medical report. Furthermore, a new clinical-informed contrastive loss is introduced for cross-modal learning, which incorporates clinical prior knowledge when formulating sample correlations for contrastive learning. The proposed model, IMITATE, outperforms baseline VLP methods across six different datasets spanning five medical imaging downstream tasks. Experimental results demonstrate the benefit of exploiting the hierarchical structure of medical reports for VLP. Code: https://github.com/cheliu-computation/IMITATE-TMI2024.
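To make the hierarchical alignment idea concrete, the following is a minimal PyTorch sketch, not the authors' implementation (see the linked repository for that). It illustrates the general pattern described in the abstract: lower-level visual features are aligned with the descriptive 'findings' text and higher-level visual features with the conclusive 'impression' text, each via a standard symmetric InfoNCE contrastive loss. The function names (info_nce, hierarchical_alignment_loss) and the use of plain one-hot InfoNCE targets are illustrative assumptions; IMITATE's clinical-informed loss additionally reshapes sample correlations using clinical priors, which this sketch omits.

import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    a, b: (N, D) L2-normalised embeddings. Diagonal pairs are
    positives; all other in-batch pairs serve as negatives.
    """
    logits = a @ b.t() / temperature          # (N, N) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    # Average the image-to-text and text-to-image directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def hierarchical_alignment_loss(low_level_img, high_level_img,
                                findings_txt, impression_txt):
    """Align multi-level visual features with the two report sections:
    lower-level (more local, descriptive) image features with 'findings',
    higher-level (more global, abstract) image features with 'impression'.
    All inputs: (N, D) projected embeddings from the respective encoders.
    """
    low  = F.normalize(low_level_img, dim=-1)
    high = F.normalize(high_level_img, dim=-1)
    find = F.normalize(findings_txt, dim=-1)
    imp  = F.normalize(impression_txt, dim=-1)
    return info_nce(low, find) + info_nce(high, imp)

# Toy usage: random tensors stand in for encoder outputs.
N, D = 8, 256
loss = hierarchical_alignment_loss(torch.randn(N, D), torch.randn(N, D),
                                   torch.randn(N, D), torch.randn(N, D))
print(loss.item())

In the actual framework, replacing the one-hot InfoNCE targets with soft targets derived from clinical prior knowledge is what distinguishes the clinical-informed contrastive loss from this generic formulation.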