Wang Rongsheng, Yao Qingsong, Jiang Zihang, Lai Haoran, He Zhiyang, Tao Xiaodong, Zhou S Kevin
School of Biomedical Engineering, Division of Life Sciences and Medicine, University of Science and Technology of China (USTC), Hefei, Anhui, 230026, China; Center for Medical Imaging, Robotics, Analytic Computing & Learning (MIRACLE), Suzhou Institute for Advanced Research, USTC, Suzhou, Jiangsu, 215123, China; Anhui iFLYTEK Co., Ltd., China.
Stanford University, Palo Alto, CA, 94025, United States.
Med Image Anal. 2025 Jun 26;105:103690. doi: 10.1016/j.media.2025.103690.
Despite significant advancements in medical vision-language pre-training, existing methods have largely overlooked the inherent linguistic complexity and imbalance within medical reports, as well as the complex cross-modality contextual relationships between texts and images. To close this gap, we propose a novel Entity-centered Context-aware Medical Vision-language Pre-training (ECAMP) framework, which establishes a more entity-centered, context-sensitive, and balanced understanding of medical reports to effectively pre-train the vision encoder. We first distill entity-centered context from medical reports using large language models, enabling ECAMP to draw more precise supervision from the text modality. By further incorporating entity-aware re-balancing and descriptor masking strategies into masked language modeling, ECAMP significantly enhances the knowledge of entities within the reports. A context-guided super-resolution task is proposed alongside a multi-scale context fusion design to improve the semantic integration of coarse- and fine-level image representations, which yields better performance on multi-scale downstream applications. ECAMP integrates these innovations, leading to significant performance gains over current state-of-the-art methods and establishing a new standard for cross-modality pre-training in medical imaging. The effectiveness of ECAMP is demonstrated by extensive experiments across multiple domains and organs, achieving cutting-edge results on tasks including classification, segmentation, and detection across 5 public chest X-ray datasets and 4 fundoscopy datasets.
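To make the masked-language-modeling ideas in the abstract concrete, the sketch below illustrates the general pattern of masking descriptor tokens more aggressively than ordinary tokens and attaching inverse-frequency loss weights so that rare (often abnormal) findings are not drowned out by frequent "normal" phrasing. This is an illustrative assumption, not the authors' implementation: the descriptor vocabulary, the masking probabilities, and the weighting formula are all hypothetical placeholders.

```python
import random

# Hypothetical descriptor vocabulary and corpus token frequencies; in practice
# these would be distilled from reports (e.g., via a large language model).
DESCRIPTORS = {"mild", "bilateral", "patchy"}
TOKEN_FREQ = {"no": 900, "effusion": 60, "mild": 40,
              "bilateral": 30, "patchy": 10, "opacity": 50}

def mask_report(tokens, p_base=0.15, p_descriptor=0.5, seed=0):
    """Mask descriptor tokens with higher probability and return
    inverse-frequency re-balancing weights for each token (illustrative)."""
    rng = random.Random(seed)
    masked, weights = [], []
    for tok in tokens:
        # Descriptors are masked more often so the model must predict them.
        p = p_descriptor if tok in DESCRIPTORS else p_base
        masked.append("[MASK]" if rng.random() < p else tok)
        # Rare tokens get larger loss weights than frequent ones.
        weights.append(1.0 / TOKEN_FREQ.get(tok, 1))
    return masked, weights

tokens = ["mild", "bilateral", "patchy", "opacity", "no", "effusion"]
masked, weights = mask_report(tokens)
print(masked)
print(weights)
```

With the fixed seed, the rare descriptor "patchy" is masked and receives the largest weight (0.1), while the very frequent token "no" keeps a small weight (1/900), mirroring the intuition of re-balanced, entity-focused supervision.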