Paul C. Lauterbur Research Center for Biomedical Imaging, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China; Peng Cheng Laboratory, Shenzhen 518066, China; University of Chinese Academy of Sciences, Beijing 100049, China.
Paul C. Lauterbur Research Center for Biomedical Imaging, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China.
Med Image Anal. 2024 Oct;97:103299. doi: 10.1016/j.media.2024.103299. Epub 2024 Aug 13.
Recently, vision-language representation learning has made remarkable advancements in building medical foundation models, holding immense potential for transforming the landscape of clinical research and medical care. The underlying hypothesis is that the rich knowledge embedded in radiology reports can effectively assist and guide the learning process, reducing the need for additional labels. However, these reports tend to be complex and sometimes contain redundant descriptions that make it too challenging for representation learning to capture the key semantic information. This paper develops a novel iterative vision-language representation learning framework by proposing a report refinement method that emphasizes key semantic knowledge. Specifically, raw radiology reports are refined to highlight the key information according to a constructed clinical dictionary and two model-optimized knowledge-enhancement metrics. The iterative framework is designed to learn progressively: it starts by gaining a general understanding of the patient's condition from the raw reports and gradually refines and extracts the critical information essential to fine-grained analysis tasks. The effectiveness of the proposed framework is validated on various downstream medical image analysis tasks, including disease classification, region-of-interest segmentation, and phrase grounding. Our framework surpasses seven state-of-the-art methods in both fine-tuning and zero-shot settings, demonstrating its encouraging potential for different clinical applications.
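Vision-language pretraining of the kind described above is typically driven by a symmetric image-text contrastive (InfoNCE) objective that pulls matched image-report pairs together and pushes mismatched pairs apart. The following is a minimal NumPy sketch of that generic objective only; it is not the paper's specific implementation, and the function name and temperature value are illustrative assumptions.

```python
import numpy as np

def infonce_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric image-text contrastive loss over a batch of paired embeddings.

    img_emb, txt_emb: (batch, dim) arrays where row i of each is a matched pair.
    """
    # L2-normalize so similarities are cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    logits = img @ txt.T / temperature  # pairwise similarity matrix
    labels = np.arange(len(img))        # matched pairs lie on the diagonal

    def cross_entropy(l):
        # numerically stable log-softmax per row, evaluated at the diagonal
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

With orthogonal, perfectly matched pairs the loss approaches zero, while random pairings give a loss near log(batch size); the iterative refinement described in the abstract changes what text reaches this objective, not the objective itself.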