Ryu Ji Seung, Kang Hyunyoung, Chu Yuseong, Yang Sejung
Department of Precision Medicine, Yonsei University Wonju College of Medicine, Wonju, Republic of Korea.
Department of Medical Informatics and Biostatistics, Yonsei University Wonju College of Medicine, Wonju, Republic of Korea.
Biomed Eng Lett. 2025 Jun 6;15(5):809-830. doi: 10.1007/s13534-025-00484-6. eCollection 2025 Sep.
Foundation models, including large language models and vision-language models (VLMs), have revolutionized artificial intelligence by enabling efficient, scalable, and multimodal learning across diverse applications. By leveraging advancements in self-supervised and semi-supervised learning, these models integrate computer vision and natural language processing to address complex tasks such as disease classification, segmentation, cross-modal retrieval, and automated report generation. Their ability to pretrain on vast, uncurated datasets minimizes reliance on annotated data while improving generalization and adaptability across a wide range of downstream tasks. In the medical domain, foundation models address critical challenges by combining information from various medical imaging modalities with textual data from radiology reports and clinical notes. This integration has enabled the development of tools that streamline diagnostic workflows, enhance accuracy, and support robust decision-making. This review provides a systematic examination of recent advancements in medical VLMs from 2022 to 2024, focusing on modality-specific approaches and tailored applications in medical imaging. The key contributions include a structured taxonomy that categorizes existing models, an in-depth analysis of the datasets essential for training and evaluation, and a review of practical applications. The review also addresses ongoing challenges and proposes future directions for enhancing the accessibility and impact of foundation models in healthcare.
The online version contains supplementary material available at 10.1007/s13534-025-00484-6.