Park Sangjoon, Lee Eun Sun, Shin Kyung Sook, Lee Jeong Eun, Ye Jong Chul
Department of Radiation Oncology, Yonsei College of Medicine, Seoul, Republic of Korea.
Chung-Ang University Hospital, Seoul, Republic of Korea.
Med Image Anal. 2024 Jan;91:103021. doi: 10.1016/j.media.2023.103021. Epub 2023 Nov 7.
The escalating demand for artificial intelligence (AI) systems that can monitor and supervise human errors and abnormalities in healthcare presents unique challenges. Recent advances in vision-language models suggest that such monitoring AI could be realized by understanding both visual and textual concepts and their semantic correspondences. However, vision-language models have seen limited success in the medical domain. Current vision-language models and learning strategies, designed for photographic images and captions, require a web-scale corpus of image-text pairs, which is rarely feasible in the medical domain. To address this, we present the medical cross-attention vision-language model (Medical X-VL), which is built from key components tailored to the medical domain: self-supervised unimodal models for medical data and a fusion encoder to bridge them, momentum distillation, sentence-wise contrastive learning for medical reports, and sentence-similarity-adjusted hard negative mining. We demonstrate experimentally that our model enables various zero-shot tasks for monitoring AI, ranging from zero-shot classification to zero-shot error correction. Our model outperformed current state-of-the-art models on two medical image datasets, suggesting a novel clinical application of our monitoring AI model to alleviate human errors. Our method demonstrates a more specialized capacity for fine-grained understanding, a distinct advantage that is particularly applicable to the medical domain.
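The abstract's combination of sentence-wise contrastive learning with sentence-similarity-adjusted hard negative mining can be illustrated in miniature. The sketch below is an assumption-laden simplification, not the paper's implementation: it computes an InfoNCE-style image-to-sentence loss in which off-diagonal (negative) pairs are down-weighted by the similarity of their report sentences, so near-duplicate reports are not penalized as hard negatives. All function names and the `0.07` temperature are illustrative choices.

```python
import numpy as np

def normalize(x):
    """L2-normalize embeddings along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def sentencewise_contrastive_loss(img_emb, sent_emb, sent_sim, tau=0.07):
    """Illustrative InfoNCE-style image-to-sentence contrastive loss.

    img_emb:  (N, d) image embeddings
    sent_emb: (N, d) matching report-sentence embeddings (row i pairs with row i)
    sent_sim: (N, N) sentence-sentence similarity in [0, 1]; values near 1 mark
              near-duplicate reports that should NOT act as hard negatives
    """
    z_i, z_s = normalize(img_emb), normalize(sent_emb)
    logits = z_i @ z_s.T / tau  # (N, N) temperature-scaled similarity scores

    # Similarity-adjusted negative weighting: negatives whose report text is
    # nearly identical to the positive's are likely false negatives, so their
    # contribution to the denominator is reduced.
    neg_weight = 1.0 - sent_sim
    np.fill_diagonal(neg_weight, 1.0)  # keep the positive pair at full weight

    # Numerically stable softmax denominator with weighted negatives.
    exp_logits = np.exp(logits - logits.max(axis=1, keepdims=True)) * neg_weight
    log_prob = np.log(np.diag(exp_logits) / exp_logits.sum(axis=1))
    return float(-log_prob.mean())
```

Down-weighting semantically equivalent sentences shrinks the softmax denominator, so the loss decreases as duplicate "negatives" are discounted; with `sent_sim` all zeros, the expression reduces to a standard contrastive loss over all in-batch negatives.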