Zhang Xiaoman, Wu Chaoyi, Zhao Ziheng, Lin Weixiong, Zhang Ya, Wang Yanfeng, Xie Weidi
Shanghai Jiao Tong University, Shanghai, China.
Shanghai Artificial Intelligence Laboratory, Shanghai, China.
Commun Med (Lond). 2024 Dec 21;4(1):277. doi: 10.1038/s43856-024-00709-2.
Medical Visual Question Answering (MedVQA) enhances diagnostic accuracy and healthcare delivery by leveraging artificial intelligence to interpret medical images. This study aims to redefine MedVQA as a generation task that mirrors human-machine interaction and to develop a model capable of integrating complex visual and textual information.
We constructed a large-scale medical visual-question answering dataset, PMC-VQA, containing 227,000 VQA pairs across 149,000 images that span various modalities and diseases. We introduced a generative model that aligns visual information from a pre-trained vision encoder with a large language model. This model was initially trained on PMC-VQA and subsequently fine-tuned on multiple public benchmarks.
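As a rough illustration of what aligning a vision encoder with a language model can look like, the sketch below projects image patch features into the language model's embedding space and prepends them to the text token embeddings. The abstract does not specify the alignment mechanism, so the linear projection, the function names, and all dimensions here are assumptions chosen for illustration, not the authors' architecture.

```python
import random


def project_visual_features(patch_features, weight, bias):
    """Map each vision-encoder feature vector into the LLM embedding space.

    Hypothetical helper: a single linear layer, with `weight` given as
    d_llm rows of length d_vis and `bias` as a list of length d_llm.
    """
    projected = []
    for feat in patch_features:
        row = [
            sum(f * w for f, w in zip(feat, w_row)) + b
            for w_row, b in zip(weight, bias)
        ]
        projected.append(row)
    return projected


def build_llm_input(visual_tokens, text_embeddings):
    """Prepend projected visual tokens to the question's token embeddings."""
    return visual_tokens + text_embeddings


# Toy dimensions (assumed): 3 image patches, 5 question tokens.
rng = random.Random(0)
d_vis, d_llm, n_patches, n_text = 4, 6, 3, 5

patch_features = [[rng.gauss(0, 1) for _ in range(d_vis)] for _ in range(n_patches)]
text_embeddings = [[rng.gauss(0, 1) for _ in range(d_llm)] for _ in range(n_text)]
weight = [[rng.gauss(0, 0.1) for _ in range(d_vis)] for _ in range(d_llm)]
bias = [0.0] * d_llm

visual_tokens = project_visual_features(patch_features, weight, bias)
llm_input = build_llm_input(visual_tokens, text_embeddings)
```

In a setup like this, the language model would consume `llm_input` as one sequence and generate the free-form answer token by token; only the projection parameters need to be learned if the encoder and language model are kept frozen, which is one common design choice among several.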
Here, we show that our model significantly outperforms existing MedVQA models in generating relevant, accurate free-form answers. We also propose a manually verified test set that presents a greater challenge and serves as a robust measure to monitor the advancement of generative MedVQA methods.
The PMC-VQA dataset proves to be an essential resource for the research community, and our model marks a significant breakthrough in MedVQA. We maintain a leaderboard to facilitate comprehensive evaluation and comparison, providing a centralized resource for benchmarking state-of-the-art approaches.