
Vision-Language Model for Visual Question Answering in Medical Imagery.

Author Information

Yakoub Bazi, Mohamad Mahmoud Al Rahhal, Laila Bashmal, Mansour Zuair

Affiliations

Computer Engineering Department, College of Computer and Information Sciences, King Saud University, Riyadh 11543, Saudi Arabia.

Applied Computer Science Department, College of Applied Computer Science, King Saud University, Riyadh 11543, Saudi Arabia.

Publication Information

Bioengineering (Basel). 2023 Mar 20;10(3):380. doi: 10.3390/bioengineering10030380.

Abstract

In the clinical and healthcare domains, medical images play a critical role. A mature medical visual question answering (VQA) system could improve diagnosis by answering clinical questions posed about a medical image. Despite its enormous potential for the healthcare industry and its services, this technology is still in its infancy and far from practical use. This paper introduces an approach based on a transformer encoder-decoder architecture. Specifically, we extract image features using the vision transformer (ViT) model and embed the question using a textual encoder transformer. We then concatenate the resulting visual and textual representations and feed them into a multi-modal decoder, which generates the answer autoregressively. In the experiments, we validate the proposed model on two medical VQA datasets: VQA-RAD (radiology images) and PathVQA (pathology images). The model shows promising results compared to existing solutions, yielding closed and open accuracies of 84.99% and 72.97%, respectively, on VQA-RAD, and 83.86% and 62.37%, respectively, on PathVQA. Other metrics, such as the BLEU score, which measures the alignment between the predicted and ground-truth answer sentences, are also reported.
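To make the described pipeline concrete, below is a minimal PyTorch sketch of the encoder-decoder flow the abstract outlines: visual and textual features are concatenated and cross-attended by an autoregressive answer decoder. The dimensions, layer counts, vocabulary size, and the use of generic nn.Transformer modules are illustrative assumptions (they stand in for the paper's pretrained ViT and textual encoder), and positional encodings are omitted for brevity; this is a sketch of the technique, not the authors' implementation.

```python
import torch
import torch.nn as nn


class MedicalVQA(nn.Module):
    """Encoder-decoder VQA sketch: visual and textual representations are
    concatenated into one memory that an answer decoder attends to."""

    def __init__(self, vocab_size=30522, d_model=768, n_heads=12,
                 n_layers=6, patch_dim=768):
        super().__init__()
        # Stand-in for the pretrained ViT image encoder of the paper:
        # a generic transformer encoder over projected patch features.
        self.patch_proj = nn.Linear(patch_dim, d_model)
        self.image_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            n_layers)
        # Stand-in for the textual question encoder (token embeddings
        # shared with the decoder for simplicity).
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            n_layers)
        # Multi-modal decoder: cross-attends to the concatenated
        # visual + textual memory and emits answer tokens one by one.
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True),
            n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, patch_feats, question_ids, answer_ids):
        img = self.image_encoder(self.patch_proj(patch_feats))
        txt = self.text_encoder(self.token_emb(question_ids))
        memory = torch.cat([img, txt], dim=1)   # concatenate the two modalities
        causal = nn.Transformer.generate_square_subsequent_mask(answer_ids.size(1))
        out = self.decoder(self.token_emb(answer_ids), memory, tgt_mask=causal)
        return self.lm_head(out)                # next-token logits


# Shape check only; real inputs would come from a ViT and a tokenizer.
model = MedicalVQA()
patches = torch.randn(2, 197, 768)              # e.g. ViT-B/16 patch features
question = torch.randint(0, 30522, (2, 16))     # tokenized question
answer = torch.randint(0, 30522, (2, 8))        # teacher-forced answer prefix
logits = model(patches, question, answer)       # (2, 8, vocab_size)
```

At inference time, the decoder would be run token by token, feeding each predicted answer token back in until an end-of-sequence token is produced, which is what "generating the answer in an autoregressive way" refers to.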

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0727/10045796/802acae18208/bioengineering-10-00380-g001.jpg
