[Image-aware generative medical visual question answering based on image caption prompts].

Authors

Wang Rui, Meng Jiana, Yu Yuhai, Han Siwei, Li Xinghao

Affiliation

Computer Science and Engineering College, Dalian Minzu University, Dalian, Liaoning 116650, P. R. China.

Publication information

Sheng Wu Yi Xue Gong Cheng Xue Za Zhi. 2025 Jun 25;42(3):560-566. doi: 10.7507/1001-5515.202412040.

DOI:10.7507/1001-5515.202412040

PMID:40566779

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12236207/

Abstract

Medical visual question answering (MVQA) plays a crucial role in computer-aided diagnosis and telemedicine. Because MVQA datasets are small and their annotation quality is uneven, most existing methods rely on additional datasets for pre-training and use a discriminative formulation that predicts answers from a predefined set of labels. This approach makes models prone to overfitting in low-resource domains. To address these problems, we propose an image-aware generative MVQA method based on image caption prompts. First, we combine a dual visual feature extractor with a progressive bilinear attention interaction module to extract multi-level image features. Second, we propose an image caption prompt method that guides the model to better understand the image information. Finally, an image-aware generative model generates the answers. Experimental results show that the proposed method outperforms existing models on the MVQA task, achieving efficient visual feature extraction and flexible, accurate answer output at small computational cost in low-resource domains. This is of great significance for achieving personalized precision medicine, reducing the medical burden, and improving the efficiency of medical diagnosis.
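To make the image caption prompt idea more concrete, the following is a minimal sketch of how a caption and a question might be merged into one text prompt for a generative model. The abstract does not give the prompt wording, so the function name `build_caption_prompt`, the template string, and the sample caption and question are all hypothetical and are not the authors' implementation.

```python
def build_caption_prompt(caption: str, question: str) -> str:
    """Combine an image caption and a question into a single text prompt.

    The template below is an assumption made for illustration only;
    the abstract does not specify the actual prompt format.
    """
    # Collapse stray whitespace so the prompt has a stable shape.
    caption = " ".join(caption.split())
    question = " ".join(question.split())
    # The trailing "answer:" cue leaves the answer open-ended, in contrast
    # to choosing from a predefined set of labels.
    return f"caption: {caption} question: {question} answer:"


# Hypothetical example inputs.
prompt = build_caption_prompt(
    "Chest X-ray, frontal view.",
    "Is there a pleural effusion?",
)
print(prompt)
```

In a setup like this, the prompt text would be passed to the generative model together with the extracted image features, so the caption acts as an additional textual hint about the image content.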



