Wang Rui, Meng Jiana, Yu Yuhai, Han Siwei, Li Xinghao
Computer Science and Engineering College, Dalian Minzu University, Dalian, Liaoning 116650, P. R. China.
Sheng Wu Yi Xue Gong Cheng Xue Za Zhi. 2025 Jun 25;42(3):560-566. doi: 10.7507/1001-5515.202412040.
Medical visual question answering (MVQA) plays a crucial role in computer-aided diagnosis and telemedicine. Because MVQA datasets are small and unevenly annotated, most existing methods rely on additional datasets for pre-training and adopt a discriminative formulation that predicts answers from a predefined label set, which makes the model prone to overfitting in low-resource domains. To address these problems, we propose an image-aware generative MVQA method based on image caption prompts. First, we combine a dual visual feature extractor with a progressive bilinear attention interaction module to extract multi-level image features. Second, we propose an image caption prompting method that guides the model toward a better understanding of the image. Finally, an image-aware generative model produces the answers. Experimental results show that the proposed method outperforms existing models on the MVQA task, achieving efficient visual feature extraction and flexible, accurate answer generation at a small computational cost in low-resource domains. This is significant for realizing personalized precision medicine, reducing the medical burden, and improving diagnostic efficiency.
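The three-step pipeline in the abstract (dual visual features fused by progressive bilinear attention, a caption prompt, then generative answering) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: all shapes, the residual-fusion form of the bilinear attention, the number of progressive stages, and the prompt template are assumptions.

```python
# Illustrative sketch of an image-aware generative MVQA pipeline
# with caption prompts. Shapes and module choices are assumed.
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bilinear_attention(v_a, v_b, w):
    """Fuse two visual feature sets via a bilinear attention map.
    v_a: (n, d), v_b: (m, d), w: (d, d) learned bilinear weights."""
    scores = v_a @ w @ v_b.T        # (n, m) bilinear compatibility
    att = softmax(scores, axis=-1)  # attend from v_a over v_b
    return v_a + att @ v_b          # residual fusion, shape (n, d)

d = 8
# Stand-ins for the dual extractor, e.g. CNN grid features
# and transformer patch features of the same image.
cnn_feats = rng.standard_normal((4, d))
vit_feats = rng.standard_normal((6, d))

# "Progressive" interaction modeled here as repeated bilinear
# attention stages refining the fused representation.
fused = cnn_feats
for _ in range(2):
    w = rng.standard_normal((d, d)) * 0.1
    fused = bilinear_attention(fused, vit_feats, w)

# Image caption prompt: prepend a generated caption to the question
# so the generative answerer sees an explicit textual view of the image.
caption = "chest X-ray, frontal view"  # would come from a captioning model
question = "Is there cardiomegaly?"
prompt = f"Caption: {caption}. Question: {question} Answer:"

print(fused.shape)  # fused multi-level visual features fed to the decoder
print(prompt)       # text input paired with the visual features
```

In a full system, `fused` and `prompt` would be passed to a seq2seq decoder that generates the answer token by token, rather than classifying over a fixed label set.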