College of Electronics and Information Engineering, Sichuan University, Chengdu 610064, China.
Department of Computer Science, University of Maryland, College Park, MD 20742, USA.
Math Biosci Eng. 2022 Jul 20;19(10):10192-10212. doi: 10.3934/mbe.2022478.
Medical visual question answering (Med-VQA) aims to leverage a pre-trained artificial intelligence model to answer clinical questions raised by doctors or patients about radiology images. However, owing to the high level of professional expertise required in the medical field and the difficulty of annotating medical data, Med-VQA lacks the large-scale, well-annotated radiology image datasets needed for training. To address this problem, researchers have mainly focused on improving the model's visual feature extractor. Few studies, however, have addressed textual feature extraction, and most underestimate the interactions between corresponding visual and textual features. In this study, we propose a corresponding feature fusion (CFF) method to strengthen the interactions between specific features from corresponding radiology images and questions. In addition, we design a semantic attention (SA) module for textual feature extraction, which helps the model focus on the meaningful words in each question while reducing the attention paid to insignificant information. Extensive experiments demonstrate that the proposed method achieves competitive results on two benchmark datasets and outperforms existing state-of-the-art methods in answer prediction accuracy. The experimental results also show that our model performs semantic understanding during answer prediction, which offers clear advantages in Med-VQA.
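The semantic attention idea summarized above, weighting question tokens by learned importance so meaningful words dominate the pooled representation, can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation; the function name, parameter shapes, and scoring scheme are assumptions introduced here for clarity.

```python
import numpy as np

def semantic_attention(token_feats, w, b=0.0):
    """Pool token embeddings with learned per-token importance weights.

    token_feats: (T, D) array of question-token embeddings.
    w: (D,) scoring vector; b: scalar bias (illustrative parameters,
    not taken from the paper).
    Returns a (D,) attended question representation.
    """
    scores = token_feats @ w + b           # (T,) raw importance per token
    scores = scores - scores.max()         # subtract max for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum()      # softmax over the T tokens
    return weights @ token_feats           # importance-weighted sum -> (D,)

# Toy usage: a question of 4 tokens with 3-dimensional embeddings.
rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 3))
pooled = semantic_attention(feats, rng.standard_normal(3))
```

With a zero scoring vector the softmax weights are uniform and the pooled vector reduces to the mean token embedding; a trained scoring vector instead concentrates weight on clinically meaningful words in the question.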