IEEE J Biomed Health Inform. 2024 Sep;28(9):5600-5612. doi: 10.1109/JBHI.2024.3414413. Epub 2024 Sep 5.
Medical report generation, a cross-modal automatic text generation task, is highly significant in both research and clinical settings. Its core task is to generate diagnostic reports in clinical language from medical images. However, several limitations persist, including a lack of global information, inadequate cross-modal fusion, and high computational demands. To address these issues, we propose the cross-modal global feature fusion Transformer (CGFTrans), which extracts global information while reducing computational cost. First, we introduce a mesh recurrent network to capture inter-layer information at different levels, addressing the absence of global features. Then, we design a feature fusion decoder with a 'mid-fusion' strategy that separately fuses visual and global features with medical report embeddings, strengthening cross-modal joint learning. Finally, we integrate shifted window attention into the Transformer encoder to alleviate computational pressure and capture pathological information at multiple scales. Extensive experiments on three datasets show that the proposed method achieves average gains of 2.9%, 1.5%, and 0.7% on the BLEU-1, METEOR, and ROUGE-L metrics, respectively, compared with the baselines. In addition, it reduces training time by 22.4% and increases image throughput by 17.3% on average.
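The abstract does not spell out the 'mid-fusion' decoder in detail; the following is a minimal numpy sketch of the general idea as we read it, where report-token embeddings cross-attend to local visual features and to global features in separate streams before being combined. All names, shapes, and the averaging rule for merging the two streams are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    # scaled dot-product attention: queries come from the text stream,
    # keys/values come from an image-feature stream
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ keys_values

def mid_fusion_step(report_emb, visual_feats, global_feats):
    # fuse text separately with local visual features and with global
    # features, then merge the two streams (residual + average is an
    # assumed fusion rule for illustration only)
    fused_visual = cross_attention(report_emb, visual_feats)
    fused_global = cross_attention(report_emb, global_feats)
    return report_emb + 0.5 * (fused_visual + fused_global)

rng = np.random.default_rng(0)
text = rng.standard_normal((12, 64))   # 12 report tokens, dim 64
vis = rng.standard_normal((49, 64))    # e.g. 7x7 grid of patch features
glob = rng.standard_normal((4, 64))    # hypothetical global features
out = mid_fusion_step(text, vis, glob)
print(out.shape)  # (12, 64): one fused embedding per report token
```

Keeping the visual and global streams separate until this step is what distinguishes a mid-fusion design from early fusion (concatenating features before encoding) or late fusion (merging only final predictions).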