Wang Jun, Bhalerao Abhir, Yin Terry, See Simon, He Yulan
IEEE J Biomed Health Inform. 2024 Jan 16;PP. doi: 10.1109/JBHI.2024.3354712.
Radiology report generation (RRG) has attracted increasing research attention because of its potential to mitigate medical resource shortages and to aid radiologists in disease decision making. Recent advances in RRG are largely driven by improving a model's capacity to encode single-modal feature representations, while few studies explicitly explore the cross-modal alignment between image regions and words. Radiologists typically focus first on abnormal image regions before composing the corresponding text descriptions; cross-modal alignment is therefore of great importance for learning an RRG model that is aware of abnormalities in the image. Motivated by this, we propose a Class Activation Map guided Attention Network (CAMANet), which explicitly promotes cross-modal alignment by employing aggregated class activation maps to supervise cross-modal attention learning, while simultaneously enriching the discriminative information. Experimental results demonstrate that CAMANet outperforms previous state-of-the-art methods on two commonly used RRG benchmarks.
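The central idea described above, aggregating class activation maps (CAMs) into a patch-importance target that supervises word-to-patch attention, can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual implementation: the function names, tensor shapes, probability-weighted aggregation, and the KL-divergence supervision loss are all assumptions for exposition.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def aggregated_cam(features, class_weights, class_probs):
    """Aggregate per-class CAMs into one patch-importance distribution.

    features:      (N, D) spatial patch features from the visual encoder
    class_weights: (C, D) classifier weights (one row per disease label)
    class_probs:   (C,)   predicted label probabilities
    """
    cams = features @ class_weights.T   # (N, C): per-class activation maps
    agg = cams @ class_probs            # (N,): probability-weighted aggregation
    return softmax(agg)                 # normalize to a distribution over patches

def attention_supervision_loss(attn, cam_target, eps=1e-8):
    """KL(cam_target || attn) averaged over words.

    attn:       (T, N) cross-modal attention, each word's row sums to 1
    cam_target: (N,)   aggregated CAM distribution acting as supervision
    """
    kl = (cam_target * (np.log(cam_target + eps) - np.log(attn + eps))).sum(axis=-1)
    return kl.mean()
```

In this sketch, the loss pulls every word's attention row toward the CAM-derived patch distribution; a real system would likely apply it selectively (e.g. only to visually grounded words) and backpropagate it jointly with the report-generation objective.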