IEEE Trans Med Imaging. 2022 Oct;41(10):2803-2813. doi: 10.1109/TMI.2022.3171661. Epub 2022 Sep 30.
Automated radiographic report generation is challenging in at least two respects. First, medical images are very similar to each other, and the visual differences of clinical importance are often fine-grained. Second, disease-related words may be submerged by the many similar sentences that describe the common content of the images, so that in the worst case an abnormal finding is misreported as normal. To tackle these challenges, this paper proposes a pure transformer-based framework that jointly enforces better visual-textual alignment, multi-label diagnostic classification, and word importance weighting to facilitate report generation. To the best of our knowledge, this is the first pure transformer-based framework for medical report generation, and it benefits from the transformer's capacity to learn long-range dependencies among both image regions and sentence words. Specifically, for the first challenge, we design a novel mechanism that embeds an auxiliary image-text matching objective into the transformer's encoder-decoder structure, so that better-correlated image and text features can be learned to help the report discriminate between similar images. For the second challenge, we integrate an additional multi-label classification task into our framework to guide the model toward correct diagnostic predictions. In addition, we propose a term-weighting scheme that reflects the importance of words during training, so that the model does not miss key discriminative information. Our work achieves promising performance compared with state-of-the-art methods on two benchmark datasets, including the largest dataset, MIMIC-CXR.
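The three training signals named in the abstract (term-weighted report generation, multi-label classification, and image-text matching) could plausibly be combined into a single objective. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not give the loss formulas, so the function names, the binary cross-entropy form of the auxiliary losses, the additive combination, and the coefficients `lambda_cls` and `lambda_match` are all hypothetical.

```python
import math


def weighted_token_nll(token_log_probs, token_weights):
    """Weighted average negative log-likelihood over report tokens.

    Assumption: each token carries a scalar importance weight, so that
    rare disease-related words contribute more than common boilerplate.
    """
    total_weight = sum(token_weights)
    weighted_sum = sum(w * lp for w, lp in zip(token_weights, token_log_probs))
    return -weighted_sum / total_weight


def binary_cross_entropy(probs, labels, eps=1e-12):
    """Mean binary cross-entropy over independent binary targets."""
    terms = [
        -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
        for p, y in zip(probs, labels)
    ]
    return sum(terms) / len(terms)


def joint_loss(token_log_probs, token_weights,
               disease_probs, disease_labels,
               match_prob, is_matched,
               lambda_cls=1.0, lambda_match=1.0):
    """Hypothetical additive combination of the three objectives."""
    generation = weighted_token_nll(token_log_probs, token_weights)
    classification = binary_cross_entropy(disease_probs, disease_labels)
    matching = binary_cross_entropy([match_prob], [is_matched])
    return generation + lambda_cls * classification + lambda_match * matching


# Toy report of three tokens; the last one stands for a poorly predicted
# disease-related word.
log_probs = [math.log(0.9), math.log(0.9), math.log(0.1)]
uniform_loss = weighted_token_nll(log_probs, [1.0, 1.0, 1.0])
upweighted_loss = weighted_token_nll(log_probs, [1.0, 1.0, 5.0])
```

With these toy numbers, raising the weight of the poorly predicted token increases the generation loss, which is the intended effect of a term-weighting scheme: missing a key diagnostic word costs more than missing a common one.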