Edward Vendrow, Ethan Schonfeld
Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, 50 Vassar St, Cambridge, MA, United States of America.
School of Medicine, Stanford University, 291 Campus Drive, Stanford, CA, United States of America.
Heliyon. 2023 Jul 10;9(7):e17968. doi: 10.1016/j.heliyon.2023.e17968. eCollection 2023 Jul.
The image captioning task is increasingly prevalent in artificial intelligence applications for medicine. One important application is clinical report generation from chest radiographs. The clinical writing of unstructured reports is time-consuming and error-prone. An automated system would improve standardization, reduce errors, save time, and broaden medical accessibility. In this paper, we demonstrate the importance of domain-specific pre-training and propose a modified transformer architecture for the medical image captioning task. To accomplish this, we train a series of modified transformers to generate clinical reports from chest radiograph image input. These modified transformers include: a meshed-memory augmented transformer architecture with a visual extractor using ImageNet pre-trained weights, a meshed-memory augmented transformer architecture with a visual extractor using CheXpert pre-trained weights, and a meshed-memory augmented transformer whose encoder is passed the concatenated embeddings from both ImageNet pre-trained and CheXpert pre-trained visual extractors. We use BLEU-1 through BLEU-4, ROUGE-L, CIDEr, and the clinical CheXbert F1 score to validate our models, and demonstrate scores competitive with state-of-the-art models. We provide evidence that ImageNet pre-training is ill-suited to the medical image captioning task, especially for less frequent conditions (e.g., enlarged cardiomediastinum, lung lesion, pneumothorax). Furthermore, we demonstrate that the double feature model improves performance for specific medical conditions (edema, consolidation, pneumothorax, support devices) and the overall CheXbert F1 score, and should be further developed in future work. Such a double feature model, including both ImageNet pre-training and domain-specific pre-training, could be used in a wide range of image captioning models in medicine.
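The core of the double feature model described above is that per-region visual embeddings from the ImageNet-pretrained and CheXpert-pretrained extractors are concatenated before being passed to the meshed-memory transformer encoder. The following is a minimal illustrative sketch of that fusion step only (not the authors' code; function names, region counts, and feature dimensions are hypothetical, and real implementations would operate on framework tensors rather than Python lists):

```python
def concat_features(imagenet_feats, chexpert_feats):
    """Fuse per-region embeddings from two pre-trained visual extractors.

    imagenet_feats, chexpert_feats: lists of per-region feature vectors.
    Both extractors must yield the same number of regions, but their
    feature dimensions may differ; fusion is simple concatenation,
    so the fused dimension is the sum of the two.
    """
    assert len(imagenet_feats) == len(chexpert_feats), "region counts must match"
    # Concatenate the two feature vectors for each image region.
    return [a + b for a, b in zip(imagenet_feats, chexpert_feats)]

# Toy example: 2 regions, 3-dim ImageNet features, 2-dim CheXpert features.
imagenet = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
chexpert = [[0.7, 0.8], [0.9, 1.0]]
fused = concat_features(imagenet, chexpert)
# Each fused region vector has dimension 3 + 2 = 5 and would then be
# fed to the transformer encoder in place of a single extractor's output.
```

In a deep-learning framework this corresponds to concatenating the two extractors' outputs along the feature (channel) dimension, so the encoder sees both generic ImageNet features and domain-specific CheXpert features for every region.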