Institute for Computing and Information Sciences, Radboud University, Nijmegen, The Netherlands.
Department of Enterprise Engineering, University of Rome Tor Vergata, Rome, Italy.
Artif Intell Med. 2021 Jun;116:102075. doi: 10.1016/j.artmed.2021.102075. Epub 2021 Apr 15.
Radiology reports are of core importance for the communication between the radiologist and the clinician. A computer-aided radiology reporting system can assist radiologists in this task and reduce variation between reports, thus facilitating communication with the medical doctor or clinician. Producing a well-structured, clear, and clinically well-focused radiology report is essential for high-quality patient diagnosis and care. Despite recent advances in deep learning for image caption generation, this task remains highly challenging in a medical setting. Research has mainly focused on the design of tailored machine learning methods for this task, while little attention has been devoted to the development of evaluation metrics to assess the quality of AI-generated documents. Conventional quality metrics for natural language processing methods, such as the popular BLEU score, provide little information about the quality of the diagnostic content of AI-generated radiology reports. In particular, because radiology reports often use standardized sentences, BLEU scores of generated reports can be high even when they lack diagnostically important information. We investigate this problem and propose a new measure that quantifies the diagnostic content of AI-generated radiology reports. In addition, we exploit the standardization of reports by generating a sequence of sentences: instead of using a dictionary of words, as current image captioning methods do, we use a dictionary of sentences. The assumption underlying this choice is that radiologists use a well-focused vocabulary of 'standard' sentences, which should suffice for composing most reports. As a by-product, a significant training speed-up is achieved compared to models trained on a dictionary of words. Overall, the results of our investigation indicate that standard validation metrics for AI-generated documents are weakly correlated with the diagnostic content of the reports. Therefore, these measures should not be used as the only validation metrics, and measures evaluating diagnostic content should be preferred in such a medical context.
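To make the BLEU limitation concrete, the following minimal sketch (not taken from the paper; the report sentences and the use of NLTK's sentence_bleu are illustrative assumptions) compares a reference report with a generated report that reuses two standardized sentences verbatim but omits the effusion finding; the score stays high relative to typical image-captioning BLEU values even though diagnostically important content is missing.

```python
# A minimal sketch, assuming NLTK's corpus-standard BLEU implementation and
# made-up report sentences: it illustrates how verbatim reuse of standardized
# sentences inflates BLEU even when a key finding is omitted.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Hypothetical reference report: three standardized sentences, one finding.
reference = (
    "the heart size is normal . "
    "the lungs are clear . "
    "there is a small right pleural effusion ."
).split()

# Hypothetical generated report: the first two standard sentences are copied
# verbatim, but the effusion finding is missing.
generated = (
    "the heart size is normal . "
    "the lungs are clear ."
).split()

smooth = SmoothingFunction().method1
score = sentence_bleu([reference], generated, smoothing_function=smooth)
# Every n-gram of the generated text occurs in the reference, so all n-gram
# precisions are 1.0; only the brevity penalty reduces the score (about 0.48),
# which is high for report generation despite the missing finding.
print(f"BLEU-4: {score:.3f}")
```

A diagnostic-content measure of the kind proposed in the paper would instead penalize the absent effusion finding regardless of how much n-gram overlap the standardized sentences provide.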