Banerjee Oishi, Saenz Agustina, Wu Kay, Clements Warren, Zia Adil, Buensalido Dominic, Kavnoudias Helen, Abi-Ghanem Alain S, Ghawi Nour El, Luna Cibele, Castillo Patricia, Al-Surimi Khaled, Daghistani Rayyan A, Chen Yuh-Min, Chao Heng-Sheng, Heiliger Lars, Kim Moon, Haubold Johannes, Jonske Frederic, Rajpurkar Pranav
Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
Department of Radiology, Alfred Health, Melbourne, Victoria, Australia.
Pac Symp Biocomput. 2025;30:185-198. doi: 10.1142/9789819807024_0014.
Given the rapidly expanding capabilities of generative AI models for radiology, there is a need for robust metrics that can accurately measure the quality of AI-generated radiology reports across diverse hospitals. We develop ReXamine-Global, an LLM-powered, multi-site framework that tests metrics across different writing styles and patient populations, exposing gaps in their generalization. First, our method tests whether a metric is undesirably sensitive to reporting style, providing different scores depending on whether AI-generated reports are stylistically similar to ground-truth reports or not. Second, our method measures whether a metric reliably agrees with experts, or whether metric and expert scores of AI-generated report quality diverge at some sites. Using 240 reports from 6 hospitals around the world, we apply ReXamine-Global to 7 established report-evaluation metrics and uncover serious gaps in their generalizability. Developers can apply ReXamine-Global when designing new report-evaluation metrics, ensuring their robustness across sites. Additionally, our analysis of existing metrics can guide users of those metrics toward evaluation procedures that work reliably at their sites of interest.