Croxford Emma, Gao Yanjun, First Elliot, Pellegrino Nicholas, Schnier Miranda, Caskey John, Oguss Madeline, Wills Graham, Chen Guanhua, Dligach Dmitriy, Churpek Matthew M, Mayampurath Anoop, Liao Frank, Goswami Cherodeep, Wong Karen K, Patterson Brian W, Afshar Majid
Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison, USA.
Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, USA.
medRxiv. 2025 May 6:2025.04.22.25326219. doi: 10.1101/2025.04.22.25326219.
Electronic Health Records (EHRs) store vast amounts of clinical information that healthcare providers struggle to summarize and synthesize into details relevant to their practice. To reduce this cognitive load, generative AI built on Large Language Models (LLMs) has emerged to automatically summarize patient records into clear, actionable insights. However, LLM summaries must be precise and free from errors, making evaluation of summary quality necessary. While human experts are the gold standard for such evaluations, their involvement is time-consuming and costly. We therefore introduce and validate an automated method for evaluating real-world EHR multi-document summaries that uses an LLM as the evaluator, referred to as LLM-as-a-Judge. Benchmarked against the validated Provider Documentation Summarization Quality Instrument (PDSQI-9) for human evaluation, our LLM-as-a-Judge framework demonstrated strong inter-rater reliability with human evaluators. GPT-o3-mini achieved the highest intraclass correlation coefficient of 0.818 (95% CI 0.772, 0.854), with a median score difference of 0 from human evaluators, and completed evaluations in just 22 seconds. Overall, reasoning models excelled in inter-rater reliability, particularly on evaluations requiring advanced reasoning and domain expertise, outperforming non-reasoning models, models trained on the task, and multi-agent workflows. Cross-task validation on the Problem Summarization task likewise confirmed high reliability. By automating high-quality evaluations, a medical LLM-as-a-Judge offers a scalable, efficient solution for rapidly identifying accurate and safe AI-generated summaries in healthcare settings.
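To make the LLM-as-a-Judge setup concrete, below is a minimal hypothetical sketch of a rubric-based scoring loop. The abstract does not give the paper's prompts, rubric item names, or parsing logic, so the attribute list, the build of the prompt, and the use of the OpenAI chat completions API with an "o3-mini" model are all illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of an LLM-as-a-Judge scoring loop (not the paper's code).
# Rubric item names below are placeholders for the nine PDSQI-9 attributes.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PDSQI9_ATTRIBUTES = [  # placeholder labels; the real instrument defines nine items
    "cited", "accurate", "thorough", "useful", "organized",
    "comprehensible", "succinct", "synthesized", "stigmatizing",
]

def judge_summary(source_notes: str, summary: str, model: str = "o3-mini") -> dict:
    """Ask a judge model to rate one multi-document summary on each rubric item."""
    prompt = (
        "You are evaluating a clinical multi-document summary.\n"
        f"Source notes:\n{source_notes}\n\nSummary:\n{summary}\n\n"
        "Rate the summary on each attribute from 1 (worst) to 5 (best). "
        f"Attributes: {', '.join(PDSQI9_ATTRIBUTES)}. "
        "Respond with only a JSON object mapping each attribute to an integer score."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    # Assumes the model complies with the JSON-only instruction.
    return json.loads(response.choices[0].message.content)
```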
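The reliability claim rests on the intraclass correlation coefficient (ICC) between judge and human scores. As a rough illustration of how such a figure is computed, the sketch below uses pingouin on toy data in long format; the long-format layout, the toy scores, and the choice of the ICC(2,1) agreement model are assumptions, since the abstract reports only the coefficient and its 95% CI.

```python
# Illustrative ICC computation between an LLM judge and human raters (toy data).
import pandas as pd
import pingouin as pg

# Long format: one row per (summary, rater) pair.
scores = pd.DataFrame({
    "summary_id": [1, 1, 2, 2, 3, 3],
    "rater":      ["human", "llm_judge"] * 3,
    "score":      [4, 4, 3, 2, 5, 5],  # toy rubric-item scores
})

icc = pg.intraclass_corr(
    data=scores, targets="summary_id", raters="rater", ratings="score"
)
# pingouin returns ICC1..ICC3k with 95% CIs; ICC2 (two-way random, absolute
# agreement) is a common choice when raters are treated as a random sample.
print(icc.set_index("Type").loc["ICC2", ["ICC", "CI95%"]])
```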