Rapidly Benchmarking Large Language Models for Diagnosing Comorbid Patients: Comparative Study Leveraging the LLM-as-a-Judge Method.

Authors

Sarvari Peter, Al-Fagih Zaid

Affiliations

Rhazes AI, First Floor, 85 Great Portland Street, London, W1W 7LT, United Kingdom.

Publication Information

JMIRx Med. 2025 Aug 29;6:e67661. doi: 10.2196/67661.

Abstract

BACKGROUND

On average, 1 in 10 patients die because of a diagnostic error, and medical errors represent the third largest cause of death in the United States. While large language models (LLMs) have been proposed to aid doctors in diagnoses, no research results have been published comparing the diagnostic abilities of many popular LLMs on a large, openly accessible real-patient cohort.

OBJECTIVE

In this study, we set out to compare the diagnostic ability of 18 LLMs from Google, OpenAI, Meta, Mistral, Cohere, and Anthropic, using 3 prompts, 2 temperature settings, and 1000 randomly selected Medical Information Mart for Intensive Care-IV (MIMIC-IV) hospital admissions. We also explore improving the diagnostic hit rate of GPT-4o 05-13 with retrieval-augmented generation (RAG) by utilizing reference ranges provided by the American Board of Internal Medicine.
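To illustrate the kind of retrieval-augmented setup described here, the sketch below shows one minimal way lab values could be checked against stored reference ranges and the flagged results prepended to the diagnostic prompt. This is an assumption-laden sketch, not the authors' implementation: the range values, helper names, and prompt wording are all illustrative.

```python
# Hypothetical sketch of RAG with lab reference ranges. The ranges and
# helper names are illustrative only, not the paper's implementation.
REFERENCE_RANGES = {
    # analyte: (low, high, unit) -- example values only
    "sodium": (136, 145, "mmol/L"),
    "potassium": (3.5, 5.0, "mmol/L"),
    "creatinine": (0.7, 1.3, "mg/dL"),
}

def annotate_labs(labs: dict[str, float]) -> list[str]:
    """Flag each lab value as LOW/HIGH/normal against its reference range."""
    notes = []
    for name, value in labs.items():
        if name not in REFERENCE_RANGES:
            continue
        low, high, unit = REFERENCE_RANGES[name]
        status = "LOW" if value < low else "HIGH" if value > high else "normal"
        notes.append(f"{name}: {value} {unit} ({status}; ref {low}-{high})")
    return notes

def build_prompt(case_text: str, labs: dict[str, float]) -> str:
    """Prepend the retrieved reference-range context to the diagnostic prompt."""
    context = "\n".join(annotate_labs(labs))
    return (
        "Reference-range context:\n" + context +
        "\n\nPatient record:\n" + case_text +
        "\n\nList the most likely diagnoses."
    )
```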

METHODS

We evaluated the diagnostic ability of 21 LLMs, using an LLM-as-a-judge approach (an automated, LLM-based evaluation) on MIMIC-IV patient records, which contain final diagnostic codes. For each case, a separate assessor LLM ("judge") compared the predictor LLM's diagnostic output to the true diagnoses from the patient record. The assessor determined whether each true diagnosis was inferable from the available data and, if so, whether it was correctly predicted ("hit") or not ("miss"). Diagnoses not inferable from the patient record were excluded from the hit rate analysis. The reported hit rate was defined as the number of hits divided by the total number of hits and misses. The statistical significance of the differences in model performance was assessed using a pooled z-test for proportions.
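To make the scoring concrete, here is a minimal sketch of the hit-rate computation as described above. The ask_judge stub and its verdict labels are hypothetical stand-ins; the paper's actual evaluation harness is not reproduced here.

```python
# Minimal sketch of the LLM-as-a-judge hit-rate computation described above.
# `ask_judge` is a hypothetical stand-in for a call to the assessor LLM.

def ask_judge(prediction: str, true_diagnosis: str, record: str) -> str:
    """Ask the assessor LLM to label one true diagnosis.

    Expected labels: 'hit' (correctly predicted), 'miss' (inferable but
    not predicted), or 'not_inferable' (excluded from the hit rate).
    """
    raise NotImplementedError("wire this to your judge model's API")

def hit_rate(cases: list[dict]) -> float:
    """hits / (hits + misses); 'not_inferable' verdicts are excluded."""
    hits = misses = 0
    for case in cases:
        for dx in case["true_diagnoses"]:
            verdict = ask_judge(case["prediction"], dx, case["record"])
            if verdict == "hit":
                hits += 1
            elif verdict == "miss":
                misses += 1
            # 'not_inferable' contributes to neither count
    return hits / (hits + misses)
```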

RESULTS

Gemini 2.5 was the top performer with a hit rate of 97.4% (95% CI 97.0%-97.8%) as assessed by GPT-4.1, significantly outperforming GPT-4.1, Claude-4 Opus, and Claude Sonnet. However, GPT-4.1 ranked the highest in a separate set of experiments evaluated by GPT-4 Turbo, which tended to be less conservative than GPT-4.1 in its assessments. Significant variation in diagnostic hit rates was observed across different prompts, while changes in temperature generally had little effect. Finally, RAG significantly improved the hit rate of GPT-4o 05-13 by an average of 0.8% (P<.006).
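For reference, the two-proportion pooled z-test named in the Methods has a standard form: z = (p̂₁ − p̂₂) / √(p̂(1 − p̂)(1/n₁ + 1/n₂)), where p̂ is the pooled proportion. The sketch below implements it; the counts in the usage line are made up for illustration, not the paper's data.

```python
from math import sqrt
from statistics import NormalDist

def pooled_z_test(hits1: int, n1: int, hits2: int, n2: int) -> tuple[float, float]:
    """Two-proportion pooled z-test: returns (z, two-sided P value)."""
    p1, p2 = hits1 / n1, hits2 / n2
    p = (hits1 + hits2) / (n1 + n2)            # pooled proportion
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))  # pooled standard error
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Illustrative counts only (not from the paper):
z, p = pooled_z_test(hits1=9740, n1=10000, hits2=9650, n2=10000)
print(f"z = {z:.2f}, P = {p:.4f}")
```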

CONCLUSIONS

While the results are promising, more diverse datasets and hospital pilots, as well as close collaborations with physicians, are needed to obtain a better understanding of the diagnostic abilities of these models.
