Patel Anshum, Ruoff Chad, Helgeson Scott A, Carvalho Diego Z, Castillo Pablo R, Cheung Joseph
Division of Pulmonary, Allergy and Sleep Medicine, Mayo Clinic, Jacksonville, FL, USA.
Division of Pulmonary and Sleep Medicine, Mayo Clinic, Scottsdale, AZ, USA.
Sleep Med. 2025 Oct;134:106677. doi: 10.1016/j.sleep.2025.106677. Epub 2025 Jul 19.
Artificial intelligence (AI), particularly large language models (LLMs), is increasingly being explored for diagnostic applications in medicine. Leveraging LLMs within clinical systems may augment clinicians' diagnostic reasoning. However, the diagnostic effectiveness of LLMs in sleep medicine has not been evaluated against expert performance on clinical case scenarios.
To compare the diagnostic accuracy of three widely used LLMs and experienced sleep physicians on real-world clinical vignettes.
Sixteen diverse sleep disorder vignettes from the AASM Case Book (2019) were each independently presented to three LLMs (ChatGPT-4, Gemini 2.0, DeepSeek) and three board-certified sleep physicians. Differential diagnoses were compared against the AASM reference lists (mean percentage of matches), and final diagnoses were scored against the AASM final diagnosis on a 3-point Likert scale (0 = no match, 1 = partial match, 2 = full match).
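To make the scoring scheme concrete, the concordance percentage for final diagnoses can be read as the total Likert score over the maximum possible (2 points per vignette). The sketch below uses fabricated scores purely for illustration, not the study's data:

```python
# Illustrative sketch of the 3-point Likert scoring described above.
# The example scores are fabricated, not the study's data.

def concordance_pct(likert_scores):
    """Total score as a percentage of the maximum (2 points per vignette)."""
    return 100.0 * sum(likert_scores) / (2 * len(likert_scores))

# Fabricated example: 16 vignettes, 12 full matches (2) and 4 partial matches (1)
example_scores = [2] * 12 + [1] * 4
print(concordance_pct(example_scores))  # 87.5
```

Under this reading, a mean concordance of 87.5% over 16 cases corresponds to 28 of a possible 32 points.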
Analysis of differential diagnoses showed similar mean agreement percentages for ChatGPT-4 (76.7%), Gemini 2.0 (77.7%), DeepSeek (70.7%), and the physicians' average (72.9%). Repeated-measures ANOVA indicated no statistically significant difference in differential diagnostic accuracy between LLMs and physicians (p = 0.839). For final diagnoses, all three LLMs achieved an identical mean concordance score (87.5%), falling within the performance range of the experienced physicians (81.3%-96.9%), indicating LLM diagnostic proficiency comparable to experts on these case vignettes. Non-parametric Friedman testing showed no statistically significant difference among the individual raters (p = 0.602), and paired t-tests comparing average final-diagnosis scores likewise showed no significant differences (p = 0.606).
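The Friedman test used above is a non-parametric analogue of repeated-measures ANOVA: each vignette is a block, and the raters' scores are ranked within each block. A minimal stdlib-only sketch of the test statistic follows, with fabricated scores and without the tie correction that library implementations such as `scipy.stats.friedmanchisquare` apply:

```python
# Stdlib-only sketch of the Friedman chi-square statistic (no tie correction).
# All scores below are fabricated for illustration, not the study's data.

def ranks(values):
    """Within-block ranks (1-based), averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of tied rank positions i+1 .. j+1
        for t in range(i, j + 1):
            r[order[t]] = avg
        i = j + 1
    return r

def friedman_chi2(blocks):
    """blocks: one score tuple per vignette, one entry per rater."""
    n, k = len(blocks), len(blocks[0])
    col_rank_sums = [0.0] * k
    for block in blocks:
        for j, rk in enumerate(ranks(list(block))):
            col_rank_sums[j] += rk
    return (12.0 / (n * k * (k + 1))) * sum(r * r for r in col_rank_sums) \
        - 3.0 * n * (k + 1)

# Fabricated mini-example: 3 vignettes, each scored by 3 raters
blocks = [(1, 2, 3), (1, 2, 3), (2, 1, 3)]
print(round(friedman_chi2(blocks), 3))  # 4.667
```

A small statistic (relative to the chi-square critical value for k-1 degrees of freedom) is consistent with the null hypothesis of no difference among raters, which matches the study's finding (p = 0.602).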
LLMs showed diagnostic performance comparable to experienced sleep clinicians, suggesting their potential as supplementary tools. Future research should explore broader applications and integration.