
Suppr 超能文献




The diagnostic and triage accuracy of the GPT-3 artificial intelligence model: an observational study.

Affiliations

Division of General Internal Medicine and Primary Care, Brigham and Women's Hospital, Boston, MA, USA; Harvard Medical School, Boston, MA, USA.

Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA; Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA.

Publication Information

Lancet Digit Health. 2024 Aug;6(8):e555-e561. doi: 10.1016/S2589-7500(24)00097-9.

DOI: 10.1016/S2589-7500(24)00097-9
PMID: 39059888
Abstract

BACKGROUND

Artificial intelligence (AI) applications in health care have been effective in many areas of medicine, but they are often trained for a single task using labelled data, making deployment and generalisability challenging. How well a general-purpose AI language model performs diagnosis and triage relative to physicians and laypeople is not well understood.

METHODS

We compared the predictive accuracy of Generative Pre-trained Transformer 3 (GPT-3) in diagnosis and triage on 48 validated synthetic case vignettes (<50 words; sixth-grade reading level or below) of both common (eg, viral illness) and severe (eg, heart attack) conditions with that of a nationally representative sample of 5000 lay people from the USA, who could use the internet to find the correct options, and 21 practising physicians at Harvard Medical School. There were 12 vignettes for each of four triage categories: emergent, within 1 day, within 1 week, and self-care. The correct diagnosis and triage category (ie, ground truth) for each vignette was determined by two general internists at Harvard Medical School. For each vignette, human respondents and GPT-3 were prompted to list diagnoses in order of likelihood, and the vignette was marked as correct if the ground-truth diagnosis was among the top three listed diagnoses. For triage accuracy, we examined whether the respondents' and GPT-3's selected triage category was exactly correct across the four categories, or matched a dichotomised triage variable (emergent or within 1 day vs within 1 week or self-care). We estimated GPT-3's diagnostic and triage confidence on a given vignette using a modified bootstrap resampling procedure, and assessed how well calibrated that confidence was by computing calibration curves and Brier scores. We also performed a subgroup analysis by case acuity, and an error analysis of triage advice to characterise how the model's advice might affect patients using such a tool to decide whether to seek medical care immediately.
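The scoring rules described above — top-3 diagnostic accuracy, the dichotomised triage match, and Brier-score calibration — can be sketched in a few lines of Python. This is an illustrative sketch with hypothetical data, not the study's code:

```python
def top3_correct(ranked_diagnoses, ground_truth):
    # A vignette counts as correct if the ground-truth diagnosis
    # appears among the top three listed diagnoses.
    return ground_truth in ranked_diagnoses[:3]

URGENT = {"emergent", "within 1 day"}

def dichotomised_match(predicted, truth):
    # Collapse the four triage categories into urgent (emergent or
    # within 1 day) vs non-urgent (within 1 week or self-care).
    return (predicted in URGENT) == (truth in URGENT)

def brier_score(confidences, outcomes):
    # Mean squared difference between predicted confidence and the
    # binary outcome (1 = correct, 0 = incorrect); lower means better calibrated.
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / len(confidences)

# Hypothetical vignette: ground truth ranked third, so the top-3 rule scores it correct.
ranked = ["influenza", "common cold", "viral illness", "pneumonia"]
print(top3_correct(ranked, "viral illness"))           # True

# "within 1 day" vs "emergent": wrong exactly, correct after dichotomisation.
print(dichotomised_match("within 1 day", "emergent"))  # True

# Hypothetical confidences over four vignettes.
print(round(brier_score([0.9, 0.8, 0.6, 0.3], [1, 1, 0, 0]), 3))  # 0.125
```

The dichotomised variable captures the clinically important distinction (seek care promptly vs not), which is why the study reports it alongside exact four-category accuracy.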

FINDINGS

Among all cases, GPT-3 gave the correct diagnosis in its top three for 88% (42/48, 95% CI 75-94) of cases, compared with 54% (2700/5000, 53-55) for lay individuals (p<0.0001) and 96% (637/666, 94-97) for physicians (p=0.012). GPT-3 triaged correctly in 70% of cases (34/48, 57-82) versus 74% (3706/5000, 73-75; p=0.60) for lay individuals and 91% (608/666, 89-93; p<0.0001) for physicians. As measured by the Brier score, GPT-3's confidence in its top prediction was reasonably well calibrated for both diagnosis (Brier score=0.18) and triage (Brier score=0.22). We observed an inverse relationship between case acuity and GPT-3 accuracy (p<0.0001), with a fitted trend line showing an 8.33% decrease in accuracy for each level of increase in case acuity. In the triage error analysis, GPT-3 deprioritised truly emergent cases in seven instances.

INTERPRETATION

A general-purpose AI language model without any content-specific training could perform diagnosis at levels close to, but below, those of physicians, and better than lay individuals. We found that GPT-3's performance was inferior to physicians for triage, sometimes by a large margin, and was closer to that of lay individuals. Although the diagnostic performance of GPT-3 was comparable to that of physicians, it was significantly better than that of a typical person using a search engine.

FUNDING

The National Heart, Lung, and Blood Institute.


Similar Articles

1. The diagnostic and triage accuracy of the GPT-3 artificial intelligence model: an observational study.
   Lancet Digit Health. 2024 Aug;6(8):e555-e561. doi: 10.1016/S2589-7500(24)00097-9.
2. Comparison of Two Modern Survival Prediction Tools, SORG-MLA and METSSS, in Patients With Symptomatic Long-bone Metastases Who Underwent Local Treatment With Surgery Followed by Radiotherapy and With Radiotherapy Alone.
   Clin Orthop Relat Res. 2024 Dec 1;482(12):2193-2208. doi: 10.1097/CORR.0000000000003185. Epub 2024 Jul 23.
3. Signs and symptoms to determine if a patient presenting in primary care or hospital outpatient settings has COVID-19.
   Cochrane Database Syst Rev. 2022 May 20;5(5):CD013665. doi: 10.1002/14651858.CD013665.pub3.
4. Are Current Survival Prediction Tools Useful When Treating Subsequent Skeletal-related Events From Bone Metastases?
   Clin Orthop Relat Res. 2024 Sep 1;482(9):1710-1721. doi: 10.1097/CORR.0000000000003030. Epub 2024 Mar 22.
5. Sexual Harassment and Prevention Training
6. Variation within and between digital pathology and light microscopy for the diagnosis of histopathology slides: blinded crossover comparison study.
   Health Technol Assess. 2025 Jul;29(30):1-75. doi: 10.3310/SPLK4325.
7. 123I-MIBG scintigraphy and 18F-FDG-PET imaging for diagnosing neuroblastoma.
   Cochrane Database Syst Rev. 2015 Sep 29;2015(9):CD009263. doi: 10.1002/14651858.CD009263.pub2.
8. Sertindole for schizophrenia.
   Cochrane Database Syst Rev. 2005 Jul 20;2005(3):CD001715. doi: 10.1002/14651858.CD001715.pub2.
9. Does the Presence of Missing Data Affect the Performance of the SORG Machine-learning Algorithm for Patients With Spinal Metastasis? Development of an Internet Application Algorithm.
   Clin Orthop Relat Res. 2024 Jan 1;482(1):143-157. doi: 10.1097/CORR.0000000000002706. Epub 2023 Jun 12.
10. The potential of Generative Pre-trained Transformer 4 (GPT-4) to analyse medical notes in three different languages: a retrospective model-evaluation study.
    Lancet Digit Health. 2025 Jan;7(1):e35-e43. doi: 10.1016/S2589-7500(24)00246-2.

Cited By

1. Artificial General Intelligence and Its Threat to Public Health.
   J Eval Clin Pract. 2025 Sep;31(6):e70269. doi: 10.1111/jep.70269.
2. Token Probabilities to Mitigate Large Language Models Overconfidence in Answering Medical Questions: Quantitative Study.
   J Med Internet Res. 2025 Aug 29;27:e64348. doi: 10.2196/64348.
3. The impact of prompting on ChatGPT's adherence to status epilepticus treatment guidelines.
   Sci Rep. 2025 Aug 28;15(1):31712. doi: 10.1038/s41598-025-16902-9.
4. Large Language Models for Psychiatric Phenotype Extraction from Electronic Health Records.
   medRxiv. 2025 Aug 12:2025.08.07.25333172. doi: 10.1101/2025.08.07.25333172.
5. Assessing DeepSeek-R1 for Clinical Decision Support in Multidisciplinary Laboratory Medicine.
   J Multidiscip Healthc. 2025 Aug 12;18:4979-4988. doi: 10.2147/JMDH.S538253. eCollection 2025.
6. GPT-based prediction of short-term survival following decompressive hemicraniectomy in malignant middle cerebral artery infarction.
   Front Neurol. 2025 Jul 24;16:1603536. doi: 10.3389/fneur.2025.1603536. eCollection 2025.
7. Performance of large language models in the differential diagnosis of benign and malignant biliary stricture.
   Front Oncol. 2025 Jul 3;15:1613818. doi: 10.3389/fonc.2025.1613818. eCollection 2025.
8. Potential to perpetuate social biases in health care by Chinese large language models: a model evaluation study.
   Int J Equity Health. 2025 Jul 15;24(1):206. doi: 10.1186/s12939-025-02581-5.
9. Evaluation of ChatGPT's performance in providing treatment recommendations for pediatric diseases.
   Pediatr Discov. 2023 Nov 20;1(3):e42. doi: 10.1002/pdi3.42. eCollection 2023 Dec.
10. Large Language Models in Medical Diagnostics: Scoping Review With Bibliometric Analysis.
    J Med Internet Res. 2025 Jun 9;27:e72062. doi: 10.2196/72062.