

The diagnostic and triage accuracy of the GPT-3 artificial intelligence model: an observational study.

Affiliations

Division of General Internal Medicine and Primary Care, Brigham and Women's Hospital, Boston, MA, USA; Harvard Medical School, Boston, MA, USA.

Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA; Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA.

Publication Information

Lancet Digit Health. 2024 Aug;6(8):e555-e561. doi: 10.1016/S2589-7500(24)00097-9.

Abstract

BACKGROUND

Artificial intelligence (AI) applications in health care have been effective in many areas of medicine, but they are often trained for a single task using labelled data, making deployment and generalisability challenging. How well a general-purpose AI language model performs diagnosis and triage relative to physicians and laypeople is not well understood.

METHODS

We compared the diagnostic and triage accuracy of Generative Pre-trained Transformer 3 (GPT-3) on 48 validated synthetic case vignettes (<50 words; sixth-grade reading level or below) covering both common (eg, viral illness) and severe (eg, heart attack) conditions with that of a nationally representative sample of 5000 lay people from the USA, who could use the internet to find the correct answers, and 21 practising physicians at Harvard Medical School. There were 12 vignettes for each of four triage categories: emergent, within 1 day, within 1 week, and self-care. The correct diagnosis and triage category (ie, the ground truth) for each vignette was determined by two general internists at Harvard Medical School. For each vignette, the human respondents and GPT-3 were prompted to list diagnoses in order of likelihood, and the vignette was marked as correct if the ground-truth diagnosis appeared among the top three listed diagnoses. For triage accuracy, we examined whether the triage selected by the human respondents and GPT-3 was exactly correct according to the four triage categories, or matched a dichotomised triage variable (emergent or within 1 day vs within 1 week or self-care). We estimated GPT-3's diagnostic and triage confidence for a given vignette using a modified bootstrap resampling procedure, and examined how well calibrated GPT-3's confidence was by computing calibration curves and Brier scores. We also performed a subgroup analysis by case acuity, and an error analysis of the triage advice to characterise how the model's advice might affect patients using such a tool to decide whether to seek medical care immediately.
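
As a minimal Python sketch of the scoring and confidence-estimation logic described above (the function names, the simplified resampling scheme, and the example data are illustrative assumptions, not the authors' published code):

import random

def top3_correct(ranked_diagnoses, ground_truth):
    # A vignette is scored correct if the ground-truth diagnosis
    # appears among the top three diagnoses listed.
    return ground_truth in ranked_diagnoses[:3]

def bootstrap_confidence(repeated_answers, target, n_resamples=1000, seed=0):
    # Resample the model's repeated answers for one vignette with
    # replacement and report how often the target answer appears;
    # a simplified stand-in for the paper's modified bootstrap procedure.
    rng = random.Random(seed)
    hits = sum(rng.choice(repeated_answers) == target for _ in range(n_resamples))
    return hits / n_resamples

def brier_score(confidences, outcomes):
    # Mean squared difference between predicted confidence and the
    # observed 0/1 outcome; lower values indicate better calibration.
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / len(outcomes)

For example, if GPT-3 returned "myocardial infarction" as its top diagnosis in 9 of 10 repeated queries on the same vignette, bootstrap_confidence would estimate a confidence near 0.9 for that diagnosis.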

FINDINGS

Among all cases, GPT-3 listed the correct diagnosis in its top three for 88% (42/48, 95% CI 75-94) of cases, compared with 54% (2700/5000, 53-55) for lay individuals (p<0.0001) and 96% (637/666, 94-97) for physicians (p=0.012). GPT-3 triaged 70% (34/48, 57-82) of cases correctly, versus 74% (3706/5000, 73-75; p=0.60) for lay individuals and 91% (608/666, 89-93; p<0.0001) for physicians. As measured by the Brier score, GPT-3's confidence in its top prediction was reasonably well calibrated for both diagnosis (Brier score=0.18) and triage (Brier score=0.22). We observed an inverse relationship between case acuity and GPT-3's accuracy (p<0.0001), with a fitted trend line showing an 8.33% decrease in accuracy for each one-level increase in case acuity. In the triage error analysis, GPT-3 deprioritised truly emergent cases in seven instances.
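
For reference, the Brier score used above is the standard mean squared error between predicted confidence and observed binary outcome over the N vignettes:

\mathrm{BS} = \frac{1}{N} \sum_{i=1}^{N} (f_i - o_i)^2

where f_i is GPT-3's confidence in its top prediction for vignette i, and o_i is 1 if that prediction was correct and 0 otherwise. A score of 0 indicates perfect calibration, and always predicting a confidence of 0.5 yields 0.25, so the reported 0.18 (diagnosis) and 0.22 (triage) are consistent with the abstract's description of reasonable calibration.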

INTERPRETATION

A general-purpose AI language model without any content-specific training could perform diagnosis at a level close to, but below, that of physicians, and better than that of lay individuals. GPT-3's triage performance was inferior to that of physicians, sometimes by a large margin, and was closer to that of lay individuals. Although GPT-3's diagnostic performance approached that of physicians, it was significantly better than that of a typical person using a search engine.

FUNDING

The National Heart, Lung, and Blood Institute.

