GPT-3人工智能模型的诊断与分诊准确性

The Diagnostic and Triage Accuracy of the GPT-3 Artificial Intelligence Model.

Author Information

Levine David M, Tuwani Rudraksh, Kompa Benjamin, Varma Amita, Finlayson Samuel G, Mehrotra Ateev, Beam Andrew

Affiliations

Division of General Internal Medicine and Primary Care, Brigham and Women's Hospital, Boston, MA, USA.

Harvard Medical School, Boston, MA, USA.

Publication Information

medRxiv. 2023 Feb 1:2023.01.30.23285067. doi: 10.1101/2023.01.30.23285067.

Abstract

IMPORTANCE

Artificial intelligence (AI) applications in health care have been effective in many areas of medicine, but they are often trained for a single task using labeled data, making deployment and generalizability challenging. Whether a general-purpose AI language model can perform diagnosis and triage is unknown.

OBJECTIVE

Compare the general-purpose Generative Pre-trained Transformer 3 (GPT-3) AI model's diagnostic and triage performance to attending physicians and lay adults who use the Internet.

DESIGN

We compared GPT-3's diagnostic and triage accuracy on 48 validated case vignettes of both common (e.g., viral illness) and severe (e.g., heart attack) conditions with that of lay people and practicing physicians. We also examined how well calibrated GPT-3's confidence was for diagnosis and triage.

SETTING AND PARTICIPANTS

The GPT-3 model, a nationally representative sample of lay people, and practicing physicians.

EXPOSURE

Validated case vignettes (<60 words; <6th-grade reading level).

MAIN OUTCOMES AND MEASURES

Correct diagnosis, correct triage.

RESULTS

Among all cases, GPT-3 replied with the correct diagnosis in its top 3 for 88% (95% CI, 75% to 94%) of cases, compared to 54% (95% CI, 53% to 55%) for lay individuals (p<0.001) and 96% (95% CI, 94% to 97%) for physicians (p=0.0354). GPT-3 triaged (71% correct; 95% CI, 57% to 82%) similarly to lay individuals (74%; 95% CI, 73% to 75%; p=0.73); both were significantly worse than physicians (91%; 95% CI, 89% to 93%; p<0.001). As measured by the Brier score, GPT-3 confidence in its top prediction was reasonably well-calibrated for diagnosis (Brier score = 0.18) and triage (Brier score = 0.22).
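The Brier score reported above is the mean squared difference between a model's predicted probability and the binary outcome (1 if the prediction was correct, 0 otherwise), so lower values indicate better calibration. A minimal sketch of the computation (the function name and the example numbers are illustrative, not taken from the study):

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probabilities and
    binary outcomes (1 = correct, 0 = incorrect); lower is better."""
    if len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must be the same length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

# Hypothetical example: confident predictions that match the outcomes
# produce a low (well-calibrated) score.
print(brier_score([0.9, 0.8, 0.3], [1, 1, 0]))
```

A perfectly calibrated, perfectly confident predictor scores 0; always predicting 0.5 scores 0.25, so the reported 0.18 (diagnosis) and 0.22 (triage) indicate modest but real calibration.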

CONCLUSIONS AND RELEVANCE

A general-purpose AI language model without any content-specific training could perform diagnosis at levels close to, but below, physicians and better than lay individuals. The model performed less well on triage, where its performance was closer to that of lay individuals.


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/aadf/9915829/900e90ec15dd/nihpp-2023.01.30.23285067v1-f0001.jpg
