
ChatGPT (GPT-4) versus doctors on complex cases of the Swedish family medicine specialist examination: an observational comparative study.

Author Information

Arvidsson Rasmus, Gunnarsson Ronny, Entezarjou Artin, Sundemo David, Wikberg Carl

Affiliations

General Practice / Family Medicine, School of Public Health and Community Medicine, Institute of Medicine, Sahlgrenska Academy, University of Gothenburg, Gothenburg, Sweden.

Hälsocentralen Sankt Hans, Praktikertjänst AB, Lund, Sweden.

Publication Information

BMJ Open. 2024 Dec 26;14(12):e086148. doi: 10.1136/bmjopen-2024-086148.

Abstract

BACKGROUND

Recent breakthroughs in artificial intelligence research include the development of generative pretrained transformers (GPT). ChatGPT has been shown to perform well when answering several sets of medical multiple-choice questions. However, it has not been tested for writing free-text assessments of complex cases in primary care.

OBJECTIVES

To compare the performance of ChatGPT, version GPT-4, with that of real doctors on complex cases from the Swedish family medicine specialist examination.

DESIGN AND SETTING

A blinded observational comparative study conducted in the Swedish primary care setting. Responses from GPT-4 and real doctors to cases from the Swedish family medicine specialist examination were scored by blinded reviewers, and the scores were compared.

PARTICIPANTS

Anonymous responses from the Swedish family medicine specialist examination 2017-2022 were used.

OUTCOME MEASURES

Primary: the mean difference in scores between GPT-4's responses and randomly selected responses by human doctors, as well as between GPT-4's responses and top-tier responses by human doctors. Secondary: the correlation between differences in response length and response score; the intraclass correlation coefficient between reviewers; and the percentage of maximum score achieved by each group in different subject categories.
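
One of the secondary outcomes is the intraclass correlation coefficient (ICC) between reviewers. The abstract does not state which ICC form the authors used; the following is a minimal sketch assuming a two-way random-effects, absolute-agreement, single-rater ICC(2,1) in the Shrout–Fleiss sense, applied to a hypothetical matrix of reviewer scores (not the study data).

```python
import numpy as np

def icc2_1(scores: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    scores: (n_responses, n_reviewers) matrix of ratings.
    """
    n, k = scores.shape
    grand = scores.mean()
    row_means = scores.mean(axis=1)  # per-response means
    col_means = scores.mean(axis=0)  # per-reviewer means

    # Two-way ANOVA decomposition of the total sum of squares.
    ss_total = ((scores - grand) ** 2).sum()
    ss_rows = k * ((row_means - grand) ** 2).sum()   # between responses
    ss_cols = n * ((col_means - grand) ** 2).sum()   # between reviewers
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    # Shrout & Fleiss ICC(2,1).
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )

# Hypothetical data: 5 responses scored by 3 blinded reviewers on a 10-point scale.
ratings = np.array([
    [6, 7, 6],
    [4, 5, 4],
    [8, 8, 7],
    [5, 4, 5],
    [7, 7, 8],
], dtype=float)
print(f"ICC(2,1) = {icc2_1(ratings):.2f}")
```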

RESULTS

The mean scores were 6.0, 7.2 and 4.5 for randomly selected doctor responses, top-tier doctor responses and GPT-4 responses, respectively, on a 10-point scale. The scores for the random doctor responses were, on average, 1.6 points higher than those of GPT-4 (p<0.001, 95% CI 0.9 to 2.2) and the top-tier doctor scores were, on average, 2.7 points higher than those of GPT-4 (p<0.001, 95% CI 2.2 to 3.3). Following the release of GPT-4o, the experiment was repeated, although this time with only a single reviewer scoring the answers. In this follow-up, random doctor responses were scored 0.7 points higher than those of GPT-4o (p=0.044).
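
The abstract reports mean score differences with 95% CIs and p-values but does not name the underlying test. A minimal sketch of one plausible approach, a paired t-type analysis of per-case differences (doctor minus GPT-4) with a t-based confidence interval, using illustrative scores rather than the study data:

```python
import numpy as np
from scipy import stats

def mean_diff_with_ci(doctor, gpt, alpha=0.05):
    """Mean paired difference (doctor - GPT) with a t-based CI and p-value."""
    d = np.asarray(doctor, dtype=float) - np.asarray(gpt, dtype=float)
    n = d.size
    mean = d.mean()
    se = d.std(ddof=1) / np.sqrt(n)
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
    _, p = stats.ttest_rel(doctor, gpt)  # paired t-test
    return mean, (mean - t_crit * se, mean + t_crit * se), p

# Illustrative per-case scores on a 10-point scale (not the study data).
doctor_scores = [6.5, 5.0, 7.0, 6.0, 5.5, 6.5, 7.5, 5.0]
gpt4_scores   = [4.5, 4.0, 5.5, 4.0, 4.5, 5.0, 5.5, 3.5]
diff, ci, p = mean_diff_with_ci(doctor_scores, gpt4_scores)
print(f"mean diff = {diff:.1f}, 95% CI {ci[0]:.1f} to {ci[1]:.1f}, p = {p:.3f}")
```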

CONCLUSION

In complex primary care cases, GPT-4 performs worse than human doctors taking the family medicine specialist examination. Future GPT-based chatbots may perform better, but comprehensive evaluations are needed before implementing chatbots for medical decision support in primary care.


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2c76/11683950/54ef97fad1af/bmjopen-14-12-g001.jpg
