
ChatGPT (GPT-4) versus doctors on complex cases of the Swedish family medicine specialist examination: an observational comparative study.

Author Information

Arvidsson Rasmus, Gunnarsson Ronny, Entezarjou Artin, Sundemo David, Wikberg Carl

Affiliations

General Practice / Family Medicine, School of Public Health and Community Medicine, Institute of Medicine, Sahlgrenska Academy, University of Gothenburg, Gothenburg, Sweden.

Hälsocentralen Sankt Hans, Praktikertjänst AB, Lund, Sweden.

Publication Information

BMJ Open. 2024 Dec 26;14(12):e086148. doi: 10.1136/bmjopen-2024-086148.

Abstract

BACKGROUND

Recent breakthroughs in artificial intelligence research include the development of generative pretrained transformers (GPT). ChatGPT has been shown to perform well when answering several sets of medical multiple-choice questions. However, it has not been tested for writing free-text assessments of complex cases in primary care.

OBJECTIVES

To compare the performance of ChatGPT, version GPT-4, with that of real doctors on complex cases from the Swedish family medicine specialist examination.

DESIGN AND SETTING

A blinded observational comparative study conducted in the Swedish primary care setting. Responses from GPT-4 and real doctors to cases from the Swedish family medicine specialist examination were scored by blinded reviewers, and the scores were compared.

PARTICIPANTS

Anonymous responses from the Swedish family medicine specialist examination 2017-2022 were used.

OUTCOME MEASURES

Primary: the mean difference in scores between GPT-4's responses and randomly selected responses by human doctors, as well as between GPT-4's responses and top-tier responses by human doctors. Secondary: the correlation between differences in response length and response score; the intraclass correlation coefficient between reviewers; and the percentage of maximum score achieved by each group in different subject categories.
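
One of the secondary outcomes is the intraclass correlation coefficient (ICC) between reviewers. The abstract does not state which ICC form the authors used; the following is a minimal sketch assuming a two-way random-effects, absolute-agreement, single-rater ICC(2,1) in the Shrout–Fleiss sense, applied to a hypothetical matrix of reviewer scores (not the study data).

```python
import numpy as np

def icc2_1(scores: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    scores: (n_responses, n_reviewers) matrix of ratings.
    """
    n, k = scores.shape
    grand = scores.mean()
    row_means = scores.mean(axis=1)  # per-response means
    col_means = scores.mean(axis=0)  # per-reviewer means

    # Two-way ANOVA decomposition of the total sum of squares.
    ss_total = ((scores - grand) ** 2).sum()
    ss_rows = k * ((row_means - grand) ** 2).sum()   # between responses
    ss_cols = n * ((col_means - grand) ** 2).sum()   # between reviewers
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    # Shrout & Fleiss ICC(2,1).
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )

# Hypothetical data: 5 responses scored by 3 blinded reviewers on a 10-point scale.
ratings = np.array([
    [6, 7, 6],
    [4, 5, 4],
    [8, 8, 7],
    [5, 4, 5],
    [7, 7, 8],
], dtype=float)
print(f"ICC(2,1) = {icc2_1(ratings):.2f}")
```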

RESULTS

The mean scores were 6.0, 7.2 and 4.5 for randomly selected doctor responses, top-tier doctor responses and GPT-4 responses, respectively, on a 10-point scale. The scores for the random doctor responses were, on average, 1.6 points higher than those of GPT-4 (p<0.001, 95% CI 0.9 to 2.2) and the top-tier doctor scores were, on average, 2.7 points higher than those of GPT-4 (p<0.001, 95% CI 2.2 to 3.3). Following the release of GPT-4o, the experiment was repeated, although this time with only a single reviewer scoring the answers. In this follow-up, random doctor responses were scored 0.7 points higher than those of GPT-4o (p=0.044).
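
The abstract reports mean score differences with 95% CIs and p-values but does not name the underlying test. A minimal sketch of one plausible approach, a paired t-type analysis of per-case differences (doctor minus GPT-4) with a t-based confidence interval, using illustrative scores rather than the study data:

```python
import numpy as np
from scipy import stats

def mean_diff_with_ci(doctor, gpt, alpha=0.05):
    """Mean paired difference (doctor - GPT) with a t-based CI and p-value."""
    d = np.asarray(doctor, dtype=float) - np.asarray(gpt, dtype=float)
    n = d.size
    mean = d.mean()
    se = d.std(ddof=1) / np.sqrt(n)
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
    _, p = stats.ttest_rel(doctor, gpt)  # paired t-test
    return mean, (mean - t_crit * se, mean + t_crit * se), p

# Illustrative per-case scores on a 10-point scale (not the study data).
doctor_scores = [6.5, 5.0, 7.0, 6.0, 5.5, 6.5, 7.5, 5.0]
gpt4_scores   = [4.5, 4.0, 5.5, 4.0, 4.5, 5.0, 5.5, 3.5]
diff, ci, p = mean_diff_with_ci(doctor_scores, gpt4_scores)
print(f"mean diff = {diff:.1f}, 95% CI {ci[0]:.1f} to {ci[1]:.1f}, p = {p:.3f}")
```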

CONCLUSION

In complex primary care cases, GPT-4 performs worse than human doctors taking the family medicine specialist examination. Future GPT-based chatbots may perform better, but comprehensive evaluations are needed before implementing chatbots for medical decision support in primary care.


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2c76/11683950/54ef97fad1af/bmjopen-14-12-g001.jpg
