Suppr 超能文献




Comparison of the Performance of GPT-3.5 and GPT-4 With That of Medical Students on the Written German Medical Licensing Examination: Observational Study.

Affiliations

Institute for Clinical Chemistry, University Hospital Cologne, Cologne, Germany.

Department of General Surgery, Visceral, Thoracic and Vascular Surgery, University Hospital Greifswald, Greifswald, Germany.

Publication information

JMIR Med Educ. 2024 Feb 8;10:e50965. doi: 10.2196/50965.

DOI:10.2196/50965
PMID:38329802
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10884900/
Abstract

BACKGROUND

The potential of artificial intelligence (AI)-based large language models, such as ChatGPT, has gained significant attention in the medical field. This enthusiasm is driven not only by recent breakthroughs and improved accessibility, but also by the prospect of democratizing medical knowledge and promoting equitable health care. However, the performance of ChatGPT is substantially influenced by the input language, and given the growing public trust in this AI tool compared to that in traditional sources of information, investigating its medical accuracy across different languages is of particular importance.

OBJECTIVE

This study aimed to compare the performance of GPT-3.5 and GPT-4 with that of medical students on the written German medical licensing examination.

METHODS

To assess GPT-3.5's and GPT-4's medical proficiency, we used 937 original multiple-choice questions from 3 written German medical licensing examinations in October 2021, April 2022, and October 2022.
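The scoring setup described above can be sketched in a few lines: grade a model's multiple-choice answer sheet against an official key. This is an illustrative sketch only, not the authors' actual pipeline; the question IDs and answers below are invented.

```python
# Illustrative sketch (not the study's actual pipeline): grade a model's
# multiple-choice answers against an official answer key.

def grade(answers: dict, key: dict) -> float:
    """Fraction of questions answered correctly; unanswered items count as wrong."""
    correct = sum(1 for q, a in key.items() if answers.get(q) == a)
    return correct / len(key)

# Invented example data: four questions, one answered incorrectly.
key = {"q1": "A", "q2": "C", "q3": "B", "q4": "D"}
model_answers = {"q1": "A", "q2": "C", "q3": "D", "q4": "D"}
print(grade(model_answers, key))  # → 0.75
```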

RESULTS

GPT-4 achieved an average score of 85% and ranked in the 92.8th, 99.5th, and 92.6th percentiles among medical students who took the same examinations in October 2021, April 2022, and October 2022, respectively. This represents a substantial improvement of 27% compared to GPT-3.5, which only passed 1 out of the 3 examinations. While GPT-3.5 performed well in psychiatry questions, GPT-4 exhibited strengths in internal medicine and surgery but showed weakness in academic research.
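A percentile rank like those reported above is simple arithmetic: the percentage of the student cohort scoring strictly below the model. A minimal sketch, using an invented cohort (the study's actual score distributions are not reproduced here):

```python
from bisect import bisect_left

def percentile_rank(score: float, cohort: list) -> float:
    """Percentage of cohort scores strictly below `score`."""
    ranked = sorted(cohort)
    return 100.0 * bisect_left(ranked, score) / len(ranked)

# Invented student scores for illustration.
students = [55, 60, 62, 68, 70, 72, 75, 78, 80, 83]
print(percentile_rank(76, students))  # → 70.0
```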

CONCLUSIONS

The study results highlight ChatGPT's remarkable improvement from moderate (GPT-3.5) to high competency (GPT-4) in answering medical licensing examination questions in German. While GPT-4's predecessor (GPT-3.5) was imprecise and inconsistent, it demonstrates considerable potential to improve medical education and patient care, provided that medically trained users critically evaluate its results. As the replacement of search engines by AI tools seems possible in the future, further studies with nonprofessional questions are needed to assess the safety and accuracy of ChatGPT for the general population.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9dfa/10884900/e0c85bef16cf/mededu_v10i1e50965_fig1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9dfa/10884900/88a3d0634be8/mededu_v10i1e50965_fig2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9dfa/10884900/497347fd91ea/mededu_v10i1e50965_fig3.jpg

Similar articles

1. Comparison of the Performance of GPT-3.5 and GPT-4 With That of Medical Students on the Written German Medical Licensing Examination: Observational Study.
JMIR Med Educ. 2024 Feb 8;10:e50965. doi: 10.2196/50965.
2. Performance of ChatGPT Across Different Versions in Medical Licensing Examinations Worldwide: Systematic Review and Meta-Analysis.
J Med Internet Res. 2024 Jul 25;26:e60807. doi: 10.2196/60807.
3. Performance of ChatGPT-3.5 and GPT-4 in national licensing examinations for medicine, pharmacy, dentistry, and nursing: a systematic review and meta-analysis.
BMC Med Educ. 2024 Sep 16;24(1):1013. doi: 10.1186/s12909-024-05944-8.
4. Exploring the Performance of ChatGPT Versions 3.5, 4, and 4 With Vision in the Chilean Medical Licensing Examination: Observational Study.
JMIR Med Educ. 2024 Apr 29;10:e55048. doi: 10.2196/55048.
5. Accuracy of ChatGPT on Medical Questions in the National Medical Licensing Examination in Japan: Evaluation Study.
JMIR Form Res. 2023 Oct 13;7:e48023. doi: 10.2196/48023.
6. Performance of ChatGPT on the Peruvian National Licensing Medical Examination: Cross-Sectional Study.
JMIR Med Educ. 2023 Sep 28;9:e48039. doi: 10.2196/48039.
7. Influence of Model Evolution and System Roles on ChatGPT's Performance in Chinese Medical Licensing Exams: Comparative Study.
JMIR Med Educ. 2024 Aug 13;10:e52784. doi: 10.2196/52784.
8. Appraisal of ChatGPT's Aptitude for Medical Education: Comparative Analysis With Third-Year Medical Students in a Pulmonology Examination.
JMIR Med Educ. 2024 Jul 23;10:e52818. doi: 10.2196/52818.
9. Comparing the Performance of Popular Large Language Models on the National Board of Medical Examiners Sample Questions.
Cureus. 2024 Mar 11;16(3):e55991. doi: 10.7759/cureus.55991. eCollection 2024 Mar.
10. Performance of ChatGPT on the Chinese Postgraduate Examination for Clinical Medicine: Survey Study.
JMIR Med Educ. 2024 Feb 9;10:e48514. doi: 10.2196/48514.

Cited by

1. Utility of Generative Artificial Intelligence for Japanese Medical Interview Training: Randomized Crossover Pilot Study.
JMIR Med Educ. 2025 Aug 1;11:e77332. doi: 10.2196/77332.
2. Large language models in medical education: a comparative cross-platform evaluation in answering histological questions.
Med Educ Online. 2025 Dec;30(1):2534065. doi: 10.1080/10872981.2025.2534065. Epub 2025 Jul 12.
3. ChatGPT performance in answering medical residency questions in nephrology: a pilot study in Brazil.
J Bras Nefrol. 2025 Oct-Dec;47(4):e20240254. doi: 10.1590/2175-8239-JBN-2024-0254en.
4. Comparing ChatGPT3.5 and Bard recommendations for colonoscopy intervals: Bridging the gap in healthcare settings.
Endosc Int Open. 2025 Jun 17;13:a25865912. doi: 10.1055/a-2586-5912. eCollection 2025.
5. Comparative analysis of ChatGPT 3.5 and ChatGPT 4 obstetric and gynecological knowledge.
Sci Rep. 2025 Jul 1;15(1):21133. doi: 10.1038/s41598-025-08424-1.
6. The PERFORM Study: Artificial Intelligence Versus Human Residents in Cross-Sectional Obstetrics-Gynecology Scenarios Across Languages and Time Constraints.
Mayo Clin Proc Digit Health. 2025 Mar 8;3(2):100206. doi: 10.1016/j.mcpdig.2025.100206. eCollection 2025 Jun.
7. Advancing medical AI: GPT-4 and GPT-4o surpass GPT-3.5 in Taiwanese medical licensing exams.
PLoS One. 2025 Jun 4;20(6):e0324841. doi: 10.1371/journal.pone.0324841. eCollection 2025.
8. The Performance of AI in Dermatology Exams: The Exam Success and Limits of ChatGPT.
J Cosmet Dermatol. 2025 May;24(5):e70244. doi: 10.1111/jocd.70244.
9. Automated extraction of functional biomarkers of verbal and ambulatory ability from multi-institutional clinical notes using large language models.
J Neurodev Disord. 2025 Apr 30;17(1):24. doi: 10.1186/s11689-025-09612-w.
10. Accuracy of Large Language Models When Answering Clinical Research Questions: Systematic Review and Network Meta-Analysis.
J Med Internet Res. 2025 Apr 30;27:e64486. doi: 10.2196/64486.

References

1. ChatGPT makes medicine easy to swallow: an exploratory case study on simplified radiology reports.
Eur Radiol. 2024 May;34(5):2817-2825. doi: 10.1007/s00330-023-10213-1. Epub 2023 Oct 5.
2. ChatGPT Passes German State Examination in Medicine With Picture Questions Omitted.
Dtsch Arztebl Int. 2023 May 30;120(21):373-374. doi: 10.3238/arztebl.m2023.0113.
3. Large language models in medicine.
Nat Med. 2023 Aug;29(8):1930-1940. doi: 10.1038/s41591-023-02448-8. Epub 2023 Jul 17.
4. Variability in Large Language Models' Responses to Medical Licensing and Certification Examinations. Comment on "How Does ChatGPT Perform on the United States Medical Licensing Examination? The Implications of Large Language Models for Medical Education and Knowledge Assessment".
JMIR Med Educ. 2023 Jul 13;9:e48305. doi: 10.2196/48305.
5. Data Science as a Core Competency in Undergraduate Medical Education in the Age of Artificial Intelligence in Health Care.
JMIR Med Educ. 2023 Jul 11;9:e46344. doi: 10.2196/46344.
6. Putting ChatGPT's Medical Advice to the (Turing) Test: Survey Study.
JMIR Med Educ. 2023 Jul 10;9:e46939. doi: 10.2196/46939.
7. Performance of GPT-3.5 and GPT-4 on the Japanese Medical Licensing Examination: Comparison Study.
JMIR Med Educ. 2023 Jun 29;9:e48002. doi: 10.2196/48002.
8. Analysis of large-language model versus human performance for genetics questions.
Eur J Hum Genet. 2024 Apr;32(4):466-468. doi: 10.1038/s41431-023-01396-8. Epub 2023 May 29.
9. Comparing scientific abstracts generated by ChatGPT to real abstracts with detectors and blinded human reviewers.
NPJ Digit Med. 2023 Apr 26;6(1):75. doi: 10.1038/s41746-023-00819-6.
10. Trialling a Large Language Model (ChatGPT) in General Practice With the Applied Knowledge Test: Observational Study Demonstrating Opportunities and Limitations in Primary Care.
JMIR Med Educ. 2023 Apr 21;9:e46599. doi: 10.2196/46599.