Institute for Clinical Chemistry, University Hospital Cologne, Cologne, Germany.
Department of General Surgery, Visceral, Thoracic and Vascular Surgery, University Hospital Greifswald, Greifswald, Germany.
JMIR Med Educ. 2024 Feb 8;10:e50965. doi: 10.2196/50965.
The potential of artificial intelligence (AI)-based large language models, such as ChatGPT, has gained significant attention in the medical field. This enthusiasm is driven not only by recent breakthroughs and improved accessibility, but also by the prospect of democratizing medical knowledge and promoting equitable health care. However, the performance of ChatGPT is substantially influenced by the input language, and given the growing public trust in this AI tool compared to that in traditional sources of information, investigating its medical accuracy across different languages is of particular importance.
This study aimed to compare the performance of GPT-3.5 and GPT-4 with that of medical students on the written German medical licensing examination.
To assess GPT-3.5's and GPT-4's medical proficiency, we used 937 original multiple-choice questions from 3 written German medical licensing examinations in October 2021, April 2022, and October 2022.
GPT-4 achieved an average score of 85% and ranked in the 92.8th, 99.5th, and 92.6th percentiles among medical students who took the same examinations in October 2021, April 2022, and October 2022, respectively. This represents a substantial improvement of 27% compared to GPT-3.5, which only passed 1 out of the 3 examinations. While GPT-3.5 performed well in psychiatry questions, GPT-4 exhibited strengths in internal medicine and surgery but showed weakness in academic research.
The study results highlight ChatGPT's remarkable improvement from moderate (GPT-3.5) to high competency (GPT-4) in answering medical licensing examination questions in German. While its predecessor, GPT-3.5, was imprecise and inconsistent, GPT-4 demonstrates considerable potential to improve medical education and patient care, provided that medically trained users critically evaluate its results. As the replacement of search engines by AI tools seems possible in the future, further studies with nonprofessional questions are needed to assess the safety and accuracy of ChatGPT for the general population.