Institute for Clinical Chemistry, University Hospital Cologne, Cologne, Germany.
Department of General Surgery, Visceral, Thoracic and Vascular Surgery, University Hospital Greifswald, Greifswald, Germany.
JMIR Med Educ. 2024 Feb 8;10:e50965. doi: 10.2196/50965.
The potential of artificial intelligence (AI)-based large language models, such as ChatGPT, has gained significant attention in the medical field. This enthusiasm is driven not only by recent breakthroughs and improved accessibility, but also by the prospect of democratizing medical knowledge and promoting equitable health care. However, the performance of ChatGPT is substantially influenced by the input language, and given the growing public trust in this AI tool compared to that in traditional sources of information, investigating its medical accuracy across different languages is of particular importance.
This study aimed to compare the performance of GPT-3.5 and GPT-4 with that of medical students on the written German medical licensing examination.
To assess GPT-3.5's and GPT-4's medical proficiency, we used 937 original multiple-choice questions from 3 written German medical licensing examinations in October 2021, April 2022, and October 2022.
GPT-4 achieved an average score of 85% and ranked in the 92.8th, 99.5th, and 92.6th percentiles among medical students who took the same examinations in October 2021, April 2022, and October 2022, respectively. This represents a substantial improvement of 27% compared to GPT-3.5, which only passed 1 out of the 3 examinations. While GPT-3.5 performed well in psychiatry questions, GPT-4 exhibited strengths in internal medicine and surgery but showed weakness in academic research.
The study results highlight ChatGPT's remarkable improvement from moderate (GPT-3.5) to high competency (GPT-4) in answering medical licensing examination questions in German. While its predecessor, GPT-3.5, was imprecise and inconsistent, GPT-4 demonstrates considerable potential to improve medical education and patient care, provided that medically trained users critically evaluate its results. As the replacement of search engines by AI tools seems possible in the future, further studies with nonprofessional questions are needed to assess the safety and accuracy of ChatGPT for the general population.