Khan Ali A, Khan Ali R, Munshi Saminah, Dandapani Hari, Jimale Mohamed, Bogni Franck M, Khawaja Hussain
Warren Alpert Medical School, Brown University, Providence, Rhode Island, USA.
The University of Texas Medical Branch at Galveston, Galveston, Texas, USA.
J Med Ethics. 2025 Jan 25. doi: 10.1136/jme-2024-110240.
The integration of artificial intelligence (AI) into healthcare introduces innovative possibilities but raises ethical, legal and professional concerns. Assessing how AI performs on core components of the United States Medical Licensing Examination (USMLE), such as communication skills, ethics, empathy and professionalism, is therefore crucial. This study evaluates how well ChatGPT versions 3.5 and 4.0 handle complex medical scenarios drawn from the USMLE-Rx, AMBOSS and UWorld question banks, aiming to understand their ability to navigate patient interactions in line with medical ethics and professional standards.
We compiled 273 questions from AMBOSS, USMLE-Rx and UWorld, focusing on communication, social sciences, healthcare policy and ethics. GPT-3.5 and GPT-4 were tasked with answering and justifying their choices in new chat sessions to minimise model interference. Responses were compared against question bank rationales and average student performance to evaluate AI effectiveness in medical ethical decision-making.
GPT-3.5 answered 38.9% correctly in AMBOSS, 54.1% in USMLE-Rx and 57.4% in UWorld, with rationale accuracy rates of 83.3%, 90.0% and 87.0%, respectively. GPT-4 answered 75.9% correctly in AMBOSS, 64.9% in USMLE-Rx and 79.6% in UWorld, with rationale accuracy rates of 85.4%, 88.9% and 98.8%, respectively. Both versions generally scored below average student performance, except GPT-4 in UWorld.
ChatGPT, particularly version 4.0, shows potential for navigating ethical and interpersonal medical scenarios. However, human reasoning currently surpasses the AI's average performance. Continued development and training of AI systems could improve their proficiency in these critical aspects of healthcare.