Khan Ali A, Khan Ali R, Munshi Saminah, Dandapani Hari, Jimale Mohamed, Bogni Franck M, Khawaja Hussain
Warren Alpert Medical School, Brown University, Providence, Rhode Island, USA.
The University of Texas Medical Branch at Galveston, Galveston, Texas, USA.
J Med Ethics. 2025 Jan 25. doi: 10.1136/jme-2024-110240.
The integration of artificial intelligence (AI) into healthcare introduces innovative possibilities but raises ethical, legal and professional concerns. Assessing how AI performs on core components of the United States Medical Licensing Examination (USMLE), such as communication skills, ethics, empathy and professionalism, is therefore crucial. This study evaluates how well ChatGPT versions 3.5 and 4.0 handle complex medical scenarios drawn from the USMLE-Rx, AMBOSS and UWorld question banks, aiming to understand their ability to navigate patient interactions in line with medical ethics and professional standards.
We compiled 273 questions from AMBOSS, USMLE-Rx and UWorld, focusing on communication, social sciences, healthcare policy and ethics. GPT-3.5 and GPT-4 were tasked with answering and justifying their choices in new chat sessions to minimise model interference. Responses were compared against question bank rationales and average student performance to evaluate AI effectiveness in medical ethical decision-making.
GPT-3.5 answered 38.9% correctly in AMBOSS, 54.1% in USMLE-Rx and 57.4% in UWorld, with rationale accuracy rates of 83.3%, 90.0% and 87.0%, respectively. GPT-4 answered 75.9% correctly in AMBOSS, 64.9% in USMLE-Rx and 79.6% in UWorld, with rationale accuracy rates of 85.4%, 88.9% and 98.8%, respectively. Both versions generally scored below average student performance, except GPT-4 in UWorld.
ChatGPT, particularly version 4.0, shows potential for navigating ethical and interpersonal medical scenarios. However, human reasoning currently surpasses the AI's average performance. Continued development and training of AI systems could improve their proficiency in these critical aspects of healthcare.