

ChatGPT Performs Worse on USMLE-Style Ethics Questions Compared to Medical Knowledge Questions.

Author Information

Danehy Tessa, Hecht Jessica, Kentis Sabrina, Schechter Clyde B, Jariwala Sunit P

Affiliations

Albert Einstein College of Medicine, Montefiore Medical Center, Bronx, New York, United States.

Department of Family and Social Medicine, Albert Einstein College of Medicine, Bronx, New York, United States.

Publication Information

Appl Clin Inform. 2024 Oct;15(5):1049-1055. doi: 10.1055/a-2405-0138. Epub 2024 Aug 29.

Abstract

OBJECTIVES

The main objective of this study is to evaluate the ability of the Large Language Model Chat Generative Pre-Trained Transformer (ChatGPT) to accurately answer the United States Medical Licensing Examination (USMLE) board-style medical ethics questions compared to medical knowledge-based questions. This study has the additional objectives of comparing the overall accuracy of GPT-3.5 to GPT-4 and assessing the variability of responses given by each version.

METHODS

Using AMBOSS, a third-party USMLE Step exam test prep service, we selected one group of 27 medical ethics questions and a second group of 27 medical knowledge questions matched on question difficulty for medical students. We ran 30 trials in which we asked these questions of GPT-3.5 and GPT-4 and recorded the output. A random-effects linear probability regression model evaluated accuracy, and a Shannon entropy calculation evaluated response variation.
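
As an illustration of the response-variation measure, the sketch below (not the authors' code) computes the Shannon entropy of the empirical answer distribution for a single question over repeated trials. The answer letters, trial counts, and use of the natural logarithm are illustrative assumptions; the abstract does not specify the log base or how per-question entropies were aggregated.

```python
# Minimal sketch: Shannon entropy of repeated answer choices for one question.
# All question/answer data below are hypothetical.
from collections import Counter
from math import log

def shannon_entropy(answers):
    """Entropy (natural log) of the empirical answer distribution
    for one question across repeated trials."""
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * log(c / n) for c in counts.values())

# Hypothetical example: one question asked 30 times on each model.
gpt4_answers = ["B"] * 28 + ["C"] * 2                 # nearly always the same choice
gpt35_answers = ["B"] * 18 + ["C"] * 8 + ["D"] * 4    # more scattered across choices

print(round(shannon_entropy(gpt4_answers), 2))   # low entropy -> consistent responses
print(round(shannon_entropy(gpt35_answers), 2))  # higher entropy -> variable responses
```

A question answered identically on every trial yields an entropy of 0, so a lower average entropy corresponds to more consistent answer choices across repeated prompts.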

RESULTS

Both versions of ChatGPT demonstrated worse performance on medical ethics questions compared to medical knowledge questions. GPT-4 performed 18 percentage points (P < 0.05) worse on medical ethics questions than on medical knowledge questions, and GPT-3.5 performed 7 percentage points (P = 0.41) worse. GPT-4 outperformed GPT-3.5 by 22 percentage points (P < 0.001) on medical ethics and 33 percentage points (P < 0.001) on medical knowledge. GPT-4 also exhibited lower overall Shannon entropy than GPT-3.5 for both medical ethics and medical knowledge questions (0.21 and 0.11 vs. 0.59 and 0.55, respectively), indicating lower response variability.

CONCLUSION

Both versions of ChatGPT performed more poorly on medical ethics questions compared to medical knowledge questions. GPT-4 significantly outperformed GPT-3.5 on overall accuracy and exhibited a significantly lower response variability in answer choices. This underscores the need for ongoing assessment of ChatGPT versions for medical education.

