Suppr 超能文献

ChatGPT Performs Worse on USMLE-Style Ethics Questions Compared to Medical Knowledge Questions.

Authors

Danehy Tessa, Hecht Jessica, Kentis Sabrina, Schechter Clyde B, Jariwala Sunit P

Affiliations

Albert Einstein College of Medicine, Montefiore Medical Center, Bronx, New York, United States.

Department of Family and Social Medicine, Albert Einstein College of Medicine, Bronx, New York, United States.

Publication Information

Appl Clin Inform. 2024 Oct;15(5):1049-1055. doi: 10.1055/a-2405-0138. Epub 2024 Aug 29.

DOI: 10.1055/a-2405-0138
PMID: 39209308
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11617073/
Abstract

OBJECTIVES

The main objective of this study is to evaluate the ability of the Large Language Model Chat Generative Pre-Trained Transformer (ChatGPT) to accurately answer the United States Medical Licensing Examination (USMLE) board-style medical ethics questions compared to medical knowledge-based questions. This study has the additional objectives of comparing the overall accuracy of GPT-3.5 to GPT-4 and assessing the variability of responses given by each version.

METHODS

Using AMBOSS, a third-party USMLE Step Exam test prep service, we selected one group of 27 medical ethics questions and a second group of 27 medical knowledge questions matched on question difficulty for medical students. We ran 30 trials asking these questions on GPT-3.5 and GPT-4 and recorded the output. A random-effects linear probability regression model evaluated accuracy and a Shannon entropy calculation evaluated response variation.
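The response-variation measure described above can be sketched as follows: a minimal illustration of Shannon entropy computed over the answer choices a model gives across repeated trials of one question. The function name and example data are assumptions for illustration; the abstract does not specify how per-question entropies were aggregated into the reported overall values.

```python
import math
from collections import Counter

def shannon_entropy(responses):
    """Shannon entropy (in bits) of the answer-choice distribution
    across repeated trials of the same question. 0 means the model
    gave the same answer every time; higher values mean more
    variability in its responses."""
    counts = Counter(responses)
    total = len(responses)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Hypothetical example: 30 trials of one multiple-choice question.
consistent = ["B"] * 28 + ["C"] * 2   # low entropy (≈0.35 bits)
split = ["B"] * 15 + ["C"] * 15       # maximal two-choice entropy (1.0 bit)
print(shannon_entropy(consistent))
print(shannon_entropy(split))
```

A perfectly consistent model (the same answer on all 30 trials) scores 0 bits, which is why the lower entropies reported for GPT-4 are read as lower response variability.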

RESULTS

Both versions of ChatGPT demonstrated worse performance on medical ethics questions than on medical knowledge questions. GPT-4 performed 18 percentage points (P < 0.05) worse on medical ethics questions than on medical knowledge questions, and GPT-3.5 performed 7 percentage points (P = 0.41) worse. GPT-4 outperformed GPT-3.5 by 22 percentage points (P < 0.001) on medical ethics and 33 percentage points (P < 0.001) on medical knowledge. GPT-4 also exhibited lower overall Shannon entropy for medical ethics and medical knowledge questions (0.21 and 0.11, respectively) than GPT-3.5 (0.59 and 0.55, respectively), indicating lower variability in its responses.

CONCLUSION

Both versions of ChatGPT performed more poorly on medical ethics questions compared to medical knowledge questions. GPT-4 significantly outperformed GPT-3.5 on overall accuracy and exhibited a significantly lower response variability in answer choices. This underscores the need for ongoing assessment of ChatGPT versions for medical education.

Similar Articles

1. Comparative Performance of Medical Students, ChatGPT-3.5 and ChatGPT-4.0 in Answering Questions From a Brazilian National Medical Exam: Cross-Sectional Questionnaire Study.
JMIR AI. 2025 May 8;4:e66552. doi: 10.2196/66552.
2. Performance Assessment of GPT 4.0 on the Japanese Medical Licensing Examination.
Curr Med Sci. 2024 Dec;44(6):1148-1154. doi: 10.1007/s11596-024-2932-9. Epub 2024 Oct 26.
3. Benchmarking Vision Capabilities of Large Language Models in Surgical Examination Questions.
J Surg Educ. 2025 Apr;82(4):103442. doi: 10.1016/j.jsurg.2025.103442. Epub 2025 Feb 9.
4. Performance of Large Language Models in the Non-English Context: Qualitative Study of Models Trained on Different Languages in Chinese Medical Examinations.
JMIR Med Inform. 2025 Jun 27;13:e69485. doi: 10.2196/69485.
5. Artificial Intelligence in Orthopaedics: Performance of ChatGPT on Text and Image Questions on a Complete AAOS Orthopaedic In-Training Examination (OITE).
J Surg Educ. 2024 Nov;81(11):1645-1649. doi: 10.1016/j.jsurg.2024.08.002. Epub 2024 Sep 14.
6. Comparative analysis of LLMs performance in medical embryology: A cross-platform study of ChatGPT, Claude, Gemini, and Copilot.
Anat Sci Educ. 2025 May 11. doi: 10.1002/ase.70044.
7. Comparison of ChatGPT and Internet Research for Clinical Research and Decision-Making in Occupational Medicine: Randomized Controlled Trial.
JMIR Form Res. 2025 May 20;9:e63857. doi: 10.2196/63857.
8. The potential of Generative Pre-trained Transformer 4 (GPT-4) to analyse medical notes in three different languages: a retrospective model-evaluation study.
Lancet Digit Health. 2025 Jan;7(1):e35-e43. doi: 10.1016/S2589-7500(24)00246-2.
9. Comparing ChatGPT-4 and a Paediatric Intensive Care Specialist in Responding to Medical Education Questions: A Multicenter Evaluation.
J Paediatr Child Health. 2025 Jul;61(7):1084-1089. doi: 10.1111/jpc.70080. Epub 2025 May 7.

Cited By

1. Advancements in AI Medical Education: Assessing ChatGPT's Performance on USMLE-Style Questions Across Topics and Difficulty Levels.
Cureus. 2024 Dec 24;16(12):e76309. doi: 10.7759/cureus.76309. eCollection 2024 Dec.
