
Suppr 超能文献



Influence of Model Evolution and System Roles on ChatGPT's Performance in Chinese Medical Licensing Exams: Comparative Study.

Author Affiliations

Department of Ophthalmology, Henan Eye Hospital, Henan Provincial People's Hospital, Zhengzhou, China.

Eye Institute, Henan Academy of Innovations in Medical Science, Zhengzhou, China.

Publication Info

JMIR Med Educ. 2024 Aug 13;10:e52784. doi: 10.2196/52784.

DOI: 10.2196/52784
PMID: 39140269
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11336778/
Abstract

BACKGROUND

With the increasing application of large language models such as ChatGPT across industries, their potential in the medical domain, especially in standardized examinations, has become a focal point of research.

OBJECTIVE

The aim of this study is to assess the clinical performance of ChatGPT, focusing on its accuracy and reliability in the Chinese National Medical Licensing Examination (CNMLE).

METHODS

The CNMLE 2022 question set, consisting of 500 single-answer multiple-choice questions, was reclassified into 15 medical subspecialties. Each question was tested 8 to 12 times in Chinese on the OpenAI platform from April 24 to May 15, 2023. Three key factors were considered: the model version (GPT-3.5 vs GPT-4.0), whether the prompt designated a system role tailored to the medical subspecialty, and repetition to assess coherence. The passing accuracy threshold was set at 60%. χ² tests and κ values were used to evaluate the model's accuracy and consistency.
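The evaluation pipeline described above (a 60% passing threshold, a χ² test comparing model accuracies, and Cohen's κ for consistency of repeated responses) can be sketched as follows. This is an illustrative sketch only, not the authors' analysis code; the counts in the usage example are hypothetical, chosen merely to resemble the reported accuracy rates.

```python
def accuracy(correct: int, total: int) -> float:
    """Fraction of questions answered correctly."""
    return correct / total

def chi2_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]],
    e.g. rows = models, columns = correct/incorrect (no continuity correction)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def cohen_kappa(p_observed: float, p_chance: float) -> float:
    """Cohen's kappa: agreement of repeated responses beyond chance level."""
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical counts for illustration only (500 questions per model):
PASS_THRESHOLD = 0.60
acc_gpt4 = accuracy(364, 500)          # passes the 60% threshold
stat = chi2_2x2(364, 136, 270, 230)    # GPT-4 vs GPT-3.5, correct/incorrect
# 3.84 is the chi-square critical value at df=1, alpha=.05
print(acc_gpt4 >= PASS_THRESHOLD, stat > 3.84)
```

A dedicated statistics library (e.g. `scipy.stats.chi2_contingency`) would normally be used in practice; the hand-rolled functions here just make the arithmetic behind each reported comparison explicit.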

RESULTS

GPT-4.0 achieved a passing accuracy of 72.7%, which was significantly higher than that of GPT-3.5 (54%; P<.001). The variability rate of repeated responses from GPT-4.0 was lower than that of GPT-3.5 (9% vs 19.5%; P<.001). However, both models showed relatively good response coherence, with κ values of 0.778 and 0.610, respectively. System roles numerically increased accuracy for both GPT-4.0 (0.3%-3.7%) and GPT-3.5 (1.3%-4.5%), and reduced variability by 1.7% and 1.8%, respectively (P>.05). In subgroup analysis, ChatGPT achieved comparable accuracy among different question types (P>.05). GPT-4.0 surpassed the accuracy threshold in 14 of 15 subspecialties, while GPT-3.5 did so in 7 of 15 on the first response.

CONCLUSIONS

GPT-4.0 passed the CNMLE and outperformed GPT-3.5 in key areas such as accuracy, consistency, and medical subspecialty expertise. Adding a system role produced a small, statistically nonsignificant improvement in the model's reliability and answer coherence. GPT-4.0 showed promising potential in medical education and clinical practice, meriting further study.


Figures:
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8bcf/11336778/5486eef78ae1/mededu-v10-e52784-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8bcf/11336778/7040711c7c31/mededu-v10-e52784-g002.jpg

Similar Articles

1. Influence of Model Evolution and System Roles on ChatGPT's Performance in Chinese Medical Licensing Exams: Comparative Study. JMIR Med Educ. 2024 Aug 13;10:e52784. doi: 10.2196/52784.
2. Performance of ChatGPT Across Different Versions in Medical Licensing Examinations Worldwide: Systematic Review and Meta-Analysis. J Med Internet Res. 2024 Jul 25;26:e60807. doi: 10.2196/60807.
3. Performance and exploration of ChatGPT in medical examination, records and education in Chinese: Pave the way for medical AI. Int J Med Inform. 2023 Sep;177:105173. doi: 10.1016/j.ijmedinf.2023.105173. Epub 2023 Aug 4.
4. Performance of ChatGPT-3.5 and GPT-4 in national licensing examinations for medicine, pharmacy, dentistry, and nursing: a systematic review and meta-analysis. BMC Med Educ. 2024 Sep 16;24(1):1013. doi: 10.1186/s12909-024-05944-8.
5. How Does ChatGPT Perform on the United States Medical Licensing Examination (USMLE)? The Implications of Large Language Models for Medical Education and Knowledge Assessment. JMIR Med Educ. 2023 Feb 8;9:e45312. doi: 10.2196/45312.
6. Exploring the Performance of ChatGPT Versions 3.5, 4, and 4 With Vision in the Chilean Medical Licensing Examination: Observational Study. JMIR Med Educ. 2024 Apr 29;10:e55048. doi: 10.2196/55048.
7. Accuracy of ChatGPT on Medical Questions in the National Medical Licensing Examination in Japan: Evaluation Study. JMIR Form Res. 2023 Oct 13;7:e48023. doi: 10.2196/48023.
8. Capability of GPT-4V(ision) in the Japanese National Medical Licensing Examination: Evaluation Study. JMIR Med Educ. 2024 Mar 12;10:e54393. doi: 10.2196/54393.
9. Assessing question characteristic influences on ChatGPT's performance and response-explanation consistency: Insights from Taiwan's Nursing Licensing Exam. Int J Nurs Stud. 2024 May;153:104717. doi: 10.1016/j.ijnurstu.2024.104717. Epub 2024 Feb 8.
10. Performance of ChatGPT on Nursing Licensure Examinations in the United States and China: Cross-Sectional Study. JMIR Med Educ. 2024 Oct 3;10:e52746. doi: 10.2196/52746.

Cited By

1. Postoperative complication management: How do large language models measure up to human expertise? PLOS Digit Health. 2025 Aug 1;4(8):e0000933. doi: 10.1371/journal.pdig.0000933. eCollection 2025 Aug.
2. Performance of DeepSeek-R1 and ChatGPT-4o on the Chinese National Medical Licensing Examination: A Comparative Study. J Med Syst. 2025 Jun 3;49(1):74. doi: 10.1007/s10916-025-02213-z.
3. A Comparative Analysis of GPT-4o and ERNIE Bot in a Chinese Radiation Oncology Exam. J Cancer Educ. 2025 May 26. doi: 10.1007/s13187-025-02652-9.
4. Accuracy of Large Language Models When Answering Clinical Research Questions: Systematic Review and Network Meta-Analysis. J Med Internet Res. 2025 Apr 30;27:e64486. doi: 10.2196/64486.

References

1. Large language models in health care: Development, applications, and challenges. Health Care Sci. 2023 Jul 24;2(4):255-263. doi: 10.1002/hcs2.61. eCollection 2023 Aug.
2. Assessing question characteristic influences on ChatGPT's performance and response-explanation consistency: Insights from Taiwan's Nursing Licensing Exam. Int J Nurs Stud. 2024 May;153:104717. doi: 10.1016/j.ijnurstu.2024.104717. Epub 2024 Feb 8.
3. Evaluating ChatGPT Performance on the Orthopaedic In-Training Examination. JB JS Open Access. 2023 Sep 8;8(3). doi: 10.2106/JBJS.OA.23.00056. eCollection 2023 Jul-Sep.
4. GPT-4 Artificial Intelligence Model Outperforms ChatGPT, Medical Students, and Neurosurgery Residents on Neurosurgery Written Board-Like Questions. World Neurosurg. 2023 Nov;179:e160-e165. doi: 10.1016/j.wneu.2023.08.042. Epub 2023 Aug 18.
5. ChatGPT Performs on the Chinese National Medical Licensing Examination. J Med Syst. 2023 Aug 15;47(1):86. doi: 10.1007/s10916-023-01961-0.
6. Performance of ChatGPT and GPT-4 on Neurosurgery Written Board Examinations. Neurosurgery. 2023 Dec 1;93(6):1353-1365. doi: 10.1227/neu.0000000000002632. Epub 2023 Aug 15.
7. ChatGPT performance in the medical specialty exam: An observational study. Medicine (Baltimore). 2023 Aug 11;102(32):e34673. doi: 10.1097/MD.0000000000034673.
8. ChatGPT-A double-edged sword for healthcare education? Implications for assessments of dental students. Eur J Dent Educ. 2024 Feb;28(1):206-211. doi: 10.1111/eje.12937. Epub 2023 Aug 7.
9. Can ChatGPT pass the thoracic surgery exam? Am J Med Sci. 2023 Oct;366(4):291-295. doi: 10.1016/j.amjms.2023.08.001. Epub 2023 Aug 6.
10. Performance and exploration of ChatGPT in medical examination, records and education in Chinese: Pave the way for medical AI. Int J Med Inform. 2023 Sep;177:105173. doi: 10.1016/j.ijmedinf.2023.105173. Epub 2023 Aug 4.