

Influence of Model Evolution and System Roles on ChatGPT's Performance in Chinese Medical Licensing Exams: Comparative Study.

Affiliations

Department of Ophthalmology, Henan Eye Hospital, Henan Provincial People's Hospital, Zhengzhou, China.

Eye Institute, Henan Academy of Innovations in Medical Science, Zhengzhou, China.

Publication Information

JMIR Med Educ. 2024 Aug 13;10:e52784. doi: 10.2196/52784.

Abstract

BACKGROUND

With the increasing application of large language models such as ChatGPT across various industries, their potential in the medical domain, especially in standardized examinations, has become a focal point of research.

OBJECTIVE

The aim of this study is to assess the clinical performance of ChatGPT, focusing on its accuracy and reliability in the Chinese National Medical Licensing Examination (CNMLE).

METHODS

The CNMLE 2022 question set, consisting of 500 single-answer multiple-choice questions, was reclassified into 15 medical subspecialties. Each question was tested 8 to 12 times in Chinese on the OpenAI platform from April 24 to May 15, 2023. Three key factors were considered: the model version (GPT-3.5 vs GPT-4.0), the prompt's designation of system roles tailored to medical subspecialties, and repetition to assess coherence. The passing accuracy threshold was set at 60%. χ² tests and κ values were used to evaluate the model's accuracy and consistency.
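The study attached subspecialty-tailored system roles to the prompt when querying the OpenAI platform. The exact role wording is not given in the abstract, so the sketch below is a minimal illustration of how such a request could be assembled in the OpenAI chat-message format; the `build_messages` helper and the role text are assumptions, not the authors' actual prompts.

```python
from typing import Optional


def build_messages(question: str, subspecialty: Optional[str] = None) -> list:
    """Assemble a chat request: an optional subspecialty system role plus the exam question."""
    messages = []
    if subspecialty:
        # Hypothetical system-role wording tailored to a medical subspecialty.
        messages.append({
            "role": "system",
            "content": (
                f"You are an experienced physician specializing in {subspecialty}. "
                "Answer the multiple-choice question with a single option letter."
            ),
        })
    messages.append({"role": "user", "content": question})
    return messages


# Example: one CNMLE-style question routed to a cardiology persona.
msgs = build_messages(
    "Which drug is first-line for stable angina? A... B... C... D... E...",
    subspecialty="cardiology",
)
```

Without the `subspecialty` argument, the same helper produces the role-free baseline condition, so the two study arms differ only in the presence of the system message.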

RESULTS

GPT-4.0 achieved a passing accuracy of 72.7%, which was significantly higher than that of GPT-3.5 (54%; P<.001). The variability rate of repeated responses from GPT-4.0 was lower than that of GPT-3.5 (9% vs 19.5%; P<.001). However, both models showed relatively good response coherence, with κ values of 0.778 and 0.610, respectively. System roles numerically increased accuracy for both GPT-4.0 (0.3%-3.7%) and GPT-3.5 (1.3%-4.5%), and reduced variability by 1.7% and 1.8%, respectively (P>.05). In subgroup analysis, ChatGPT achieved comparable accuracy among different question types (P>.05). GPT-4.0 surpassed the accuracy threshold in 14 of 15 subspecialties, while GPT-3.5 did so in 7 of 15 on the first response.
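The reported κ values (0.778 for GPT-4.0, 0.610 for GPT-3.5) quantify chance-corrected agreement across repeated runs of the same questions. The abstract does not state which κ variant was used for 8 to 12 repetitions, so as a simplified illustration the sketch below computes Cohen's kappa between just two repeated answer sequences, from scratch in plain Python.

```python
from collections import Counter


def cohen_kappa(run_a, run_b):
    """Cohen's kappa between two repeated answer sequences for the same questions.

    kappa = (p_observed - p_expected) / (1 - p_expected), where p_expected is the
    agreement expected by chance from each run's marginal answer frequencies.
    Undefined (division by zero) when expected agreement is exactly 1.
    """
    assert len(run_a) == len(run_b) and len(run_a) > 0
    n = len(run_a)
    p_observed = sum(a == b for a, b in zip(run_a, run_b)) / n
    freq_a, freq_b = Counter(run_a), Counter(run_b)
    p_expected = sum(freq_a[k] * freq_b.get(k, 0) for k in freq_a) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)


# Two repeated runs over five questions: four answers agree, one differs.
k = cohen_kappa(list("ABCDA"), list("ABCDB"))
```

A κ of 1.0 means the model gave identical answers on every repetition; values in the 0.6-0.8 range, as observed here, indicate substantial but imperfect run-to-run consistency.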

CONCLUSIONS

GPT-4.0 passed the CNMLE and outperformed GPT-3.5 in key areas such as accuracy, consistency, and medical subspecialty expertise. Adding a system role produced numerical but statistically nonsignificant improvements in the model's reliability and answer coherence. GPT-4.0 showed promising potential in medical education and clinical practice, meriting further study.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8bcf/11336778/5486eef78ae1/mededu-v10-e52784-g001.jpg
