Department of Bone and Joint Surgery and Sports Medicine Center, The First Affiliated Hospital, Guangzhou, China.
Department of Joint Surgery and Sports Medicine, Zhuhai People's Hospital, Zhuhai City, China.
JMIR Med Educ. 2024 Oct 3;10:e52746. doi: 10.2196/52746.
The creation of large language models (LLMs) such as ChatGPT marks an important step in the development of artificial intelligence, and their powerful language understanding and generation capabilities give them great potential in medical education. The purpose of this study was to quantitatively evaluate and comprehensively analyze ChatGPT's performance on questions from the nursing licensure examinations of the United States and China, namely the National Council Licensure Examination for Registered Nurses (NCLEX-RN) and the National Nursing Licensure Examination (NNLE).
This study aims to examine how well LLMs answer NCLEX-RN and NNLE multiple-choice questions (MCQs) presented in different languages, to evaluate whether LLMs can serve as multilingual learning assistants for nursing, and to assess whether they possess a repository of professional knowledge applicable to clinical nursing practice.
First, we compiled 150 NCLEX-RN Practical MCQs, 240 NNLE Theoretical MCQs, and 240 NNLE Practical MCQs. Then, the translation function of ChatGPT 3.5 was used to translate the NCLEX-RN questions from English to Chinese and the NNLE questions from Chinese to English. Finally, the original and translated versions of the MCQs were input into ChatGPT 4.0, ChatGPT 3.5, and Google Bard. The LLMs were compared by accuracy rate, and performance across the two language inputs was compared for each model.
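As an illustration only, and not the authors' actual tooling, the sketch below shows how such an evaluation loop might be scored. The `query_llm` function and the `"prompt"`/`"answer"` keys are hypothetical stand-ins for whichever chat interface and question format are used.

```python
# Minimal scoring sketch (illustrative only; not the study's actual pipeline).
# `query_llm` is a hypothetical callable that takes the full MCQ text and
# returns the model's chosen option letter, e.g. "A".
from typing import Callable


def score_mcqs(questions: list[dict], query_llm: Callable[[str], str]) -> float:
    """Return an LLM's accuracy on a list of multiple-choice questions.

    Each question dict is assumed to hold the full prompt text (stem plus
    options) under "prompt" and the correct option letter under "answer".
    """
    correct = 0
    for q in questions:
        predicted = query_llm(q["prompt"]).strip().upper()[:1]  # first letter only
        if predicted == q["answer"].upper():
            correct += 1
    return correct / len(questions)
```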
The accuracy rates of ChatGPT 4.0 on NCLEX-RN Practical MCQs and their Chinese translations were 88.7% (133/150) and 79.3% (119/150), respectively. Although the difference was statistically significant (P=.03), the accuracy was generally satisfactory. ChatGPT 4.0 correctly answered 71.9% (169/235) of NNLE Theoretical MCQs and 69.1% (161/233) of NNLE Practical MCQs. Its accuracy on the English translations of the NNLE Theoretical and NNLE Practical MCQs was 71.5% (168/235; P=.92) and 67.8% (158/233; P=.77), respectively, with no statistically significant difference between the two language inputs. With English input, ChatGPT 3.5 (NCLEX-RN P=.003, NNLE Theoretical P<.001, NNLE Practical P=.12) and Google Bard (NCLEX-RN P<.001, NNLE Theoretical P<.001, NNLE Practical P<.001) had lower accuracy on nursing-related MCQs than ChatGPT 4.0. For ChatGPT 3.5, accuracy with English input was higher than with Chinese input, and the difference was statistically significant (NCLEX-RN P=.02, NNLE Practical P=.02). Whether the MCQs from the NCLEX-RN and NNLE were submitted in Chinese or English, ChatGPT 4.0 had the highest number of unique correct responses and the lowest number of unique incorrect responses among the 3 LLMs.
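For context, the between-condition P values above are consistent with comparisons of two proportions; a chi-square test on the 2x2 table of correct/incorrect counts is one common way to make such a comparison, although the abstract does not state which test the authors used. A minimal sketch, using the ChatGPT 4.0 NCLEX-RN counts reported above, is shown below.

```python
# Illustrative proportion comparison (not necessarily the authors' test):
# ChatGPT 4.0 on the NCLEX-RN, English input (133/150 correct) vs
# Chinese-translated input (119/150 correct).
from scipy.stats import chi2_contingency

table = [
    [133, 150 - 133],  # English input: correct, incorrect
    [119, 150 - 119],  # Chinese-translated input: correct, incorrect
]

chi2, p_value, dof, expected = chi2_contingency(table, correction=False)
print(f"chi2 = {chi2:.2f}, P = {p_value:.3f}")  # on the order of the reported P=.03
```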
This study of 618 nursing MCQs from the NCLEX-RN and NNLE found that ChatGPT 4.0 outperformed ChatGPT 3.5 and Google Bard in accuracy. It handled both English and Chinese inputs well, underscoring its potential as a valuable tool in nursing education and clinical decision-making.