Zhu Shiben, Hu Wanqin, Yang Zhi, Yan Jiani, Zhang Fang
Department of Infectious Diseases, Nanfang Hospital, Southern Medical University, Guangzhou, China.
State Key Laboratory of Organ Failure Research, Key Laboratory of Infectious Diseases Research in South China, Ministry of Education, Guangdong Provincial Key Laboratory of Viral Hepatitis Research, Guangdong Provincial Clinical Research Center for Viral Hepatitis, Guangdong Institute of Hepatology, Guangzhou, China.
JMIR Med Inform. 2025 Jan 10;13:e63731. doi: 10.2196/63731.
Large language models (LLMs) have been proposed as valuable tools in medical education and practice. The Chinese National Nursing Licensing Examination (CNNLE) presents unique challenges for LLMs due to its requirement for both deep domain-specific nursing knowledge and the ability to make complex clinical decisions, which differentiates it from more general medical examinations. However, their potential application in the CNNLE remains unexplored.
This study aims to evaluate the accuracy of 7 LLMs, namely GPT-3.5, GPT-4.0, GPT-4o, Copilot, ERNIE Bot-3.5, SPARK, and Qwen-2.5, on the CNNLE, focusing on their ability to handle domain-specific nursing knowledge and clinical decision-making. We also explore whether combining their outputs using machine learning techniques can improve their overall accuracy.
This retrospective cross-sectional study analyzed all 1200 multiple-choice questions from the CNNLE conducted between 2019 and 2023. Seven LLMs were evaluated on these multiple-choice questions, and 9 machine learning models, including Logistic Regression, Support Vector Machine, Multilayer Perceptron, k-nearest neighbors, Random Forest, LightGBM, AdaBoost, XGBoost, and CatBoost, were used to optimize overall performance through ensemble techniques.
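The ensemble step can be pictured as follows. This is a minimal illustrative sketch, not the study's actual pipeline: it assumes each LLM's chosen option is one-hot encoded into a feature vector that a downstream classifier (XGBoost in the best-performing configuration reported below) would consume, with a majority-vote baseline shown for comparison. The option set, function names, and feature design are all assumptions for illustration.

```python
from collections import Counter

# Assumed CNNLE multiple-choice option set (illustration only)
OPTIONS = "ABCDE"

def encode_answers(llm_answers):
    """One-hot encode each model's chosen option into one flat feature vector.
    llm_answers: list of 7 single-letter choices, one per LLM."""
    vec = []
    for ans in llm_answers:
        vec.extend(1 if opt == ans else 0 for opt in OPTIONS)
    return vec  # 7 models x 5 options = 35 features per question

def majority_vote(llm_answers):
    """Simple ensemble baseline: the most frequent choice among the models."""
    return Counter(llm_answers).most_common(1)[0][0]

# Example: hypothetical answers from the 7 LLMs on one question
answers = ["A", "A", "B", "A", "C", "A", "A"]
features = encode_answers(answers)
print(len(features))          # 35
print(majority_vote(answers)) # A
```

In a stacking setup like the one described, vectors of this kind (paired with the known correct options from past examinations) would form the training data for the 9 candidate classifiers, which learn which models to trust on which kinds of questions rather than weighting all models equally.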
Qwen-2.5 achieved the highest overall accuracy of 88.9%, followed by GPT-4o (80.7%), ERNIE Bot-3.5 (78.1%), GPT-4.0 (70.3%), SPARK (65.0%), and GPT-3.5 (49.5%). Qwen-2.5 demonstrated superior accuracy in the Practical Skills section compared with the Professional Practice section across most years. It also performed well on questions with brief clinical case summaries and on questions sharing a common clinical scenario. When the outputs of the 7 LLMs were combined using the 9 machine learning models, XGBoost yielded the best performance, increasing accuracy to 90.8%. XGBoost also achieved an area under the curve of 0.961, sensitivity of 0.905, specificity of 0.978, F-score of 0.901, positive predictive value of 0.901, and negative predictive value of 0.977.
This study is the first to evaluate the performance of 7 LLMs on the CNNLE, and it shows that integrating their outputs via machine learning significantly boosts accuracy, reaching 90.8%. These findings demonstrate the transformative potential of LLMs in health care education and call for further research to refine their capabilities and expand their impact on examination preparation and professional training.