

Large Language Models and the North American Pharmacist Licensure Examination (NAPLEX) Practice Questions.

Affiliations

University of North Carolina at Chapel Hill, UNC Eshelman School of Pharmacy, Division of Pharmaceutical Outcomes and Policy, Chapel Hill, NC, USA.

Stanford University School of Medicine, Department of Biomedical Data Science, Stanford, CA, USA.

Publication Information

Am J Pharm Educ. 2024 Nov;88(11):101294. doi: 10.1016/j.ajpe.2024.101294. Epub 2024 Sep 20.

DOI:10.1016/j.ajpe.2024.101294
PMID:39307190
Abstract

OBJECTIVE

This study aims to test the accuracy of large language models (LLMs) in answering standardized pharmacy examination practice questions.

METHODS

The performance of 3 LLMs (generative pretrained transformer [GPT]-3.5, GPT-4, and Chatsonic) was evaluated on 2 independent North American Pharmacist Licensure Examination practice question sets sourced from McGraw Hill and RxPrep. These question sets were further classified into binary question categories of adverse drug reaction (ADR) questions, scenario questions, treatment questions, and select-all questions. Python was used to run χ² tests to compare model and question-type accuracy.
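The χ² comparison described above can be sketched in pure Python. This is an illustrative reconstruction, not the authors' code: the correct/incorrect counts below are hypothetical, and the function computes the standard Pearson chi-square statistic for a 2×2 contingency table of question type versus answer correctness.

```python
def chi_square_2x2(table):
    """Pearson chi-square statistic for a 2x2 contingency table
    given as [[a, b], [c, d]] (rows: groups; cols: correct/incorrect)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    stat = 0.0
    for obs, row, col in ((a, row1, col1), (b, row1, col2),
                          (c, row2, col1), (d, row2, col2)):
        expected = row * col / n          # expected count under independence
        stat += (obs - expected) ** 2 / expected
    return stat

# Hypothetical counts: correct vs incorrect answers by question category.
table = [[38, 14],    # select-all questions: correct, incorrect
         [96, 52]]    # non-select-all questions: correct, incorrect
stat = chi_square_2x2(table)
# Compare against the 5% critical value of 3.841 for 1 degree of freedom.
print(round(stat, 4))
```

In practice this would more likely be done with `scipy.stats.chi2_contingency`, which also returns the p-value and applies Yates' continuity correction for 2×2 tables by default.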

RESULTS

Of the 3 LLMs tested, GPT-4 achieved the highest accuracy, with 87% accuracy on the McGraw Hill question set and 83.5% accuracy on the RxPrep question set. In comparison, GPT-3.5 had 68.0% and 60.0% accuracy on those question sets, respectively, and Chatsonic had 60.5% and 62.5% accuracy, respectively. All models performed worse on select-all questions compared with non-select-all questions (GPT-3.5: 42.3% vs 66.2%; GPT-4: 73.1% vs 87.2%; Chatsonic: 36.5% vs 71.6%). GPT-4 had statistically higher accuracy in answering ADR questions (96.1%) compared with non-ADR questions (83.9%).

CONCLUSION

Our study found that GPT-4 outperformed GPT-3.5 and Chatsonic in answering North American Pharmacist Licensure Examination (NAPLEX) practice questions, particularly excelling in answering questions related to ADRs. These results suggest that advanced LLMs such as GPT-4 could be used for applications in pharmacy education.


Similar Articles

1
Large Language Models and the North American Pharmacist Licensure Examination (NAPLEX) Practice Questions.
Am J Pharm Educ. 2024 Nov;88(11):101294. doi: 10.1016/j.ajpe.2024.101294. Epub 2024 Sep 20.
2
The NAPLEX: evolution, purpose, scope, and educational implications.
Am J Pharm Educ. 2008 Apr 15;72(2):33. doi: 10.5688/aj720233.
3
Association Between Accreditation Era, North American Pharmacist Licensure Examination Testing Changes, and First-Time Pass Rates.
Am J Pharm Educ. 2023 Apr;87(3):ajpe8994. doi: 10.5688/ajpe8994. Epub 2022 Jul 15.
4
Frequency of Course Remediation and the Effect on North American Pharmacist Licensure Examination Pass Rates.
Am J Pharm Educ. 2023 Mar;87(2):ajpe8894. doi: 10.5688/ajpe8894. Epub 2022 Apr 8.
5
Stratified Evaluation of GPT's Question Answering in Surgery Reveals Artificial Intelligence (AI) Knowledge Gaps.
Cureus. 2023 Nov 14;15(11):e48788. doi: 10.7759/cureus.48788. eCollection 2023 Nov.
6
Evaluating Large Language Models for the National Premedical Exam in India: Comparative Analysis of GPT-3.5, GPT-4, and Bard.
JMIR Med Educ. 2024 Feb 21;10:e51523. doi: 10.2196/51523.
7
The North American Pharmacist Licensure Examination (NAPLEX) Pass Rate Conundrum.
Am J Pharm Educ. 2024 May;88(5):100701. doi: 10.1016/j.ajpe.2024.100701. Epub 2024 Apr 17.
8
Capability of GPT-4V(ision) in the Japanese National Medical Licensing Examination: Evaluation Study.
JMIR Med Educ. 2024 Mar 12;10:e54393. doi: 10.2196/54393.
9
Influence of Model Evolution and System Roles on ChatGPT's Performance in Chinese Medical Licensing Exams: Comparative Study.
JMIR Med Educ. 2024 Aug 13;10:e52784. doi: 10.2196/52784.
10
Performance of ChatGPT on Nursing Licensure Examinations in the United States and China: Cross-Sectional Study.
JMIR Med Educ. 2024 Oct 3;10:e52746. doi: 10.2196/52746.

Cited By

1
Evaluating the Accuracy, Reliability, Consistency, and Readability of Different Large Language Models in Restorative Dentistry.
J Esthet Restor Dent. 2025 Jul;37(7):1740-1752. doi: 10.1111/jerd.13447. Epub 2025 Mar 2.