

Large Language Models and the North American Pharmacist Licensure Examination (NAPLEX) Practice Questions.

Affiliations

University of North Carolina at Chapel Hill, UNC Eshelman School of Pharmacy, Division of Pharmaceutical Outcomes and Policy, Chapel Hill, NC, USA.

Stanford University School of Medicine, Department of Biomedical Data Science, Stanford, CA, USA.

Publication Information

Am J Pharm Educ. 2024 Nov;88(11):101294. doi: 10.1016/j.ajpe.2024.101294. Epub 2024 Sep 20.

DOI:10.1016/j.ajpe.2024.101294
PMID:39307190
Abstract

OBJECTIVE

This study aims to test the accuracy of large language models (LLMs) in answering standardized pharmacy examination practice questions.

METHODS

The performance of 3 LLMs (generative pretrained transformer [GPT]-3.5, GPT-4, and Chatsonic) was evaluated on 2 independent North American Pharmacist Licensure Examination practice question sets sourced from McGraw Hill and RxPrep. These question sets were further classified into binary question categories of adverse drug reaction (ADR) questions, scenario questions, treatment questions, and select-all questions. Python was used to run χ² tests to compare model and question-type accuracy.
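The χ² comparison described above can be sketched in pure Python. This is an illustrative reconstruction, not the authors' code: the correct/incorrect counts below are hypothetical, and the function computes the standard Pearson chi-square statistic for a 2×2 contingency table of question type versus answer correctness.

```python
def chi_square_2x2(table):
    """Pearson chi-square statistic for a 2x2 contingency table
    given as [[a, b], [c, d]] (rows: groups; cols: correct/incorrect)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    stat = 0.0
    for obs, row, col in ((a, row1, col1), (b, row1, col2),
                          (c, row2, col1), (d, row2, col2)):
        expected = row * col / n          # expected count under independence
        stat += (obs - expected) ** 2 / expected
    return stat

# Hypothetical counts: correct vs incorrect answers by question category.
table = [[38, 14],    # select-all questions: correct, incorrect
         [96, 52]]    # non-select-all questions: correct, incorrect
stat = chi_square_2x2(table)
# Compare against the 5% critical value of 3.841 for 1 degree of freedom.
print(round(stat, 4))
```

In practice this would more likely be done with `scipy.stats.chi2_contingency`, which also returns the p-value and applies Yates' continuity correction for 2×2 tables by default.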

RESULTS

Of the 3 LLMs tested, GPT-4 achieved the highest accuracy, with 87% accuracy on the McGraw Hill question set and 83.5% accuracy on the RxPrep question set. In comparison, GPT-3.5 had 68.0% and 60.0% accuracy on those question sets, respectively, and Chatsonic had 60.5% and 62.5% accuracy, respectively. All models performed worse on select-all questions compared with non-select-all questions (GPT-3.5: 42.3% vs 66.2%; GPT-4: 73.1% vs 87.2%; Chatsonic: 36.5% vs 71.6%). GPT-4 had statistically higher accuracy in answering ADR questions (96.1%) compared with non-ADR questions (83.9%).

CONCLUSION

Our study found that GPT-4 outperformed GPT-3.5 and Chatsonic in answering North American Pharmacist Licensure Examination (NAPLEX) practice questions, particularly excelling in answering questions related to ADRs. These results suggest that advanced LLMs such as GPT-4 could be used for applications in pharmacy education.


Similar Articles

1
Large Language Models and the North American Pharmacist Licensure Examination (NAPLEX) Practice Questions.
Am J Pharm Educ. 2024 Nov;88(11):101294. doi: 10.1016/j.ajpe.2024.101294. Epub 2024 Sep 20.
2
The NAPLEX: evolution, purpose, scope, and educational implications.
Am J Pharm Educ. 2008 Apr 15;72(2):33. doi: 10.5688/aj720233.
3
Association Between Accreditation Era, North American Pharmacist Licensure Examination Testing Changes, and First-Time Pass Rates.
Am J Pharm Educ. 2023 Apr;87(3):ajpe8994. doi: 10.5688/ajpe8994. Epub 2022 Jul 15.
4
Frequency of Course Remediation and the Effect on North American Pharmacist Licensure Examination Pass Rates.
Am J Pharm Educ. 2023 Mar;87(2):ajpe8894. doi: 10.5688/ajpe8894. Epub 2022 Apr 8.
5
Stratified Evaluation of GPT's Question Answering in Surgery Reveals Artificial Intelligence (AI) Knowledge Gaps.
Cureus. 2023 Nov 14;15(11):e48788. doi: 10.7759/cureus.48788. eCollection 2023 Nov.
6
Evaluating Large Language Models for the National Premedical Exam in India: Comparative Analysis of GPT-3.5, GPT-4, and Bard.
JMIR Med Educ. 2024 Feb 21;10:e51523. doi: 10.2196/51523.
7
The North American Pharmacist Licensure Examination (NAPLEX) Pass Rate Conundrum.
Am J Pharm Educ. 2024 May;88(5):100701. doi: 10.1016/j.ajpe.2024.100701. Epub 2024 Apr 17.
8
Capability of GPT-4V(ision) in the Japanese National Medical Licensing Examination: Evaluation Study.
JMIR Med Educ. 2024 Mar 12;10:e54393. doi: 10.2196/54393.
9
Influence of Model Evolution and System Roles on ChatGPT's Performance in Chinese Medical Licensing Exams: Comparative Study.
JMIR Med Educ. 2024 Aug 13;10:e52784. doi: 10.2196/52784.
10
Performance of ChatGPT on Nursing Licensure Examinations in the United States and China: Cross-Sectional Study.
JMIR Med Educ. 2024 Oct 3;10:e52746. doi: 10.2196/52746.

Cited By

1
Evaluating the Accuracy, Reliability, Consistency, and Readability of Different Large Language Models in Restorative Dentistry.
J Esthet Restor Dent. 2025 Jul;37(7):1740-1752. doi: 10.1111/jerd.13447. Epub 2025 Mar 2.