Shultz Benjamin, DiDomenico Robert J, Goliak Kristen, Mucksavage Jeffrey
University of Illinois Chicago, Retzky College of Pharmacy, Chicago, IL, USA.
Am J Pharm Educ. 2025 May;89(5):101405. doi: 10.1016/j.ajpe.2025.101405. Epub 2025 Apr 15.
To evaluate the effectiveness of GPT-4 in generating valid multiple-choice exam items for assessing therapeutic knowledge in pharmacy education.
A custom GPT application was developed to create 60 case-based items from a pharmacotherapy textbook. Nine subject matter experts reviewed items for content validity, difficulty, and quality. Valid items were compiled into a 38-question exam administered to 46 fourth-year pharmacy students. Classical test theory and Rasch analysis were used to assess psychometric properties.
Of 60 generated items, 38 met content validity requirements, with only 6 accepted without revisions. The exam demonstrated moderate reliability and correlated well with a prior cumulative therapeutics exam. Classical item analysis revealed that most items had acceptable point biserial correlations, though fewer than half fell within the recommended difficulty range. Rasch analysis indicated potential multidimensionality and suboptimal targeting of item difficulty to student ability.
GPT-4 offers a preliminary step toward generating exam content in pharmacy education but has clear limitations that require further investigation and validation. Substantial human oversight and psychometric evaluation are necessary to ensure clinical realism and appropriate difficulty. Future research with larger samples is needed to further validate the effectiveness of artificial intelligence in item generation for high-stakes assessments in pharmacy education.
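The classical item analysis named in the abstract rests on two per-item statistics: difficulty (the proportion of examinees answering correctly) and the point-biserial correlation between an item score and the score on the rest of the test. The sketch below illustrates both on invented toy data. It is not the authors' analysis code: the function names `pearson` and `item_statistics` are hypothetical, the response matrix is made up, and the use of the corrected (rest-score) point-biserial is an assumption, since the abstract does not say which variant was used.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists.

    Returns None when either list has zero variance, because the
    correlation is undefined in that case.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / sqrt(sxx * syy)


def item_statistics(responses):
    """Compute classical item statistics.

    responses: one row per student, one 0/1 score per item.
    Assumption: the point-biserial is computed against the rest score
    (total minus the item itself), so an item is not correlated with itself.
    """
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    stats = []
    for j in range(n_items):
        item = [row[j] for row in responses]
        rest = [total - score for total, score in zip(totals, item)]
        stats.append({
            "item": j + 1,
            "difficulty": sum(item) / len(item),  # proportion correct
            "point_biserial": pearson(item, rest),
        })
    return stats


# Invented toy data: 4 students x 3 items (not from the study).
toy_responses = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 0, 0],
]

for row in item_statistics(toy_responses):
    print(row)
```

In a setup like this, the cutoffs for an "acceptable" point-biserial or a "recommended" difficulty range would be supplied by the analyst. The abstract does not state the thresholds used, so none are hard-coded here.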