Dale Esther Dasari, Abulela Mohammed A A, Jia Hao, Violato Claudio
University of Minnesota Medical School, 420 Delaware Street SE, Mayo Building, Minneapolis, MN, 55455, USA.
Department of Educational Psychology, University of Minnesota, Minneapolis, MN, 55455, USA.
BMC Med Educ. 2025 Jan 29;25(1):146. doi: 10.1186/s12909-024-06540-6.
A common practice in assessment development, fundamental for fairness and consequently the validity of test score interpretations and uses, is to ascertain whether test items function equally across test-taker groups. Accordingly, we conducted differential item functioning (DIF) analysis, a psychometric procedure for detecting potential item bias, for three preclinical medical school foundational courses based on students' sex and race.
The sample comprised 520, 519, and 344 medical students for anatomy, histology, and physiology, respectively, with data collected from 2018 to 2020. To conduct DIF analysis, we used the Wald test based on the two-parameter logistic model, as implemented in the IRTPRO software.
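The study itself ran an IRT-based Wald test in the commercial IRTPRO package, whose internals are not reproduced here. As a rough illustration of the underlying idea, the sketch below implements a simpler, widely used alternative: logistic-regression screening for uniform DIF, where an item is flagged when the group coefficient (sex or race indicator) remains significant after conditioning on an ability proxy such as the total test score. The function name and interface are hypothetical, not from the paper.

```python
# Hypothetical sketch: logistic-regression uniform-DIF screening.
# NOT the paper's method (the paper used an IRT 2PL Wald test in IRTPRO);
# this is a common, simpler stand-in shown for illustration only.
import numpy as np

def logistic_dif_wald(responses, ability, group):
    """Wald z statistic for a group effect on one item.

    responses : 0/1 array of item scores
    ability   : ability proxy (e.g., total or rest score on the test)
    group     : 0/1 array (0 = reference group, 1 = focal group)

    Fits logit P(correct) = b0 + b1*ability + b2*group by Newton-Raphson
    and returns z = b2 / SE(b2); |z| > 1.96 flags uniform DIF at alpha = .05.
    """
    X = np.column_stack([np.ones(len(ability)),
                         np.asarray(ability, dtype=float),
                         np.asarray(group, dtype=float)])
    y = np.asarray(responses, dtype=float)
    beta = np.zeros(3)
    for _ in range(50):                        # Newton-Raphson iterations
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1.0 - p)                      # Bernoulli variances
        H = X.T @ (X * W[:, None])             # Fisher information matrix
        step = np.linalg.solve(H, X.T @ (y - p))
        beta += step
        if np.max(np.abs(step)) < 1e-8:
            break
    se = np.sqrt(np.linalg.inv(H)[2, 2])       # SE of the group coefficient
    return beta[2] / se
```

In practice the ability proxy would be the rest score (total score excluding the studied item) to avoid contaminating the matching variable, and a correction for multiple comparisons would be applied across the test's items.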
In the three assessments, as many as one-fifth of the items functioned differentially across sex, race, or both: 10 of 49 items (20%), 6 of 40 items (15%), and 5 of 45 items (11%) showed statistically significant DIF for the Anatomy, Histology, and Physiology courses, respectively. Measurement specialists and subject matter experts independently reviewed the flagged items to identify construct-irrelevant factors as potential sources of DIF, as demonstrated in Appendix A. Most identified items were poorly written or contained unclear images.
The validity of score-based inferences, particularly for group comparisons, requires test items to function equally across test-taker groups. In the present study, we found DIF for sex and race in some items across three content areas. The present approach should be replicated in other medical schools to assess the generalizability of these findings. Item-level DIF analysis should also be conducted routinely as part of the psychometric evaluation of basic science courses and other assessments.
Trial registration: Not applicable.