Institute of Health, Jimma University, Jimma, Ethiopia.
Faculty of Medicine, Institute of Health and Society, University of Oslo, Oslo, Norway.
BMC Med Educ. 2024 Sep 16;24(1):1016. doi: 10.1186/s12909-024-06012-x.
The ability of experts' item difficulty ratings to predict test-takers' actual performance is an important aspect of licensure examinations. Expert judgment serves as a primary source of information for making prior decisions about the pass rate of test takers. The nature of the raters involved in predicting item difficulty is central to setting credible standards. Therefore, this study aimed to assess and compare raters' predicted and actual difficulty of Multiple-Choice Questions (MCQs) on the undergraduate medicine licensure examination (UGMLE) in Ethiopia.
Responses of 815 examinees to 200 MCQs were used in this study, together with the item difficulty ratings of seven physicians who participated in the standard-setting of the UGMLE. Analyses were then conducted to understand how experts' ratings varied in predicting examinees' actual difficulty levels. Descriptive statistics were used to profile the mean rater-predicted and actual difficulty values for the MCQs, and ANOVA was used to compare mean differences in raters' predictions of item difficulty. Regression analysis was used to examine interrater variation in item difficulty predictions relative to the actual difficulty and to compute the proportion of variance in actual difficulty explained by the raters' predictions.
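A minimal sketch of this kind of analysis is shown below, assuming a hypothetical 815 x 200 response matrix and a 7 x 200 matrix of rater difficulty estimates; the variable names and randomly generated data are illustrative only and are not taken from the study.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical data: 815 examinees x 200 MCQs (1 = correct, 0 = incorrect)
responses = rng.integers(0, 2, size=(815, 200))
# Hypothetical ratings: 7 raters x 200 MCQs, each a predicted proportion correct
ratings = rng.uniform(0.2, 0.9, size=(7, 200))

# Actual item difficulty: proportion of examinees answering each item correctly
actual_difficulty = responses.mean(axis=0)        # shape (200,)

# Mean predicted difficulty across raters, per item
predicted_difficulty = ratings.mean(axis=0)       # shape (200,)

# Correlation between predicted and actual difficulty (done per domain in the study)
r, p = stats.pearsonr(predicted_difficulty, actual_difficulty)

# One-way ANOVA: do mean difficulty ratings differ between raters?
f_anova, p_anova = stats.f_oneway(*ratings)

# Regression: proportion of variance in actual difficulty explained by rater predictions
X = np.column_stack([np.ones(200), ratings.T])    # intercept + one column per rater
beta, *_ = np.linalg.lstsq(X, actual_difficulty, rcond=None)
fitted = X @ beta
ss_res = np.sum((actual_difficulty - fitted) ** 2)
ss_tot = np.sum((actual_difficulty - actual_difficulty.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
print(f"r = {r:.2f}, ANOVA F = {f_anova:.2f}, R^2 = {r_squared:.2f}")
```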
In this study, the mean difference between raters' predictions and examinees' actual performance was inconsistent across exam domains. There was a statistically significant strong positive correlation between actual and predicted item difficulty in exam domains eight and eleven, whereas a non-significant, very weak positive correlation was found in exam domains seven and twelve. Multiple comparison analysis showed significant differences in mean item difficulty ratings between raters. In the regression analysis, experts' item difficulty ratings explained 33% of the variance in the actual difficulty levels of the UGMLE. The regression model also showed a statistically significant moderate positive correlation (R = 0.57), F(6, 193) = 15.58, P = 0.001.
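As a quick arithmetic check (assuming the standard OLS F-test with 6 predictor terms and 193 residual degrees of freedom, consistent with the 200 items), the reported R of 0.57 is in line with the reported 33% variance explained and the reported F statistic:

```python
# Consistency check: R = 0.57 implies R^2 of roughly 0.32-0.33 (given rounding),
# and with k = 6 predictors on n = 200 items the OLS F statistic is close to 15.58.
R = 0.57
k, n = 6, 200
r_squared = R ** 2                                        # ~0.32
F = (r_squared / k) / ((1 - r_squared) / (n - k - 1))     # ~15.5
print(round(r_squared, 2), round(F, 2))
```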
This study demonstrated the complexity of estimating the difficulty level of MCQs on the UGMLE and emphasized the benefits of using experts' ratings in advance. To ensure that the examination continues to yield reliable and valid scores, raters' accuracy on the UGMLE must be improved; achieving this requires techniques that keep pace with evolving assessment methodologies.