Department of Health Research Methods, Evidence and Impact, McMaster University, Hamilton, Canada.
Touchstone Institute, Toronto, Canada.
Perspect Med Educ. 2018 Apr;7(2):110-119. doi: 10.1007/s40037-018-0410-4.
Tablet-based assessments offer benefits over scannable-paper assessments; however, little is known about their impact on the variability of assessment scores.
Two studies were conducted to evaluate changes in rating technology. Rating modality (paper vs tablets) was manipulated between candidates (Study 1) and within candidates (Study 2). Average scores were analyzed using repeated measures ANOVA, Cronbach's alpha and generalizability theory. Post-hoc analyses included a Rasch analysis and McDonald's omega.
Study 1 revealed a main effect of modality (F(1,152) = 25.06, p < 0.01). Average tablet-based scores were higher (3.39/5, 95% CI = 3.28 to 3.51) than average scan-sheet scores (3.00/5, 95% CI = 2.90 to 3.11). Study 2 also revealed a main effect of modality (F(1,88) = 15.64, p < 0.01); however, the difference was reduced to 2%, with scan-sheet scores higher (3.36, 95% CI = 3.30 to 3.42) than tablet scores (3.27, 95% CI = 3.21 to 3.33). Internal consistency (alpha and omega) remained high (>0.8) and inter-station reliability remained constant (0.3). Rasch analyses showed no relationship between station difficulty and rating modality.
Analyses of average scores may be misleading without an understanding of internal consistency and overall reliability of scores. Although updating to tablet-based forms did not result in systematic variations in scores, routine analyses ensured accurate interpretation of the variability of assessment scores.
This study demonstrates the importance of ongoing program evaluation and data analysis.
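For readers unfamiliar with Cronbach's alpha, one of the internal-consistency statistics named in the methods, the following is a minimal sketch of the standard formula, alpha = k/(k-1) × (1 - sum of item variances / variance of total scores). It is not the authors' analysis code; the function name `cronbach_alpha` and the example ratings are hypothetical and chosen only for illustration.

```python
from statistics import variance


def cronbach_alpha(ratings):
    """Cronbach's alpha for a candidates-by-items table of scores.

    `ratings` is a list of rows, one per candidate; each row holds that
    candidate's score on every item (all rows must have the same length).
    """
    k = len(ratings[0])
    if k < 2 or len(ratings) < 2:
        raise ValueError("need at least two items and two candidates")
    if any(len(row) != k for row in ratings):
        raise ValueError("every candidate needs a score on every item")

    # Sample variance of each item (column) across candidates.
    item_variance_sum = sum(variance(column) for column in zip(*ratings))
    # Sample variance of each candidate's total score.
    total_variance = variance([sum(row) for row in ratings])
    if total_variance == 0:
        raise ValueError("total scores have zero variance")

    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


# Hypothetical ratings on a 5-point scale (4 candidates, 3 items);
# these numbers are illustrative and do not come from the study.
example_ratings = [
    [3, 4, 3],
    [2, 2, 3],
    [5, 4, 4],
    [4, 5, 5],
]
alpha = cronbach_alpha(example_ratings)
```

Values close to 1 indicate that the items rank candidates consistently, which is the sense in which the abstract describes internal consistency above 0.8 as high.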