Faculty of Medicine, Université de Montréal, Montreal, Canada.
CHU Sainte-Justine, 3175 Chemin de la Côte-Sainte-Catherine, Montreal, QC, H3T 1C5, Canada.
Adv Health Sci Educ Theory Pract. 2021 Mar;26(1):37-51. doi: 10.1007/s10459-020-09970-1. Epub 2020 May 6.
When determining the score given to candidates in multiple mini-interview (MMI) stations, raters have to translate a narrative judgment into an ordinal rating scale. When individual scores are added to calculate the final ranking, it is generally presumed that the values of possible scores on the evaluation grid are separated by constant intervals, following a linear function, although this assumption is seldom validated with raters themselves. Inaccurate interval values could lead to systemic bias that could potentially distort candidates' final cumulative scores. The aim of this study was to establish rating scale values based on raters' intent, to validate these with an independent quantitative method, to explore their impact on final score, and to appraise their meaning according to experienced MMI interviewers. A 4-round consensus-group exercise was independently conducted with 42 MMI interviewers who were asked to determine relative values for the 6-point rating scale (from A to F) used in the Canadian integrated French MMI (IFMMI). In parallel, relative values were also calculated for each option of the scale by comparing the average scores concurrently given to the same individual in other stations every time that option was selected during three consecutive IFMMI years. Data from the same three cohorts were used to simulate the impact of using new score values on final rankings. Comments from the consensus-group exercise were reviewed independently by two authors to explore raters' rationale for choosing specific values. Relative to the maximum (A = 100%) and minimum (F = 0%), experienced raters arrived at values of 86.7% (95% CI 86.3-87.1), 69.5% (68.9-70.1), 51.2% (50.6-51.8), and 29.3% (28.1-30.5) for scores of B, C, D and E, respectively. The concurrent score approach was based on 43,412 IFMMI stations performed by 4345 medical school applicants. It provided quasi-identical values of 87.1% (82.4-91.5), 70.4% (66.1-74.7), 51.2% (47.1-55.3) and 31.8% (27.9-35.7), respectively. Qualitative analysis indicated that while high scores are usually based on minor details of relatively low importance, low scores are usually given for more serious offenses and were assumed by the raters to carry more weight in the final score. Individual drops or increases in final MMI ranking with the use of new scale values ranged from -21 to +5 percentiles, with the average candidate changing by ± 1.4 percentiles. Consulting with experienced interviewers is a simple and effective approach to establish rating scale values that truly reflect raters' intent in MMI, thus improving the accuracy of the instrument and contributing to the general fairness of the process.
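The effect of replacing equal-interval values with rater-derived ones can be illustrated with a small sketch. It is not the authors' analysis code. The consensus values are those reported above; the equal-interval scale (20-point steps between F = 0% and A = 100%) is an assumed reading of the "linear function" mentioned in the abstract, and the function names and the candidate's ratings are hypothetical. The helper `rescale_to_anchors` shows one plausible way of expressing per-option averages relative to A = 100% and F = 0%, as the concurrent score approach reports them.

```python
from statistics import mean

# Assumed equal-interval values (20-point steps); not stated numerically in the abstract.
LINEAR = {"A": 100.0, "B": 80.0, "C": 60.0, "D": 40.0, "E": 20.0, "F": 0.0}

# Consensus-group values reported in the abstract (A and F are the fixed anchors).
CONSENSUS = {"A": 100.0, "B": 86.7, "C": 69.5, "D": 51.2, "E": 29.3, "F": 0.0}


def final_score(ratings, scale):
    """Average the scale values of a candidate's letter ratings across stations."""
    return mean(scale[r] for r in ratings)


def rescale_to_anchors(raw):
    """Linearly map raw per-option averages so that A becomes 100 and F becomes 0."""
    low, high = raw["F"], raw["A"]
    return {opt: 100.0 * (value - low) / (high - low) for opt, value in raw.items()}


# Hypothetical candidate with ten station ratings (invented for illustration).
ratings = ["A", "B", "B", "C", "B", "A", "D", "B", "C", "E"]

linear_score = final_score(ratings, LINEAR)
consensus_score = final_score(ratings, CONSENSUS)
print(f"equal-interval: {linear_score:.2f}  consensus: {consensus_score:.2f}")
```

For this invented candidate the two scales give roughly 70.0 and 76.6. Because the consensus values sit above the equal-interval ones for B through E, with steps that widen toward the bottom of the scale, candidates with different mixes of ratings gain by different amounts, which is how rankings can shift.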