School of Medicine, University of Leeds, Leeds, LS2 9JT, UK.
Adv Health Sci Educ Theory Pract. 2024 Jul;29(3):919-934. doi: 10.1007/s10459-023-10289-w. Epub 2023 Oct 16.
Systematic differences in OSCE scoring across examiners (often termed examiner stringency) can threaten the validity of examination outcomes. Such effects are usually conceptualised and operationalised solely from checklist/domain scores in a station; global grades are rarely used in this type of analysis. In this work, a large candidate-level exam dataset is analysed to develop a more sophisticated understanding of examiner stringency. Station scores are modelled from global grades, with each candidate, station and examiner allowed to vary in ability, difficulty and stringency respectively. In addition, examiners are allowed to vary in how they discriminate across grades; to our knowledge, this is the first time this has been investigated. Results show that examiners contribute strongly to variance in scoring in two distinct ways: via the traditional conception of score stringency (34% of score variance), but also in how they discriminate in scoring across grades (7%). As one might expect, candidate and station account for only a small amount of score variance at the station level once candidate grades are accounted for (3% and 2% respectively), with the remainder being residual (54%). Investigation of impacts on station-level candidate pass/fail decisions suggests that examiner differential stringency effects combine to give false positive (candidates passing in error) and false negative (candidates failing in error) rates of around 5% each in individual stations, but at the exam level these reduce to 0.4% and 3.3% respectively. This work adds to our understanding of examiner behaviour by demonstrating that examiners can vary in qualitatively different ways in their judgments. For institutions, the key message is that it is important to sample widely from the examiner pool, via a sufficient number of stations, to ensure that OSCE-level decisions are defensible. It also suggests that examiner training should include discussion of global grading and of the combined effect of scoring and grading on candidate outcomes.
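To make the modelling idea concrete, below is a minimal simulation sketch (Python/NumPy) of a model of this general form: station scores regressed on global grades, with examiner-specific intercepts (stringency) and examiner-specific grade slopes (discrimination) alongside candidate and station effects. All sample sizes, variance magnitudes and the pass mark are illustrative assumptions, chosen only so the simulated variance components roughly echo the shares reported above; nothing here is taken from the study's data or code.

# Simulation sketch: examiner stringency (score intercept) and
# discrimination (grade slope) as distinct sources of score variance.
# All numbers below are illustrative assumptions, not study values;
# the sds are scaled so the variance components come out roughly
# 34% / 7% / 3% / 2% / 54% of total score variance.
import numpy as np

rng = np.random.default_rng(42)
n_cand, n_stat, n_exam = 1000, 16, 60

ability    = rng.normal(0.0, 1.00, n_cand)   # latent candidate ability
cand_eff   = rng.normal(0.0, 0.17, n_cand)   # candidate score effect (~3%)
stat_eff   = rng.normal(0.0, 0.14, n_stat)   # station difficulty (~2%)
stringency = rng.normal(0.0, 0.58, n_exam)   # examiner intercepts (~34%)
slope      = rng.normal(1.0, 0.26, n_exam)   # examiner grade slopes (~7%)

# Random examiner allocation; global grades (1-5) driven by ability.
exam_id = rng.integers(0, n_exam, size=(n_cand, n_stat))
grade = np.clip(np.round(3.0 + ability[:, None]
                         + rng.normal(0.0, 0.5, (n_cand, n_stat))), 1, 5)
g = grade - 3.0                              # centred grades

eps = rng.normal(0.0, 0.73, (n_cand, n_stat))  # residual (~54%)
score_obs  = (stringency[exam_id] + slope[exam_id] * g
              + cand_eff[:, None] + stat_eff[None, :] + eps)
# Counterfactual: same performance scored by an "average" examiner.
score_fair = g + cand_eff[:, None] + stat_eff[None, :] + eps

cut = 0.0                                    # illustrative pass mark
pass_obs, pass_fair = score_obs >= cut, score_fair >= cut
fp = np.mean(pass_obs & ~pass_fair)          # passed in error, per station
fn = np.mean(~pass_obs & pass_fair)          # failed in error, per station

# Exam level: average station scores before applying the same cut.
exam_obs  = score_obs.mean(axis=1)  >= cut
exam_fair = score_fair.mean(axis=1) >= cut
print(f"station-level FP/FN: {fp:.3f} / {fn:.3f}")
print(f"exam-level    FP/FN: {np.mean(exam_obs & ~exam_fair):.3f} / "
      f"{np.mean(~exam_obs & exam_fair):.3f}")

Running this prints station-level and exam-level false positive/negative rates against the average-examiner counterfactual. The qualitative pattern matches the abstract's message: non-trivial misclassification within individual stations that shrinks substantially once scores are averaged across many stations, and hence across many independently drawn examiners.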