Smith Samantha Eve, McColgan-Smith Scott, Stewart Fiona, Mardon Julie, Tallentire Victoria Ruth
Centre for Medical Education, University of Dundee, Dundee, UK.
NHS Education for Scotland, Glasgow, UK.
Adv Simul (Lond). 2024 Dec 31;9(1):55. doi: 10.1186/s41077-024-00329-9.
Behavioural marker systems are used across several healthcare disciplines to assess behavioural (non-technical) skills, but rater training is variable and inter-rater reliability is generally poor. Inter-rater reliability provides data about the tool, but not about the competence of individual raters. This study aimed to test the inter-rater reliability of a new behavioural marker system (PhaBS - pharmacists' behavioural skills) with clinically experienced faculty raters and near-peer raters. It also aimed to evaluate individual rater competence after brief familiarisation with PhaBS, across five domains: completeness, agreement with an expert rater, ability to rank performance, stringency or leniency, and avoidance of the halo effect.
Clinically experienced faculty raters and near-peer raters attended a 30-min PhaBS familiarisation session, immediately followed by a marking session in which they rated a trainee pharmacist's behavioural skills in three scripted immersive acute care simulated scenarios demonstrating good, mediocre, and poor performances respectively. Inter-rater reliability in each group was calculated using the two-way random-effects, absolute-agreement, single-measures intra-class correlation coefficient (ICC). Differences in individual rater competence in each domain were compared using Pearson's chi-squared test.
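The ICC variant described here corresponds to ICC(2,1) in Shrout and Fleiss's taxonomy (two-way random effects, absolute agreement, single measures). The study's raw ratings are not reproduced in the abstract, so the sketch below uses the classic illustrative data from Shrout & Fleiss (1979) — 6 rated targets, 4 judges — purely to show the calculation; it is not the study's data.

```python
def icc_2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single
    measures (Shrout & Fleiss 1979). `scores` is a list of rows, one
    per rated performance, each containing one score per rater."""
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]
    # Mean squares from the two-way ANOVA decomposition.
    msr = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)   # rows (targets)
    msc = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)   # columns (raters)
    sst = sum((x - grand) ** 2 for row in scores for x in row)
    sse = sst - msr * (n - 1) - msc * (k - 1)
    mse = sse / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Illustrative data from Shrout & Fleiss (1979): 6 targets, 4 judges.
data = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(round(icc_2_1(data), 2))  # 0.29, the published value for this data
```

Because ICC(2,1) penalises absolute disagreement (not just inconsistent ranking), a systematically lenient or stringent rater lowers it — which is why it suits behavioural marker systems where the score itself matters.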
The ICC for experienced faculty raters was good at 0.60 (0.48-0.72) and for near-peer raters was poor at 0.38 (0.27-0.54). Of experienced faculty raters, 5/9 were competent in all domains versus 2/13 near-peer raters (difference not statistically significant). There was no statistically significant difference between the abilities of clinically experienced versus near-peer raters in agreement with an expert rater, ability to rank performance, stringency or leniency, or avoidance of the halo effect. The only statistically significant difference between groups was ability to complete the assessment (9/9 experienced faculty raters versus 6/13 near-peer raters, p = 0.0077).
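The reported p = 0.0077 for assessment completion is consistent with an uncorrected Pearson chi-squared test on the 2×2 completion table (9 of 9 faculty versus 6 of 13 near-peers) — an assumption about the exact test layout, since the abstract does not state whether a continuity correction was applied. A minimal check in plain Python:

```python
import math

def pearson_chi2_2x2(a, b, c, d):
    """Pearson chi-squared test (no continuity correction) for the
    2x2 table [[a, b], [c, d]]; returns (statistic, p-value), df = 1."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom the chi-squared survival function is
    # P(X > x) = erfc(sqrt(x / 2)).
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# Completed vs not completed: faculty 9/0, near-peers 6/7.
chi2, p = pearson_chi2_2x2(9, 0, 6, 7)
print(f"chi2 = {chi2:.3f}, p = {p:.4f}")  # p rounds to 0.0077
```

With only 22 raters some expected cell counts fall below 5, so Fisher's exact test would be a common alternative here; the chi-squared result nevertheless matches the reported value.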
Experienced faculty have acceptable inter-rater reliability when using PhaBS, consistent with other behavioural marker systems; however, not all raters are competent. Competence measures developed for other assessments can be helpfully applied to behavioural marker systems, and educators who use behavioural marker systems for assessment must adopt such rater competence frameworks. This is important to ensure fair and accurate assessments for learners, to provide educators with information about rater training programmes, and to provide individual raters with meaningful feedback.