Education Department, Royal College of Physicians of London, 11 St Andrews Place, Regent’s Park, London, UK.
Med Educ. 2011 Aug;45(8):843-8. doi: 10.1111/j.1365-2923.2011.03996.x.
Multi-source feedback (MSF) provides a window into complex areas of performance in real workplace settings. However, because MSF elicits subjective judgements, many respondents are needed to achieve a reliable assessment. Optimising the consistency with which questions are interpreted should therefore improve reliability.
We compared two parallel forms of an MSF instrument with identical wording and administration procedures. The original instrument contained 10 compound performance items and was used 12,540 times to assess 977 doctors, including 112 general practitioners (GPs). The modified instrument contained the same wording in 21 non-compound items, each of which asked about a single aspect of performance, and was used 2789 times to assess 205 doctors, all of whom were GPs. Generalisability analysis evaluated questionnaire reliability. The reliability of the original instrument was evaluated for both the whole group and the GP subgroup.
The two instruments yielded similar numbers of responses per doctor, but the modified instrument generated more reliable scores. The whole-group comparison examined precision, measured as the standard error of measurement (SEM): seven respondents were sufficient to achieve a 95% confidence interval of 0.25 (on a 4-point scale) with the modified instrument, compared with 10 respondents with the original instrument. The subgroup comparison examined the generalisability coefficient: 15 responses gave a reliability of 0.72 with the modified instrument, compared with 0.58 with the original instrument.
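The relationship between the number of respondents and score reliability reported above can be illustrated with two standard formulae from classical test and generalisability theory: the SEM shrinks with the square root of the number of respondents, and the Spearman-Brown prophecy projects the reliability achieved by averaging over n respondents. The sketch below is an illustration of these general formulae, not a reproduction of the authors' generalisability analysis; the single-respondent coefficients passed in are hypothetical values chosen only to show the mechanics.

```python
import math

def sem_for_n(single_rater_error_sd, n):
    # The SEM of a mean score falls with the square root of the
    # number of respondents contributing to it.
    return single_rater_error_sd / math.sqrt(n)

def ci_halfwidth_95(single_rater_error_sd, n):
    # Half-width of a 95% confidence interval around the mean score.
    return 1.96 * sem_for_n(single_rater_error_sd, n)

def spearman_brown(single_rater_g, n):
    # Projected reliability (generalisability coefficient) when
    # averaging over n respondents, given the single-respondent
    # coefficient single_rater_g.
    return n * single_rater_g / (1 + (n - 1) * single_rater_g)

# Hypothetical single-respondent coefficients: with these values,
# 15 responses project to roughly the reliabilities reported in the
# results (0.58 for the original and 0.72 for the modified instrument).
print(round(spearman_brown(0.0843, 15), 2))
print(round(spearman_brown(0.1463, 15), 2))
```

Inverting the Spearman-Brown formula in this way (solving for the single-respondent coefficient that reproduces a reported reliability at a given n) is a common way to compare how quickly two instruments accumulate precision as respondents are added.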
Non-compound questions improved the consistency of scores. We recommend that compound questions be avoided in assessment instrument design.