School of Computing and Mathematical Sciences, Auckland University of Technology, Auckland, New Zealand.
Med Educ. 2010 Apr;44(4):367-78. doi: 10.1111/j.1365-2923.2009.03606.x.
There is growing interest in multi-source, multi-level feedback for measuring the performance of health care professionals. However, data are often unbalanced (e.g. there are different numbers of raters for each doctor), uncrossed (e.g. raters rate the doctor on only one occasion) and fully nested (e.g. raters for a doctor are unique to that doctor). Estimating the true score variance among doctors under these conditions has proved challenging.
Extensions to reliability and generalisability (G) formulae are introduced to handle unbalanced, uncrossed and fully nested data to produce coefficients that take into account variances among raters, ratees and questionnaire items at different levels of analysis. Decision (D) formulae are developed to handle predictions of minimum numbers of raters for unbalanced studies. An artificial dataset and two real-world datasets consisting of colleague and patient evaluations of doctors are analysed to demonstrate the feasibility and relevance of the formulae. Another independent dataset is used for validating D predictions of G coefficients for varying numbers of raters against actual G coefficients. A combined G coefficient formula is introduced for estimating multi-sourced reliability.
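As an illustration of the kind of D prediction described above, the following is a minimal sketch that assumes the textbook single-facet projection for raters nested within ratees, G(n) = var_ratee / (var_ratee + var_rater / n). It is not the authors' extended formulae, and the function names and the 0.8 target are hypothetical choices made only for this example.

```python
def projected_g(var_ratee, var_rater, n_raters):
    """Projected G coefficient for the mean of n_raters raters nested in a ratee.

    Assumption: textbook projection, not the paper's extended D formulae.
    """
    if n_raters < 1:
        raise ValueError("n_raters must be at least 1")
    return var_ratee / (var_ratee + var_rater / n_raters)


def min_raters(var_ratee, var_rater, target_g=0.8):
    """Smallest number of raters whose projected G reaches target_g.

    A simple search is used instead of the closed-form solution to avoid
    floating-point rounding at the boundary.
    """
    if var_ratee <= 0 or var_rater < 0:
        raise ValueError("var_ratee must be positive and var_rater non-negative")
    if not 0 < target_g < 1:
        raise ValueError("target_g must lie strictly between 0 and 1")
    n = 1
    while projected_g(var_ratee, var_rater, n) < target_g:
        n += 1
    return n


if __name__ == "__main__":
    # Hypothetical variance components, chosen only to exercise the functions.
    print(min_raters(var_ratee=0.25, var_rater=1.0, target_g=0.8))
```

Under this projection, the required number of raters grows with the ratio of rater variance to ratee variance, which is why separating the two sources of variance matters before any prediction is made.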
The results from the formulae indicate that it is possible to estimate reliability and generalisability in unbalanced, fully nested and uncrossed studies, and to identify extraneous variance that can be removed to estimate true score variance among doctors. The validation results show that it is possible to predict the minimum numbers of raters even if the study is unbalanced.
Calculating G and D coefficients for psychometric data based on feedback on doctor performance is possible even when the data are unbalanced, uncrossed and fully nested, provided that: (i) variances are separated at the rater and ratee levels, and (ii) the average number of raters per ratee is used in calculations for deriving these coefficients.
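The two conditions above can be sketched in code. The example below is a minimal illustration, assuming a standard one-way random-effects ANOVA estimator for raters nested within ratees and assuming each rater's item responses have already been averaged into a single score. It is not the authors' formulae; the function names and the use of the usual unbalanced-design coefficient n0 for variance estimation are assumptions of this sketch.

```python
from statistics import mean


def nested_variance_components(ratings):
    """Estimate (var_ratee, var_rater) from unbalanced, fully nested ratings.

    ratings: dict mapping each ratee to a list of scores, one per unique rater.
    Assumption: one-way random-effects ANOVA, not the paper's extended method.
    """
    groups = [scores for scores in ratings.values() if scores]
    k = len(groups)
    total_n = sum(len(g) for g in groups)
    if k < 2 or total_n <= k:
        raise ValueError("need at least two ratees and some ratee with two or more raters")

    grand_mean = sum(sum(g) for g in groups) / total_n
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (total_n - k)

    # n0 replaces the common group size when group sizes differ.
    n0 = (total_n - sum(len(g) ** 2 for g in groups) / total_n) / (k - 1)

    var_rater = ms_within                                # rater-level variance
    var_ratee = max(0.0, (ms_between - ms_within) / n0)  # ratee-level variance
    return var_ratee, var_rater


def g_coefficient(ratings):
    """G coefficient using the average number of raters per ratee."""
    var_ratee, var_rater = nested_variance_components(ratings)
    groups = [scores for scores in ratings.values() if scores]
    n_bar = sum(len(g) for g in groups) / len(groups)
    denominator = var_ratee + var_rater / n_bar
    return var_ratee / denominator if denominator > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical scores for three doctors with 3, 2 and 4 unique raters.
    example = {"A": [4.0, 5.0, 3.0], "B": [2.0, 3.0], "C": [5.0, 5.0, 4.0, 4.0]}
    print(nested_variance_components(example))
    print(g_coefficient(example))
```

Negative estimates of ratee variance are truncated to zero here, a common convention; a fuller treatment would also separate item-level variance, which this sketch leaves out.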