Wu Salene M, Whiteside Ursula, Neighbors Clayton
University of Washington, Seattle, Washington 98195, USA.
Cogn Behav Ther. 2007;36(4):230-9. doi: 10.1080/16506070701584367.
Inter-rater reliability and accuracy are measures of rater performance. Inter-rater reliability is frequently used as a substitute for accuracy despite conceptual differences and literature suggesting important differences between them. The aims of this study were to compare inter-rater reliability and accuracy among a group of raters, using a treatment adherence scale, and to assess factors affecting the reliability of these ratings. Paired undergraduate raters assessed therapist behavior by viewing videotapes of 4 therapists' cognitive behavioral therapy sessions. Ratings were compared with expert-generated criterion ratings and between raters using the intraclass correlation coefficient, ICC(2,1). Inter-rater reliability was marginally higher than accuracy (p = 0.09). The specific therapist significantly affected inter-rater reliability and accuracy. The frequency and intensity of the therapists' ratable behaviors, as recorded in the criterion ratings, correlated only with rater accuracy. Consensus ratings were more accurate than individual ratings, but composite ratings were not more accurate than consensus ratings. In conclusion, accuracy cannot be assumed to exceed inter-rater reliability or vice versa, and both are influenced by multiple factors. In this study, the subject of the ratings (i.e., the therapist and the intensity and frequency of rated behaviors) was shown to influence inter-rater reliability and accuracy. The additional resources needed for a composite rating, a rating based on the average score of paired raters, may be justified by improved accuracy over individual ratings. The additional time required to arrive at a consensus rating, a rating generated following discussion between 2 raters, may not be warranted. Further research is needed to determine whether these findings hold true with other raters and treatment adherence scales.
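As an illustration of the statistic named in the abstract, the following is a minimal sketch of ICC(2,1) (two-way random effects, absolute agreement, single rater) in the Shrout and Fleiss formulation. It is not the authors' analysis code. The function name `icc_2_1`, the item scores, and the way reliability and accuracy are paired here (rater versus rater, and rater versus criterion) are assumptions made for the example; the abstract does not give the study's data or computational details.

```python
def icc_2_1(ratings):
    """ICC(2,1) for a table with one row per rated item and one column per rater.

    Uses the two-way ANOVA mean squares (Shrout & Fleiss, 1979, Case 2).
    """
    n = len(ratings)        # number of rated items (targets)
    k = len(ratings[0])     # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between items
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_error = ss_total - ss_rows - ss_cols                  # residual

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical item scores, invented purely for illustration.
rater_a = [3, 5, 2, 6, 4, 1]
rater_b = [4, 5, 1, 5, 3, 2]
criterion = [3, 6, 2, 6, 5, 1]   # stands in for an expert-generated rating

# Composite rating: the average score of the paired raters.
composite = [(a + b) / 2 for a, b in zip(rater_a, rater_b)]

# Inter-rater reliability: agreement between the two raters.
reliability = icc_2_1([list(p) for p in zip(rater_a, rater_b)])
# Accuracy: agreement between a rating and the criterion rating.
accuracy_a = icc_2_1([list(p) for p in zip(rater_a, criterion)])
accuracy_composite = icc_2_1([list(p) for p in zip(composite, criterion)])

print(f"reliability (A vs B):        {reliability:.3f}")
print(f"accuracy (A vs criterion):   {accuracy_a:.3f}")
print(f"accuracy (composite):        {accuracy_composite:.3f}")
```

Because ICC(2,1) measures absolute agreement, a rater who ranks items in the same order as the criterion but scores consistently higher or lower is penalized, which is one reason reliability and accuracy can diverge.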