Kobak Kenneth A, Brown Brianne, Sharp Ian, Levy-Mack Hollie, Wells Kurrie, Ockun Felice, Williams Janet B W
MedAvante Research Institute, Madison, WI, USA.
J Clin Psychopharmacol. 2009 Feb;29(1):82-5. doi: 10.1097/JCP.0b013e318192e4d7.
Good interrater reliability is essential to minimize error variance and improve study power. Reasons why raters differ in scoring the same patient include information variance (different information obtained because of asking different questions), observation variance (the same information is obtained, but raters differ in what they notice and remember), interpretation variance (differences in the significance attached to what is observed), criterion variance (different criteria used to score items), and subject variance (true differences in the subject). We videotaped and transcribed 30 pairs of interviews to examine the most common sources of rater unreliability.
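In this framework, interrater reliability is the share of observed-score variance attributable to true differences among subjects; the four rater-driven sources above all inflate the error term. As a minimal sketch, assuming a standard one-way random-effects model (the abstract does not state which model the study used):

```latex
% Score of subject i by rater j: true subject effect plus rater error.
% Information, observation, interpretation, and criterion variance all
% load on the error term e_{ij}; subject variance sigma_s^2 is the signal.
\[
X_{ij} = \mu + s_i + e_{ij},
\qquad
\mathrm{ICC} = \frac{\sigma_s^2}{\sigma_s^2 + \sigma_e^2}.
\]
```

Anything that shrinks the error variance relative to the subject variance, such as calibration training, pushes the ICC toward 1 and improves study power at a fixed sample size.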
Thirty patients with depression were independently interviewed by 2 different raters on the same day. Raters provided rationales for their scoring, and independent assessors reviewed the rationales, the interview transcripts, and the videotapes to code the main reason for each discrepancy. One third of the interviews were conducted by raters who had never administered the Hamilton Depression Rating Scale; one third, by raters who were experienced but not calibrated; and one third, by raters who were both experienced and calibrated.
Experienced and calibrated raters had the highest interrater reliability (intraclass correlation coefficient [ICC] = 0.93), followed by inexperienced raters (ICC = 0.77) and experienced but uncalibrated raters (ICC = 0.55). The most common source of disagreement was interpretation variance (39%), followed by information variance (30%), criterion variance (27%), and observation variance (4%). Experienced and calibrated raters showed significantly less criterion variance than the other two cohorts (P = 0.001).
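The abstract does not specify which ICC form was computed; for illustration, the sketch below implements the common two-way random-effects, single-rater form, ICC(2,1) of Shrout and Fleiss, on simulated paired ratings. The data, variable names, and parameter values here are assumptions for demonstration, not the study's data.

```python
import numpy as np

def icc_2_1(x: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    x: (n_subjects, n_raters) matrix of scores, e.g. paired HAM-D totals.
    """
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)   # per-subject means
    col_means = x.mean(axis=0)   # per-rater means
    msr = k * np.sum((row_means - grand) ** 2) / (n - 1)  # between subjects
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)  # between raters
    resid = x - row_means[:, None] - col_means[None, :] + grand
    mse = np.sum(resid ** 2) / ((n - 1) * (k - 1))        # residual
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Illustrative only: 30 subjects each rated by 2 raters, as in the study design.
rng = np.random.default_rng(0)
true_severity = rng.normal(18, 5, size=30)                   # subject variance
scores = true_severity[:, None] + rng.normal(0, 2, (30, 2))  # rater error
print(round(icc_2_1(scores), 2))  # small error variance -> ICC near 0.9
```

Calibration raises the ICC by shrinking the rater-error terms in the denominator, which is consistent with the ordering of the three cohorts reported above.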
Reasons for disagreement varied with level of experience and calibration. Experienced but uncalibrated raters should focus on establishing common scoring conventions, whereas experienced and calibrated raters should focus on fine-tuning judgment calls at symptom thresholds. Calibration training seems to improve reliability beyond experience alone: experienced raters without cohort calibration had lower reliability than inexperienced raters.