Josephson S Andrew, Hills Nancy K, Johnston S Claiborne
Stroke Sciences Group, Department of Neurology, University of California, San Francisco, CA 94143-0114, USA.
Cerebrovasc Dis. 2006;22(5-6):389-95. doi: 10.1159/000094857. Epub 2006 Aug 4.
The NIH Stroke Scale (NIHSS) is widely used in stroke clinical care and trials. Certification in its use, most commonly through rating of video vignettes, is routinely required. To investigate the reliability of the NIHSS in a representative sample of raters, we examined the results of the most frequently used certification examination.
At the invitation of the National Stroke Association, we analyzed the results of all raters who completed one of two multiple-patient videotaped certification examinations from 1998 to 2004. Total scores for each vignette were calculated, and ratings were compared based on percentile of responses and modified kappa scores.
There were 7,405 unique raters with 38,148 individual NIHSS item responses; median scores for each vignette ranged from 0 to 31. Total NIHSS scores varied widely between raters; for 7 of the 11 patients (64%), the total NIHSS score differed by four or more points between the 5th and 95th percentiles. The aphasia (kappa = 0.60) and facial palsy (kappa = 0.65) items contributed most to the variance in the total NIHSS score. Nurses agreed with the most common response more frequently than physicians did (p < 0.0001). Taking the certification examination multiple times did not improve agreement.
In a large, diverse sample of clinicians, inter-rater reliability for individual elements of the NIHSS on videotaped vignettes was generally good, but overall scoring was inconsistent and could affect clinical trial results. Whether additional training, modification of examination elements, or clearer definitions for scoring could improve reliability requires further study.
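Two of the comparisons reported above, the spread of total scores between the 5th and 95th percentiles and agreement with the most common response, can be illustrated with a minimal Python sketch. The function names and the choice of the "inclusive" interpolation method are assumptions made for illustration; the abstract does not specify how percentiles were computed, and the modified kappa statistic is not reproduced here because its definition is not given.

```python
from collections import Counter
from statistics import quantiles


def percentile_spread(total_scores, low=5, high=95):
    """Return (low percentile, high percentile, difference) for one vignette.

    Assumption: percentiles are linearly interpolated ("inclusive" method);
    the source does not state which method was used.
    """
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99
    cuts = quantiles(total_scores, n=100, method="inclusive")
    p_low, p_high = cuts[low - 1], cuts[high - 1]
    return p_low, p_high, p_high - p_low


def modal_agreement(item_responses):
    """Return the fraction of raters who gave the most common response."""
    counts = Counter(item_responses)
    _, modal_count = counts.most_common(1)[0]
    return modal_count / len(item_responses)


if __name__ == "__main__":
    # Hypothetical ratings, not data from the study
    vignette_totals = [4, 5, 5, 6, 6, 6, 7, 7, 8, 10]
    facial_palsy_item = [1, 1, 2, 1, 1, 0, 1, 2, 1, 1]

    p5, p95, spread = percentile_spread(vignette_totals)
    print(f"5th-95th percentile spread: {spread:.2f} points")
    print(f"agreement with modal response: {modal_agreement(facial_palsy_item):.0%}")
```

Under this sketch, a vignette would count toward the "four or more point difference" group when the returned spread is at least 4.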