Department of Neurosurgery, Barrow Neurological Institute, St. Joseph's Hospital and Medical Center, Phoenix, Arizona.
J Neurosurg. 2017 May;126(5):1714-1719. doi: 10.3171/2016.3.JNS153044. Epub 2016 Jul 1.
OBJECTIVE The goal of this study was to determine the interrater and intrarater reliability of the Knosp grading scale for predicting pituitary adenoma cavernous sinus (CS) involvement.
METHODS Six independent raters (3 neurosurgery residents, 2 pituitary surgeons, and 1 neuroradiologist) participated in the study. Each rater scored 50 unique pituitary MRI scans (with contrast) of biopsy-proven pituitary adenoma. Reliabilities for the full scale were determined 3 ways: 1) using all 50 scans, 2) using scans with midrange scores versus end scores, and 3) using a dichotomized scale that reflects common clinical practice. The performance of resident raters was compared with that of faculty raters to assess the influence of training level on reliability.
RESULTS Overall, the interrater reliability of the Knosp scale was "strong" (0.73, 95% CI 0.56-0.84). However, the percent agreement for all 6 reviewers was only 10% (26% for faculty members, 30% for residents). The reliability of the middle scores (i.e., average rated Knosp Grades 1 and 2) was "very weak" (0.18, 95% CI -0.27 to 0.56) and the percent agreement for all reviewers was only 5%. When the scale was dichotomized into tumors unlikely to have intraoperative CS involvement (Grades 0, 1, and 2) and those likely to have CS involvement (Grades 3 and 4), the reliability was "strong" (0.60, 95% CI 0.39-0.75) and the percent agreement for all raters improved to 60%. There was no significant difference in reliability between residents and faculty (residents 0.72, 95% CI 0.55-0.83 vs faculty 0.73, 95% CI 0.56-0.84). Intrarater reliability was moderate to strong and increased with the level of experience.
CONCLUSIONS Although these findings suggest that the Knosp grading scale has acceptable interrater reliability overall, they raise important questions about the "very weak" reliability of the scale's middle grades. By dichotomizing the scale into clinically useful groups, the authors were able to address the poor reliability and percent agreement of the intermediate grades and to isolate the most important grades for use in surgical decision making (Grades 3 and 4). Authors of future pituitary surgery studies should consider reporting Knosp grades as dichotomized results rather than as the full scale to optimize the reliability of the scale.
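As an illustration of why dichotomizing can raise percent agreement, the following is a minimal Python sketch, not the authors' analysis. It assumes that "percent agreement" means the share of scans on which every rater gave the same score. The function names and the 5-scan, 6-rater score table are hypothetical and are not study data. The abstract does not name the reliability statistic behind the 0.73, 0.18, and 0.60 values, so the sketch covers percent agreement only.

```python
from typing import List


def dichotomize(grade: int) -> int:
    """Collapse a Knosp grade into the two clinical groups from the abstract.

    Grades 0, 1, 2 -> 0 (CS involvement unlikely)
    Grades 3, 4    -> 1 (CS involvement likely)
    """
    if grade not in (0, 1, 2, 3, 4):
        raise ValueError(f"Knosp grade must be 0-4, got {grade!r}")
    return 1 if grade >= 3 else 0


def percent_agreement(ratings: List[List[int]]) -> float:
    """Percentage of scans (rows) on which all raters (columns) agree exactly.

    Assumption: unanimous agreement across raters, which may differ from
    the definition used in the study.
    """
    if not ratings:
        raise ValueError("ratings must contain at least one scan")
    unanimous = sum(1 for scan in ratings if len(set(scan)) == 1)
    return 100.0 * unanimous / len(ratings)


# Hypothetical scores: one row per scan, one column per rater.
scores = [
    [0, 0, 0, 0, 0, 0],  # unanimous on the full scale
    [1, 2, 1, 2, 2, 1],  # raters split between Grades 1 and 2
    [2, 1, 2, 3, 2, 2],  # one rater crosses the Grade 2/3 boundary
    [3, 3, 4, 3, 3, 4],  # raters split between Grades 3 and 4
    [4, 4, 4, 4, 4, 4],  # unanimous on the full scale
]

full_scale = percent_agreement(scores)
dichotomized = percent_agreement(
    [[dichotomize(g) for g in scan] for scan in scores]
)

print(f"Full scale:   {full_scale:.0f}%")
print(f"Dichotomized: {dichotomized:.0f}%")
```

In this toy table, disagreements within a group (Grade 1 vs 2, or Grade 3 vs 4) disappear after dichotomizing, while the scan on which one rater crosses the Grade 2/3 boundary remains a disagreement.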