Gur David, Bandos Andriy I, King Jill L, Klym Amy H, Cohen Cathy S, Hakim Christiane M, Hardesty Lara A, Ganott Marie A, Perrin Ronald L, Poller William R, Shah Ratan, Sumkin Jules H, Wallace Luisa P, Rockette Howard E
Department of Radiology, University of Pittsburgh, Pittsburgh, Pennsylvania 15213, USA.
Med Phys. 2008 Oct;35(10):4404-9. doi: 10.1118/1.2977766.
The authors investigated radiologists' performance during retrospective interpretation of screening mammograms when making a binary decision (whether or not to recall a woman for additional procedures) and compared it with their receiver operating characteristic (ROC) type performance curves obtained using a semi-continuous rating scale. Under an Institutional Review Board-approved protocol, nine experienced radiologists independently rated an enriched set of 155 examinations that they had not personally read in the clinic, mixed with other enriched sets of examinations that they had individually read in the clinic, using both a screening BI-RADS rating scale (recall/not recall) and a semi-continuous ROC-type rating scale (0 to 100). The vertical distance between the empirical ROC curve and the binary operating point, namely the difference in sensitivity at the same specificity, was computed for each reader. The vertical distance averaged over all readers was used to assess the proximity of the performance levels under the binary and ROC-type rating scales. There was no apparent systematic tendency for readers to perform better with either of the two rating approaches: four readers performed better using the semi-continuous rating scale, four performed better with the binary scale, and one reader's operating point lay exactly on the empirical ROC curve. Only one of the nine readers had a binary "operating point" that was statistically significantly distant from the same reader's empirical ROC curve. Reader-specific differences ranged from -0.046 to 0.128, with an average width of the corresponding 95% confidence intervals of 0.2 and p-values for individual readers ranging from 0.050 to 0.966. On average, radiologists performed similarly when using the two rating scales, in that the average distance between the individual readers' binary operating points and their ROC curves was close to zero.
The 95% confidence interval for the fixed-reader average (0.016) was (-0.0206, 0.0631) (two-sided p-value 0.35). In conclusion, the authors found that, in retrospective observer performance studies, the use of a binary response or a semi-continuous rating scale led to consistent results in terms of performance as measured by sensitivity-specificity operating points.
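The vertical distance described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the toy data, the sign convention (ROC sensitivity minus binary sensitivity), and the choice to take the upper end of a vertical ROC segment are all assumptions made for illustration. The sketch builds a reader's empirical ROC curve from 0 to 100 ratings, locates the curve's sensitivity at the false-positive fraction of the reader's binary recall decisions, and returns the difference.

```python
import numpy as np

def empirical_roc(ratings, truth):
    """Empirical ROC vertices (FPF, TPF), from the strictest to the most lenient threshold."""
    ratings = np.asarray(ratings, dtype=float)
    truth = np.asarray(truth, dtype=bool)   # assumes both classes are present
    fpf, tpf = [0.0], [0.0]
    for t in np.unique(ratings)[::-1]:      # descending thresholds
        called = ratings >= t               # "positive" if rating >= threshold
        tpf.append(called[truth].mean())
        fpf.append(called[~truth].mean())
    return np.array(fpf), np.array(tpf)

def vertical_distance(ratings, recalls, truth):
    """ROC sensitivity minus binary sensitivity at the binary point's specificity.

    Sign convention and tie handling are assumptions, not taken from the paper.
    """
    truth = np.asarray(truth, dtype=bool)
    recalls = np.asarray(recalls, dtype=bool)
    sens_binary = recalls[truth].mean()     # sensitivity of recall decisions
    fpf_binary = recalls[~truth].mean()     # 1 - specificity of recall decisions

    fpf, tpf = empirical_roc(ratings, truth)
    # Last vertex whose FPF does not exceed the binary point's FPF.
    j = np.searchsorted(fpf, fpf_binary, side="right") - 1
    if fpf[j] == fpf_binary:
        # On a vertex or vertical segment: use its upper end (assumption).
        sens_roc = tpf[j]
    else:
        # Between vertices: linear interpolation along the empirical curve.
        w = (fpf_binary - fpf[j]) / (fpf[j + 1] - fpf[j])
        sens_roc = tpf[j] + w * (tpf[j + 1] - tpf[j])
    return sens_roc - sens_binary

# Hypothetical toy data for one reader (not from the study).
truth = [1, 1, 1, 1, 0, 0, 0, 0]
ratings = [90, 80, 60, 30, 70, 40, 20, 10]
recalls = [1, 1, 0, 0, 1, 0, 0, 0]
distance = vertical_distance(ratings, recalls, truth)  # 0.25 for this toy data
```

Under this convention, a positive value means the ROC curve lies above the binary operating point at that specificity. Averaging the per-reader values gives a quantity analogous to the reader-averaged distance in the abstract; the confidence intervals and p-values reported there require a variance estimate that this sketch does not attempt.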