Department of Statistics, University College Cork, Cork, Ireland.
Med Phys. 2010 Apr;37(4):1788-95. doi: 10.1118/1.3352687.
The authors examine potential bias when using a reference reader panel as a "gold standard" for estimating operating characteristics of CAD algorithms for detecting lesions. As an alternative, the authors propose latent class analysis (LCA), which does not require an external gold standard to evaluate diagnostic accuracy.
A binomial model for multiple reader detections using different diagnostic protocols was constructed, assuming conditional independence of readings given true lesion status. Operating characteristics of all protocols were estimated by maximum likelihood LCA. Reader panel and LCA based estimates were compared using data simulated from the binomial model for a range of operating characteristics. LCA was applied to 36 thin-section thoracic computed tomography data sets from the Lung Image Database Consortium (LIDC): free-search markings of four radiologists were compared to markings from four different CAD assisted radiologists. For real data, bootstrap-based resampling methods, which accommodate dependence in reader detections, are proposed to test hypotheses of differences between detection protocols.
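As a rough illustration of maximum likelihood LCA under conditional independence, the sketch below fits a two-class model (lesion / non-lesion) to a binary matrix of reader markings with an EM algorithm. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `fit_lca`, the starting values, and the choice to give each reader its own sensitivity and specificity are all hypothetical, and the abstract does not say which optimizer was used.

```python
import numpy as np

def fit_lca(X, n_iter=500, tol=1e-8):
    """Two-class latent class model fitted by EM (illustrative sketch).

    X is an (n_candidates, n_readers) array of 0/1 markings.
    Readings are assumed conditionally independent given true status.
    Returns (prevalence, sensitivity per reader, specificity per reader).
    """
    X = np.asarray(X, dtype=float)
    n_readers = X.shape[1]
    eps = 1e-6
    # Assumed starting values; they label the high-marking class as "lesion".
    prev = 0.5
    sens = np.full(n_readers, 0.7)
    spec = np.full(n_readers, 0.9)
    last_ll = -np.inf
    for _ in range(n_iter):
        # E-step: posterior probability that each candidate is a true lesion.
        log_p1 = np.log(prev) + (X * np.log(sens)
                                 + (1 - X) * np.log(1 - sens)).sum(axis=1)
        log_p0 = np.log(1 - prev) + (X * np.log(1 - spec)
                                     + (1 - X) * np.log(spec)).sum(axis=1)
        ll_i = np.logaddexp(log_p1, log_p0)
        w = np.exp(log_p1 - ll_i)
        # M-step: weighted proportions, clipped to keep the logs finite.
        prev = np.clip(w.mean(), eps, 1 - eps)
        sens = np.clip((w[:, None] * X).sum(axis=0) / w.sum(), eps, 1 - eps)
        spec = np.clip(((1 - w)[:, None] * (1 - X)).sum(axis=0)
                       / (1 - w).sum(), eps, 1 - eps)
        # Stop once the log-likelihood no longer improves.
        ll = ll_i.sum()
        if ll - last_ll < tol:
            break
        last_ll = ll
    return prev, sens, spec
```

Calling `fit_lca(X)` on a markings matrix returns estimates without any external truth labels. The two classes are only identified up to relabeling, so the starting values above are what tie class 1 to "lesion".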
In simulation studies, reader panel based sensitivity estimates had an average relative bias (ARB) of -23% to -27%, significantly larger in magnitude (p-value < 0.0001) than that of LCA (ARB -2% to -6%). Specificity was well estimated by both the reader panel (ARB -0.6% to -0.5%) and LCA (ARB 1.4%-0.5%). Among the 1145 lesion candidates LIDC considered, the LCA-estimated sensitivity of reference readers (55%) was significantly lower (p-value 0.006) than that of CAD assisted readers (68%). The average number of false positives per patient for reference readers (0.95) was not significantly lower (p-value 0.28) than that for CAD assisted readers (1.27).
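To see why a panel-derived "gold standard" can pull sensitivity estimates downward, the following sketch computes the apparent sensitivity of a test reader when truth is defined by a panel. It rests on assumptions the abstract does not state: the panel rule is taken to be "a candidate counts as a lesion if at least one panel reader marks it", all readers share the same sensitivity and specificity, readings are conditionally independent, and relative bias is taken as (estimate - truth) / truth. The function names and the input values in the usage note are hypothetical and are not the paper's simulation settings.

```python
def panel_apparent_sensitivity(prev, sens, spec, n_panel):
    """Apparent sensitivity of a test reader scored against a union-rule panel.

    prev: true lesion prevalence among candidates.
    sens, spec: operating characteristics shared by all readers (assumed).
    n_panel: number of panel readers defining the reference standard.
    """
    # Probability that the panel flags a candidate, by true status.
    flag_lesion = 1 - (1 - sens) ** n_panel
    flag_nonlesion = 1 - spec ** n_panel
    # Test reader marks among panel-flagged candidates (true and false lesions).
    marked = prev * flag_lesion * sens + (1 - prev) * flag_nonlesion * (1 - spec)
    flagged = prev * flag_lesion + (1 - prev) * flag_nonlesion
    return marked / flagged

def relative_bias(estimate, truth):
    """Relative bias of an estimate, as a fraction of the true value."""
    return (estimate - truth) / truth
```

For example, `relative_bias(panel_apparent_sensitivity(0.3, 0.6, 0.95, 4), 0.6)` is negative: panel false positives enter the denominator as "lesions" that the test reader rarely marks, which deflates the apparent sensitivity.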
Whereas a gold standard based on a consensus of readers may substantially bias sensitivity estimates, LCA may be a significantly more accurate and consistent means for evaluating diagnostic accuracy.