Institute for Clinical Research and Health Policy Studies at Tufts Medical Center, USA.
Brief Bioinform. 2011 May;12(3):189-202. doi: 10.1093/bib/bbq073. Epub 2011 Feb 7.
Proposed molecular classifiers may be overfit to idiosyncrasies of noisy genomic and proteomic data. Cross-validation methods are often used to estimate classification accuracy, but both simulations and case studies suggest that inappropriate methods may introduce bias. External (independent) validation avoids this bias and tests generalizability. We evaluated 35 studies that reported external validation of a molecular classifier. We extracted information on study design and methodological features, and compared the performance of molecular classifiers in internal cross-validation versus external validation for the 28 studies in which both had been performed. We demonstrate that the majority of studies pursued cross-validation practices that are likely to overestimate classifier performance. Most studies were markedly underpowered to detect a 20% decrease in sensitivity or specificity between internal cross-validation and external validation [median power was 36% (IQR, 21-61%) and 29% (IQR, 15-65%), respectively]. Median reported sensitivity and specificity were 94% and 98%, respectively, in cross-validation, versus 88% and 81% in independent validation. The relative diagnostic odds ratio was 3.26 (95% CI 2.04-5.21) for cross-validation versus independent validation. Finally, we reviewed all studies (n = 758) that cited those in our study sample, and identified only one instance of additional subsequent independent validation of these classifiers. In conclusion, these results document that many cross-validation practices employed in the literature are potentially biased, and genuine progress in this field will require adoption of routine external validation of molecular classifiers, preferably in much larger studies than in current practice.
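For readers unfamiliar with the summary measure, the sketch below shows how a diagnostic odds ratio (DOR) follows from a sensitivity/specificity pair, and how a relative DOR compares internal with external validation for a single study. It uses the standard definition of the DOR. The function names and the input values are hypothetical illustrations, not data or code from the study. The reported relative DOR of 3.26 is a summary across studies, so it should not be expected to follow from plugging the median sensitivities and specificities into these formulas.

```python
def diagnostic_odds_ratio(sensitivity: float, specificity: float) -> float:
    """Standard definition: DOR = [sens / (1 - sens)] * [spec / (1 - spec)]."""
    if not (0.0 < sensitivity < 1.0 and 0.0 < specificity < 1.0):
        # The DOR is undefined (zero or infinite) at exactly 0 or 1.
        raise ValueError("sensitivity and specificity must lie strictly between 0 and 1")
    return (sensitivity / (1.0 - sensitivity)) * (specificity / (1.0 - specificity))


def relative_dor(internal: tuple, external: tuple) -> float:
    """Ratio of the internal-validation DOR to the external-validation DOR.

    Each argument is a (sensitivity, specificity) pair. A value above 1
    means the classifier looked better in internal cross-validation.
    """
    return diagnostic_odds_ratio(*internal) / diagnostic_odds_ratio(*external)


if __name__ == "__main__":
    # Hypothetical single study (illustrative values, not taken from the paper).
    internal = (0.90, 0.90)  # sensitivity, specificity in cross-validation
    external = (0.80, 0.80)  # sensitivity, specificity in external validation
    print(f"internal DOR: {diagnostic_odds_ratio(*internal):.2f}")
    print(f"external DOR: {diagnostic_odds_ratio(*external):.2f}")
    print(f"relative DOR: {relative_dor(internal, external):.2f}")
```

Because the DOR is a product of odds, modest drops in sensitivity and specificity translate into large changes in the ratio, which is one reason it is a sensitive measure of optimism in internal validation.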