Department of Statistics, University of Oxford, 1 South Parks Road, Oxford, OX1 3TG, United Kingdom.
J Proteome Res. 2011 Dec 2;10(12):5562-7. doi: 10.1021/pr200507b. Epub 2011 Nov 8.
In biomarker discovery studies, the uncertainty associated with case and control labels is often overlooked. When label uncertainty is ignored, estimates of model parameters and predictive risk can become biased, sometimes severely. The most common situation is a control set that contains an unknown number of undiagnosed, or future, cases. This has a marked impact where the model needs to be well calibrated, e.g., when the prediction performance of a biomarker panel is evaluated. Failing to account for class label uncertainty may lead to underestimation of classification performance and bias in parameter estimates, which in turn can affect meta-analyses that combine evidence from multiple studies. Using a simulation study, we outline how conventional statistical models can be modified to address class label uncertainty, leading to well-calibrated prediction performance estimates and reduced bias in meta-analysis. We focus on the problem of mislabeled control subjects in case-control studies, i.e., when some of the control subjects are undiagnosed cases, although the procedures we report are generic. Uncertainty in control status is particularly common in biomarker discovery studies in genomic and molecular epidemiology, where control subjects are commonly sampled from the general population with an established expected disease incidence rate.
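The abstract does not specify the modified model, so the following is only an illustrative sketch of the general idea, not the authors' procedure. It assumes a single biomarker `x`, a logistic disease model, and a known probability `rho` that a true case goes undiagnosed and is recorded as a control, so that P(recorded case | x) = (1 - rho) * sigmoid(b0 + b1 * x). All names (`simulate`, `fit`, `rho`) and all parameter values are hypothetical. Setting `rho=0` recovers ordinary logistic regression, which ignores the label uncertainty.

```python
import math
import random


def sigmoid(t):
    # Clip the linear predictor so math.exp cannot overflow.
    t = max(-30.0, min(30.0, t))
    return 1.0 / (1.0 + math.exp(-t))


def simulate(n, b0, b1, rho, seed=0):
    """Simulate one biomarker and recorded labels with hidden cases (assumed setup)."""
    rng = random.Random(seed)
    xs, zs = [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        is_case = rng.random() < sigmoid(b0 + b1 * x)
        # A true case is recorded as a case only if it was diagnosed;
        # with probability rho it ends up among the controls.
        diagnosed = is_case and rng.random() >= rho
        xs.append(x)
        zs.append(1 if diagnosed else 0)
    return xs, zs


def fit(xs, zs, rho=0.0, n_iter=25):
    """Fisher scoring for P(z=1 | x) = (1 - rho) * sigmoid(b0 + b1 * x).

    rho is treated as known; rho=0 gives standard logistic regression.
    """
    b0 = b1 = 0.0
    for _ in range(n_iter):
        g0 = g1 = i00 = i01 = i11 = 0.0
        for x, z in zip(xs, zs):
            p = sigmoid(b0 + b1 * x)      # probability of being a true case
            q = (1.0 - rho) * p           # probability of being recorded as a case
            # Score and expected information with respect to the linear predictor.
            score = (z - q) * (1.0 - p) / (1.0 - q)
            w = q * (1.0 - p) ** 2 / (1.0 - q)
            g0 += score
            g1 += score * x
            i00 += w
            i01 += w * x
            i11 += w * x * x
        # Solve the 2x2 system (information matrix) * step = gradient.
        det = i00 * i11 - i01 * i01
        b0 += (i11 * g0 - i01 * g1) / det
        b1 += (i00 * g1 - i01 * g0) / det
    return b0, b1


# Hypothetical settings: true intercept -1.0, true slope 1.5, 30% of cases hidden.
xs, zs = simulate(n=20000, b0=-1.0, b1=1.5, rho=0.3, seed=1)
naive = fit(xs, zs, rho=0.0)      # ignores label uncertainty
corrected = fit(xs, zs, rho=0.3)  # accounts for it through the known rho
print("naive     (b0, b1):", naive)
print("corrected (b0, b1):", corrected)
```

In this setup the naive fit is expected to attenuate the biomarker effect, because the recorded case probability saturates at 1 - rho rather than 1, while the corrected fit targets the parameters of the underlying disease model. In practice `rho` would have to be derived from external information such as the expected incidence rate in the population from which controls are drawn; treating it as exactly known is a simplification.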