Xie Shuilian, Braga-Neto Ulisses M
Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX, USA.
Cancer Inform. 2019 Jul 15;18:1176935119860822. doi: 10.1177/1176935119860822. eCollection 2019.
Observational case-control studies for biomarker discovery in cancer research often collect data that are sampled separately from the case and control populations. We present an analysis of the bias in the estimation of the precision of classifiers designed on separately sampled data. The analysis consists of both theoretical and numerical results, which show that classifier precision estimates can display strong bias under separate sampling, with the bias magnitude depending on the difference between the true case prevalence in the population and the sample prevalence in the data. We show that this bias is systematic in the sense that it cannot be reduced by increasing sample size. If information about the true case prevalence is available from public health records, then a modified precision estimator that uses the known prevalence displays smaller bias, which can in fact be reduced to zero as sample size increases under regularity conditions on the classification algorithm. The accuracy of the theoretical analysis and the performance of the precision estimators under separate sampling are confirmed by numerical experiments using synthetic data and real data from published observational case-control studies. The results with real data confirm that, under separate sampling, the usual estimator produces larger, i.e., more optimistic, precision estimates than the estimator using the true prevalence value.
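The contrast between the two estimators can be illustrated with a minimal sketch. It assumes the prevalence-based estimator takes the standard Bayes-rule form, precision = sens * prev / (sens * prev + (1 - spec) * (1 - prev)), with sensitivity and specificity estimated within the case and control samples; the abstract does not give the exact formula, so this form, the function names, and the counts below are illustrative assumptions rather than the authors' implementation or data.

```python
def naive_precision(tp, fp):
    """Usual estimator: fraction of predicted positives that are true cases.

    Under separate sampling this implicitly uses the sample prevalence
    (cases / total), not the population prevalence.
    """
    return tp / (tp + fp)


def prevalence_adjusted_precision(tp, fn, fp, tn, prevalence):
    """Bayes-rule precision using an externally known case prevalence.

    Assumed form (not taken from the paper): sensitivity and false-positive
    rate are estimated within the case and control samples, then combined
    with the known population prevalence.
    """
    sensitivity = tp / (tp + fn)          # estimated from the case sample only
    false_positive_rate = fp / (fp + tn)  # estimated from the control sample only
    numerator = sensitivity * prevalence
    return numerator / (numerator + false_positive_rate * (1.0 - prevalence))


# Hypothetical confusion-matrix counts: 100 cases and 100 controls,
# so the sample prevalence is 0.5.
tp, fn, fp, tn = 80, 20, 10, 90

naive = naive_precision(tp, fp)
adjusted = prevalence_adjusted_precision(tp, fn, fp, tn, prevalence=0.05)

print(f"naive precision:    {naive:.4f}")
print(f"adjusted precision: {adjusted:.4f}")
```

With these hypothetical counts the naive estimate is about 0.89, while the estimate using an assumed true prevalence of 0.05 is about 0.30. Setting `prevalence` equal to the sample prevalence (0.5 here) makes the two estimates coincide, which is consistent with the bias depending on the gap between the true and sample prevalences.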