Correcting Classifiers for Sample Selection Bias in Two-Phase Case-Control Studies

Authors

Krautenbacher Norbert, Theis Fabian J, Fuchs Christiane

Affiliations

Institute of Computational Biology, Helmholtz Zentrum München, German Research Center for Environmental Health, Munich, Germany.

Department of Mathematics, Technische Universität München, Munich, Germany.

Publication information

Comput Math Methods Med. 2017;2017:7847531. doi: 10.1155/2017/7847531. Epub 2017 Sep 24.

PMID:29312464

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5632994/

Abstract

Epidemiological studies often utilize stratified data in which rare outcomes or exposures are artificially enriched. This design can increase precision in association tests but distorts predictions when applying classifiers on nonstratified data. Several methods correct for this so-called sample selection bias, but their performance remains unclear especially for machine learning classifiers. With an emphasis on two-phase case-control studies, we aim to assess which corrections to perform in which setting and to obtain methods suitable for machine learning techniques, especially the random forest. We propose two new resampling-based methods to resemble the original data and covariance structure: stochastic inverse-probability oversampling and parametric inverse-probability bagging. We compare all techniques for the random forest and other classifiers, both theoretically and on simulated and real data. Empirical results show that the random forest profits from only the parametric inverse-probability bagging proposed by us. For other classifiers, correction is mostly advantageous, and methods perform uniformly. We discuss consequences of inappropriate distribution assumptions and reason for different behaviors between the random forest and other classifiers. In conclusion, we provide guidance for choosing correction methods when training classifiers on biased samples. For random forests, our method outperforms state-of-the-art procedures if distribution assumptions are roughly fulfilled. We provide our implementation in the R package .
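The resampling idea behind inverse-probability corrections can be illustrated with a minimal sketch. This is a generic weighted resampling example, not the stochastic inverse-probability oversampling or parametric inverse-probability bagging procedures proposed by the authors, and it is not taken from their R package. The function name, the record layout, and the selection probabilities below are hypothetical and chosen only for illustration: each record is redrawn with a weight equal to the inverse of its stratum's selection probability, so the resample approximates the class proportions of the unstratified population.

```python
import random


def inverse_probability_resample(records, selection_prob, n_draws, seed=0):
    """Draw n_draws records with replacement, weighting each record by
    1 / (selection probability of its stratum).

    Hypothetical helper for illustration; not the authors' implementation.
    """
    rng = random.Random(seed)
    weights = [1.0 / selection_prob[r["stratum"]] for r in records]
    return rng.choices(records, weights=weights, k=n_draws)


# Hypothetical enriched sample: cases are over-represented (50 vs. 50).
sample = (
    [{"stratum": "case", "y": 1} for _ in range(50)]
    + [{"stratum": "control", "y": 0} for _ in range(50)]
)

# Assumed selection probabilities: every case was sampled, but only
# 50 of 950 controls were, so each sampled control stands for 19 controls.
selection_prob = {"case": 1.0, "control": 50 / 950}

resampled = inverse_probability_resample(sample, selection_prob, n_draws=10_000)
case_share = sum(r["y"] for r in resampled) / len(resampled)
print(f"case share in enriched sample: {50 / 100:.2f}")
print(f"case share after resampling:   {case_share:.3f}")
```

Under these assumed probabilities the case share in the resample should fall close to the population prevalence of 5%, whereas the enriched sample has 50% cases. A classifier trained on the resample would therefore see class proportions closer to those of the nonstratified data.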