Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD 21205.
Department of Statistics, University of Washington, Seattle, WA 98195.
Proc Natl Acad Sci U S A. 2020 Dec 1;117(48):30266-30275. doi: 10.1073/pnas.2001238117. Epub 2020 Nov 18.
Many modern problems in medicine and public health leverage machine-learning methods to predict outcomes based on observable covariates. In a wide array of settings, predicted outcomes are used in subsequent statistical analyses, often without accounting for the distinction between observed and predicted outcomes. We call inference with predicted outcomes postprediction inference. In this paper, we develop methods for correcting statistical inference using outcomes predicted with arbitrarily complex machine-learning models, including random forests and deep neural nets. Rather than deriving the correction from first principles for each machine-learning algorithm, we observe that there is typically a low-dimensional and easily modeled representation of the relationship between the observed and predicted outcomes. We build an approach for postprediction inference that fits naturally into the standard machine-learning framework in which the data are divided into training, testing, and validation sets. We train the prediction model on the training set, estimate the relationship between the observed and predicted outcomes on the testing set, and use that relationship to correct subsequent inference on the validation set. We show that our postprediction inference (postpi) approach can correct bias and improve variance estimation, and thereby statistical inference, with predicted outcomes. To demonstrate its broad applicability, we show that postpi improves inference in two distinct fields: modeling predicted phenotypes in repurposed gene expression data and modeling predicted causes of death in verbal autopsy data. Our method is available through an open-source R package: https://github.com/leekgroup/postpi.
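The three-step workflow described above (train on one split, model the observed-vs-predicted relationship on a second, correct inference on a third) can be illustrated with a minimal simulation. This is a hedged sketch, not the paper's actual method: the released package is in R and the full procedure uses more elaborate (e.g., bootstrap-based) corrections, whereas here a deliberately shrunken linear fit stands in for a complex machine-learning predictor, and the correction is a simple linear model of observed outcomes on predicted outcomes.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3000
x = rng.normal(size=n)                  # covariate of scientific interest
y = 2.0 * x + rng.normal(size=n)        # observed outcome; true slope = 2.0

# Divide the data into training, testing, and validation thirds.
idx = rng.permutation(n)
tr, te, va = np.array_split(idx, 3)

# Step 1: train a prediction model on the training set.
# A deliberately shrunken linear fit stands in for a complex ML model
# (shrinkage mimics the bias of regularized predictors).
coef = np.polyfit(x[tr], y[tr], 1)

def predict(xs):
    return 0.6 * np.polyval(coef, xs)

# Step 2: on the testing set, model the low-dimensional relationship
# between observed and predicted outcomes: y ~ g0 + g1 * y_hat.
g1, g0 = np.polyfit(predict(x[te]), y[te], 1)

# Step 3: on the validation set, correct the predicted outcomes
# before the downstream inference (here, regression of outcome on x).
yhat_va = predict(x[va])
y_corrected = g0 + g1 * yhat_va

beta_naive = np.polyfit(x[va], yhat_va, 1)[0]       # biased toward zero
beta_postpi = np.polyfit(x[va], y_corrected, 1)[0]  # near the true 2.0
```

Regressing the raw predictions on the covariate recovers a slope attenuated by the predictor's shrinkage, while the corrected outcomes recover a slope close to the true value, which is the bias-correction effect the abstract describes.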