Hou Jue, Guo Zijian, Cai Tianxi
Division of Biostatistics, University of Minnesota School of Public Health, Minneapolis, MN 55455, USA.
Department of Statistics, Rutgers University, Piscataway, NJ 08854-8019, USA.
J Mach Learn Res. 2023 Jan-Dec;24.
Risk modeling with electronic health records (EHR) data is challenging because the disease outcome is not directly observed and the predictors are high-dimensional. In this paper, we develop a surrogate-assisted semi-supervised learning approach that leverages a small labeled data set with annotated outcomes and a large unlabeled data set containing outcome surrogates and high-dimensional predictors. We propose to impute the unobserved outcomes by constructing a sparse imputation model with outcome surrogates and high-dimensional predictors. We further conduct a one-step bias correction to enable interval estimation for the risk prediction. Our inference procedure is valid even if both the imputation and risk prediction models are misspecified. Our novel way of utilizing unlabeled data enables high-dimensional statistical inference in the challenging setting of a dense risk prediction model. We present an extensive simulation study to demonstrate the superiority of our approach over existing supervised methods. We apply the method to genetic risk prediction of type-2 diabetes mellitus using an EHR biobank cohort.
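The imputation step described above can be illustrated with a minimal sketch. This is not the authors' implementation: it is one plausible reading of the abstract, run on simulated toy data, and it omits the one-step bias correction and the interval estimation entirely. All names (`fit_l1_logistic`, `simulate`, `imputation_model`, `risk_model`), the logistic form of both models, the penalty level, and the sample sizes are assumptions made for illustration.

```python
import math
import random

random.seed(0)

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def soft_threshold(v, t):
    # Proximal operator of the L1 penalty.
    return math.copysign(max(abs(v) - t, 0.0), v)

def fit_l1_logistic(rows, targets, lam, step=0.5, iters=300):
    # Proximal gradient descent for L1-penalized logistic regression.
    # Targets may be 0/1 labels or imputed probabilities in [0, 1].
    n, p = len(rows), len(rows[0])
    intercept, coef = 0.0, [0.0] * p
    for _ in range(iters):
        g0, g = 0.0, [0.0] * p
        for x, y in zip(rows, targets):
            r = sigmoid(intercept + sum(c * v for c, v in zip(coef, x))) - y
            g0 += r / n
            for j in range(p):
                g[j] += r * x[j] / n
        intercept -= step * g0  # the intercept is not penalized
        coef = [soft_threshold(c - step * gj, step * lam)
                for c, gj in zip(coef, g)]
    return intercept, coef

def predict(model, x):
    intercept, coef = model
    return sigmoid(intercept + sum(c * v for c, v in zip(coef, x)))

def simulate(n):
    # Toy data (assumed): predictors X, binary outcome Y, noisy surrogate S of Y.
    beta = [1.0, -1.0, 0.5, 0.0, 0.0]
    X, S, Y = [], [], []
    for _ in range(n):
        x = [random.gauss(0, 1) for _ in beta]
        prob = sigmoid(sum(b * v for b, v in zip(beta, x)))
        y = 1 if random.random() < prob else 0
        X.append(x)
        Y.append(y)
        S.append(y + random.gauss(0, 0.5))
    return X, S, Y

X_lab, S_lab, Y_lab = simulate(200)  # small labeled set
X_unl, S_unl, _ = simulate(500)      # larger set; outcomes treated as unobserved

# Step 1: sparse imputation model of Y on (S, X), fitted on labeled data only.
labeled_features = [[s] + x for s, x in zip(S_lab, X_lab)]
imputation_model = fit_l1_logistic(labeled_features, Y_lab, lam=0.02)

# Step 2: impute the unobserved outcomes on the unlabeled data.
Y_imputed = [predict(imputation_model, [s] + x) for s, x in zip(S_unl, X_unl)]

# Step 3: fit the risk model of the imputed outcome on X alone (no penalty here).
risk_model = fit_l1_logistic(X_unl, Y_imputed, lam=0.0)

print("imputation coefficients (surrogate first):", imputation_model[1])
print("risk model coefficients:", risk_model[1])
```

The sketch only shows how labeled and unlabeled data play different roles: the labeled set trains the imputation model, and the unlabeled set supplies the surrogates and predictors on which the imputed outcomes and the risk model are computed. Valid confidence intervals would additionally require the bias correction described in the abstract.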