Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD 21205.
Department of Statistics, University of Washington, Seattle, WA 98195.
Proc Natl Acad Sci U S A. 2020 Dec 1;117(48):30266-30275. doi: 10.1073/pnas.2001238117. Epub 2020 Nov 18.
Many modern problems in medicine and public health leverage machine-learning methods to predict outcomes based on observable covariates. In a wide array of settings, predicted outcomes are used in subsequent statistical analyses, often without accounting for the distinction between observed and predicted outcomes. We call inference with predicted outcomes postprediction inference. In this paper, we develop methods for correcting statistical inference using outcomes predicted with arbitrarily complex machine-learning models, including random forests and deep neural nets. Rather than deriving the correction from first principles for each machine-learning algorithm, we observe that there is typically a low-dimensional and easily modeled representation of the relationship between the observed and predicted outcomes. We build an approach for postprediction inference that fits naturally into the standard machine-learning framework in which the data are divided into training, testing, and validation sets. We train the prediction model on the training set, estimate the relationship between the observed and predicted outcomes on the testing set, and use that relationship to correct subsequent inference on the validation set. We show that our postprediction inference (postpi) approach can correct bias and improve variance estimation, and thereby statistical inference, with predicted outcomes. To demonstrate its broad applicability, we show that postpi improves inference in two distinct fields: modeling predicted phenotypes in repurposed gene expression data and modeling predicted causes of death in verbal autopsy data. Our method is available through an open-source R package: https://github.com/leekgroup/postpi.
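The three-step workflow described above (train on one split, model the observed-vs-predicted relationship on a second, correct inference on a third) can be illustrated with a minimal simulation. This is a hedged sketch, not the paper's actual method: the released package is in R and the full procedure uses more elaborate (e.g., bootstrap-based) corrections, whereas here a deliberately shrunken linear fit stands in for a complex machine-learning predictor, and the correction is a simple linear model of observed outcomes on predicted outcomes.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3000
x = rng.normal(size=n)                  # covariate of scientific interest
y = 2.0 * x + rng.normal(size=n)        # observed outcome; true slope = 2.0

# Divide the data into training, testing, and validation thirds.
idx = rng.permutation(n)
tr, te, va = np.array_split(idx, 3)

# Step 1: train a prediction model on the training set.
# A deliberately shrunken linear fit stands in for a complex ML model
# (shrinkage mimics the bias of regularized predictors).
coef = np.polyfit(x[tr], y[tr], 1)

def predict(xs):
    return 0.6 * np.polyval(coef, xs)

# Step 2: on the testing set, model the low-dimensional relationship
# between observed and predicted outcomes: y ~ g0 + g1 * y_hat.
g1, g0 = np.polyfit(predict(x[te]), y[te], 1)

# Step 3: on the validation set, correct the predicted outcomes
# before the downstream inference (here, regression of outcome on x).
yhat_va = predict(x[va])
y_corrected = g0 + g1 * yhat_va

beta_naive = np.polyfit(x[va], yhat_va, 1)[0]       # biased toward zero
beta_postpi = np.polyfit(x[va], y_corrected, 1)[0]  # near the true 2.0
```

Regressing the raw predictions on the covariate recovers a slope attenuated by the predictor's shrinkage, while the corrected outcomes recover a slope close to the true value, which is the bias-correction effect the abstract describes.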