Scalable collaborative targeted learning for high-dimensional data.

Author information

1 University of California, Berkeley, CA, USA.

2 Harvard Pilgrim Health Care Institute and Harvard Medical School, Boston, MA, USA.

Publication information

Stat Methods Med Res. 2019 Feb;28(2):532-554. doi: 10.1177/0962280217729845. Epub 2017 Sep 22.

DOI: 10.1177/0962280217729845

PMID: 28936917

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6086775/

Abstract

Robust inference of a low-dimensional parameter in a large semi-parametric model relies on external estimators of infinite-dimensional features of the distribution of the data. Typically, only one of the latter is optimized for the sake of constructing a well-behaved estimator of the low-dimensional parameter of interest. Optimizing more than one of them for the sake of achieving a better bias-variance trade-off in the estimation of the parameter of interest is the core idea driving the general template of the collaborative targeted minimum loss-based estimation (C-TMLE) procedure. The original instantiation of the C-TMLE template can be presented as a greedy forward stepwise C-TMLE algorithm. It does not scale well when the number p of covariates increases drastically. This motivates the introduction of a novel instantiation of the C-TMLE template where the covariates are pre-ordered. Its time complexity is O(p) as opposed to the original O(p^2), a remarkable gain. We propose two pre-ordering strategies and suggest a rule of thumb to develop other meaningful strategies. Because it is usually unclear a priori which pre-ordering strategy to choose, we also introduce another instantiation called SL-C-TMLE algorithm that enables the data-driven choice of the better pre-ordering strategy given the problem at hand. Its time complexity is O(p) as well. The computational burden and relative performance of these algorithms were compared in simulation studies involving fully synthetic data or partially synthetic data based on a real-world large electronic health database, and in analyses of three real, large electronic health databases. In all analyses involving electronic health databases, the greedy C-TMLE algorithm is unacceptably slow. Simulation studies seem to indicate that our scalable C-TMLE and SL-C-TMLE algorithms work well. All C-TMLEs are publicly available in a Julia software package.
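The scaling difference between a greedy forward stepwise search and a pre-ordered one can be illustrated by counting candidate fits. The sketch below is not the authors' Julia implementation and does not perform any targeting step; it only mimics the two search skeletons. The names `greedy_forward`, `preordered`, and `toy_score` are hypothetical, and `toy_score` is an assumed stand-in for the expensive "fit a candidate estimator and evaluate its loss" step.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-in for "fit the candidate estimator and return its loss".
# Lower is better. In a real C-TMLE each call would be an expensive model fit.
Score = Callable[[Tuple[int, ...]], float]


def greedy_forward(p: int, score: Score) -> Tuple[List[int], int]:
    """Greedy forward stepwise search over p covariates.

    At every step, each remaining covariate is tried and the best one is kept,
    so the number of score evaluations is p + (p - 1) + ... + 1 = p(p + 1)/2.
    """
    selected: List[int] = []
    remaining = list(range(p))
    evaluations = 0
    while remaining:
        scored = [(score(tuple(selected + [j])), j) for j in remaining]
        evaluations += len(scored)
        _, best = min(scored)
        selected.append(best)
        remaining.remove(best)
    return selected, evaluations


def preordered(order: Sequence[int], score: Score) -> Tuple[List[float], int]:
    """Pre-ordered search: covariates enter in a fixed, user-supplied order.

    Only one candidate is evaluated per step, so the count is exactly p.
    """
    selected: List[int] = []
    losses: List[float] = []
    for j in order:
        selected.append(j)
        losses.append(score(tuple(selected)))
    return losses, len(losses)


def toy_score(subset: Tuple[int, ...]) -> float:
    """Toy loss (an assumption for illustration): covariates with a larger
    index are more useful, and every added covariate lowers the loss."""
    return -float(sum(j + 1 for j in subset))


if __name__ == "__main__":
    p = 10
    path, n_greedy = greedy_forward(p, toy_score)
    _, n_pre = preordered(range(p), toy_score)
    print(path[:3], n_greedy, n_pre)
```

With p = 10 the greedy search makes 55 evaluations and the pre-ordered search makes 10. If a fixed number K of pre-ordering strategies is compared, as a data-driven selector might do, the count is K times p, which still grows linearly in p.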
