Finding the Optimal Number of Splits and Repetitions in Double Cross-Fitting Targeted Maximum Likelihood Estimators.

Author information

Karim Mohammad Ehsanul, Mondol Momenul Haque

Affiliations

School of Population and Public Health, University of British Columbia, Vancouver, British Columbia, Canada.

Centre for Advancing Health Outcomes, University of British Columbia, Vancouver, British Columbia, Canada.

Publication information

Pharm Stat. 2025 Sep-Oct;24(5):e70022. doi: 10.1002/pst.70022.

DOI:10.1002/pst.70022
PMID:40935595
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12425639/
Abstract

Flexible machine learning algorithms are increasingly used in real-world data analyses. When integrated within doubly robust methods, such as the Targeted Maximum Likelihood Estimator (TMLE), complex estimators can result in significant undercoverage, an issue that is even more pronounced in singly robust methods. The Double Cross-Fitting (DCF) procedure complements these methods by enabling the use of diverse machine learning estimators, yet optimal guidelines for the number of data splits and repetitions remain unclear. This study explores the effects of varying the number of splits and repetitions in DCF on TMLE estimators through statistical simulations and a data analysis. We discuss two generalizations of DCF beyond the conventional three splits and apply a range of splits to fit the TMLE estimator, incorporating a super learner without transforming covariates. The statistical properties of these configurations are compared across two sample sizes (3000 and 5000) and two DCF generalizations (equal splits and full data use). Additionally, we conduct a real-world analysis using data from the National Health and Nutrition Examination Survey (NHANES) 2017-18 cycle to illustrate the practical implications of varying DCF splits, focusing on the association between obesity and the risk of developing diabetes. Our simulation study reveals that five splits in DCF yield satisfactory bias, variance, and coverage across scenarios. In the real-world application, the DCF TMLE method showed consistent risk difference estimates over a range of splits, though standard errors increased with more splits in one generalization, suggesting potential drawbacks to excessive splitting. This research underscores the importance of judicious selection of the number of splits and repetitions in DCF TMLE methods to balance computational efficiency against accurate statistical inference. Optimal performance seems attainable with three to five splits. Among the generalizations considered, using full data for nuisance estimation offered more consistent variance estimation and is preferable for applied use. Additionally, increasing the repetitions beyond 25 did not enhance performance, providing crucial guidance for researchers employing complex machine learning algorithms in causal studies and advocating for cautious split management in DCF procedures.
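The split-and-repeat bookkeeping that the abstract refers to can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (`make_splits`, `rotate_roles`, `median_aggregate`) are hypothetical, the role rotation is only one plausible layout for the conventional three-split case, and the median-based aggregation with a spread-adjusted variance is an assumption borrowed from common cross-fitting practice rather than something the abstract states. Nuisance model fitting and the TMLE update itself are deliberately left out.

```python
import random
import statistics


def make_splits(n, k, seed):
    """Randomly partition indices 0..n-1 into k disjoint, near-equal splits."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def rotate_roles(k=3):
    """Hypothetical role layout: in each rotation, three distinct splits
    serve the treatment model, the outcome model, and the estimation step."""
    if k < 3:
        raise ValueError("double cross-fitting needs at least three splits")
    return [
        {"treatment": r % k, "outcome": (r + 1) % k, "estimate": (r + 2) % k}
        for r in range(k)
    ]


def median_aggregate(estimates, variances):
    """Combine per-repetition results (assumed rule, not from the abstract):
    median point estimate, and the median of each repetition's variance
    plus its squared deviation from that median estimate."""
    est = statistics.median(estimates)
    var = statistics.median(
        v + (e - est) ** 2 for e, v in zip(estimates, variances)
    )
    return est, var


if __name__ == "__main__":
    # Each repetition would use a fresh random partition of the sample.
    splits = make_splits(n=3000, k=3, seed=0)
    print([len(s) for s in splits])
    print(rotate_roles(3))
    # Placeholder per-repetition estimates and variances, for illustration only.
    print(median_aggregate([0.1, 0.2, 0.3], [0.01, 0.01, 0.01]))
```

In this sketch, raising the number of repetitions only adds more entries to the lists passed to `median_aggregate`, while raising the number of splits shrinks each split, which is the trade-off the abstract examines.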