Karim Mohammad Ehsanul, Mondol Momenul Haque
School of Population and Public Health, University of British Columbia, Vancouver, British Columbia, Canada.
Centre for Advancing Health Outcomes, University of British Columbia, Vancouver, British Columbia, Canada.
Pharm Stat. 2025 Sep-Oct;24(5):e70022. doi: 10.1002/pst.70022.
Flexible machine learning algorithms are increasingly utilized in real-world data analyses. When integrated within double robust methods, such as the Targeted Maximum Likelihood Estimator (TMLE), complex estimators can result in significant undercoverage, an issue that is even more pronounced in singly robust methods. The Double Cross-Fitting (DCF) procedure complements these methods by enabling the use of diverse machine learning estimators, yet optimal guidelines for the number of data splits and repetitions remain unclear. This study aims to explore the effects of varying the number of splits and repetitions in DCF on TMLE estimators through statistical simulations and a data analysis. We discuss two generalizations of DCF beyond the conventional three splits and apply a range of splits to fit the TMLE estimator, incorporating a super learner without transforming covariates. The statistical properties of these configurations are compared across two sample sizes (3000 and 5000) and two DCF generalizations (equal splits and full data use). Additionally, we conduct a real-world analysis using data from the National Health and Nutrition Examination Survey (NHANES) 2017-18 cycle to illustrate the practical implications of varying DCF splits, focusing on the association between obesity and the risk of developing diabetes. Our simulation study reveals that five splits in DCF yield satisfactory bias, variance, and coverage across scenarios. In the real-world application, the DCF TMLE method showed consistent risk difference estimates over a range of splits, though standard errors increased with more splits in one generalization, suggesting potential drawbacks to excessive splitting. This research underscores the importance of judicious selection of the number of splits and repetitions in DCF TMLE methods to achieve a balance between computational efficiency and accurate statistical inference. Optimal performance seems attainable with three to five splits. Among the generalizations considered, using full data for nuisance estimation offered more consistent variance estimation and is preferable for applied use. Additionally, increasing the repetitions beyond 25 did not enhance performance, providing crucial guidance for researchers employing complex machine learning algorithms in causal studies and advocating for cautious split management in DCF procedures.
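As a rough illustration of what "splits" and "repetitions" mean in a DCF workflow, the following Python sketch shows one possible way to rotate fold roles and to combine per-repetition results. It is not the authors' implementation: the function names `dcf_fold_roles` and `aggregate_repetitions` are hypothetical, the role rotation is only one assumed way to go beyond three splits (the two generalizations studied in the paper may assign folds differently), and the median-based aggregation is a commonly used convention that is assumed here rather than taken from the abstract.

```python
import random
from statistics import median


def dcf_fold_roles(n, n_splits, seed):
    """Assign fold roles for one DCF repetition (hypothetical helper).

    Indices 0..n-1 are shuffled into n_splits near-equal folds. For each
    rotation, one fold is used for estimation and two *different* folds
    are used to fit the treatment and outcome nuisance models.
    Assumption: with more than three splits, the remaining folds are
    simply left unused in that rotation.
    """
    if n_splits < 3:
        raise ValueError("double cross-fitting needs at least three splits")
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    folds = [idx[k::n_splits] for k in range(n_splits)]
    return [
        {
            "estimate": folds[k],
            "treatment_model": folds[(k + 1) % n_splits],
            "outcome_model": folds[(k + 2) % n_splits],
        }
        for k in range(n_splits)
    ]


def aggregate_repetitions(estimates, variances):
    """Combine per-repetition point estimates and variances by medians.

    Assumed rule: the point estimate is the median across repetitions,
    and each repetition's variance is inflated by its squared distance
    from that median before the median variance is taken.
    """
    point = median(estimates)
    adjusted = [v + (e - point) ** 2 for e, v in zip(estimates, variances)]
    return point, median(adjusted)
```

In use, each repetition would call `dcf_fold_roles` with a different seed, fit the nuisance models and the TMLE update on the assigned folds, and pass the resulting estimates and variances to `aggregate_repetitions`; the model fitting itself is deliberately left out of this sketch.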