Butzin-Dozier Zachary, Qiu Sky, Hubbard Alan E, Shi Junming Seraphina, van der Laan Mark J
Department of Biostatistics, University of California, Berkeley, Berkeley, CA 94704.
medRxiv. 2024 Oct 19:2024.10.18.24315778. doi: 10.1101/2024.10.18.24315778.
Understanding treatment effects on health-related outcomes using real-world data requires defining a causal parameter and imposing relevant identification assumptions to translate it into a statistical estimand. Semiparametric methods, like the targeted maximum likelihood estimator (TMLE), have been developed to construct asymptotically linear estimators of these parameters. To further establish the asymptotic efficiency of these estimators, two conditions must be met: 1) the relevant components of the data likelihood must fall within a Donsker class, and 2) the estimates of nuisance parameters must converge to their true values at a rate faster than $n^{-1/4}$. The Highly Adaptive LASSO (HAL) satisfies these criteria by acting as an empirical risk minimizer within a class of functions with a bounded sectional variation norm, which is known to be Donsker. HAL achieves the desired rate of convergence, thereby guaranteeing the estimators' asymptotic efficiency. The function class over which HAL minimizes its risk is flexible enough to capture realistic functions while maintaining the conditions for establishing efficiency. Additionally, HAL enables robust inference for non-pathwise differentiable parameters, such as the conditional average treatment effect (CATE) and causal dose-response curve, which are important in precision health. While these parameters are often considered in machine learning literature, these applications typically lack proper statistical inference. HAL addresses this gap by providing reliable statistical uncertainty quantification that is essential for informed decision-making in health research.
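To make the phrase "empirical risk minimizer within a class of functions with a bounded sectional variation norm" concrete, the following is a minimal sketch for a single covariate, not the authors' implementation. It assumes squared-error loss, indicator basis functions 1{x >= knot} placed at the observed data points, and a plain coordinate-descent lasso solver. In this one-dimensional setting the L1 norm of the coefficients equals the variation norm of the fitted step function, so the penalty `lam` controls that norm. All function names (`fit_hal_1d`, `predict_hal_1d`, `variation_norm`) are hypothetical.

```python
import math


def soft_threshold(z, lam):
    """Soft-thresholding operator used by the lasso coordinate update."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def fit_hal_1d(xs, ys, lam, n_sweeps=200):
    """Illustrative 1-D HAL-style fit: lasso over indicators 1{x >= knot}.

    Minimizes (1/2n) * sum (y - b0 - sum_j beta_j * 1{x >= knot_j})^2
    + lam * sum_j |beta_j| by coordinate descent (intercept unpenalized).
    """
    n = len(xs)
    # Knots at observed points; the smallest is dropped because its
    # indicator is constant and duplicates the intercept.
    knots = sorted(set(xs))[1:]
    p = len(knots)
    X = [[1.0 if x >= k else 0.0 for k in knots] for x in xs]
    # For 0/1 columns the mean of squares equals the column mean.
    col_mean = [sum(X[i][j] for i in range(n)) / n for j in range(p)]
    b0 = sum(ys) / n
    beta = [0.0] * p
    resid = [y - b0 for y in ys]
    for _ in range(n_sweeps):
        # Unpenalized intercept update: re-center the residuals.
        shift = sum(resid) / n
        b0 += shift
        resid = [r - shift for r in resid]
        for j in range(p):
            # Correlation of column j with the partial residual.
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n
            rho += col_mean[j] * beta[j]
            new = soft_threshold(rho, lam) / col_mean[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return knots, b0, beta


def predict_hal_1d(x, knots, b0, beta):
    """Evaluate the fitted step function at x."""
    return b0 + sum(b for k, b in zip(knots, beta) if x >= k)


def variation_norm(beta):
    """L1 norm of the coefficients (variation norm of the 1-D fit)."""
    return sum(abs(b) for b in beta)


if __name__ == "__main__":
    # Synthetic step function with small deterministic perturbations.
    xs = [i / 40 for i in range(40)]
    ys = [(1.0 if x >= 0.5 else 0.0) + 0.05 * math.sin(37 * i)
          for i, x in enumerate(xs)]
    for lam in (0.1, 0.01, 0.001):
        knots, b0, beta = fit_hal_1d(xs, ys, lam)
        print(lam, round(variation_norm(beta), 3))
```

In practice the penalty would be chosen by cross-validation, and with several covariates the basis would also include tensor products of indicators across covariate subsets; both are omitted here to keep the sketch short.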