McCoy David, Hubbard Alan, Van der Laan Mark
Division of Environmental Health Sciences, University of California, Berkeley, CA, United States of America.
Department of Biostatistics, University of California, Berkeley, CA, United States of America.
J Open Source Softw. 2023;8(82). doi: 10.21105/joss.04181. Epub 2023 Feb 21.
Statistical causal inference for mixed exposures has been limited by reliance on parametric models and, until recently, by researchers considering only one exposure at a time, with the effect usually estimated as a beta coefficient in a generalized linear regression model (GLM). Assessing exposures independently in this way poorly estimates the joint impact of the same exposures in a realistic exposure setting. Marginal methods for mixture variable selection, such as ridge/lasso regression, are biased by linear assumptions, and the interactions they model are chosen by the user. Clustering methods such as principal component regression lose both interpretability and valid inference. Newer mixture methods such as quantile g-computation (Keil et al., 2020) are biased by linear/additive assumptions. More flexible methods such as Bayesian kernel machine regression (BKMR) (Bobb et al., 2014) are sensitive to the choice of tuning parameters, are computationally taxing, and lack an interpretable and robust summary statistic of dose-response relationships. No method currently exists that finds the best flexible model to adjust for covariates while applying a non-parametric model that targets interactions in a mixture and delivers valid inference for a target parameter. Non-parametric methods such as decision trees are a useful tool for evaluating combined exposures because they find partitions in the joint-exposure (mixture) space that best explain the variance in an outcome. However, current methods that use decision trees to assess statistical inference for interactions are biased and prone to overfitting because they use the full data both to identify nodes in the tree and to make statistical inference given those nodes. Other methods derive inference from an independent test set and therefore do not use the full data.
The CVtreeMLE R package provides researchers in (bio)statistics, epidemiology, and environmental health sciences with access to state-of-the-art statistical methodology for evaluating the causal effects of a data-adaptively determined mixed exposure using decision trees. Our target audience is analysts who would normally use a potentially biased GLM-based model for a mixed exposure. Instead, we hope to provide users with a non-parametric statistical machine in which users simply specify the exposures, covariates, and outcome; CVtreeMLE then determines whether a best-fitting decision tree exists and delivers interpretable results.
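The overfitting concern described above (using the same observations both to find a partition and to estimate its effect) can be illustrated with a minimal sketch. This is not the CVtreeMLE API and not a targeted maximum likelihood estimator. It assumes simulated data, a single-threshold rule standing in for a decision tree, and a plain difference in means with no covariate adjustment. All function names (`simulate`, `find_rule`, `cross_validated_effect`) are hypothetical. The only idea it shows is that the rule is chosen on training folds and its effect is estimated on the held-out fold, so every observation contributes to estimation without being used to select the rule it is evaluated under.

```python
import random
import statistics


def simulate(n, effect=2.0, seed=0):
    """Hypothetical data: the outcome shifts by `effect` when exposure 1 exceeds 0.6."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        a1, a2 = rng.random(), rng.random()
        y = effect * (a1 > 0.6) + rng.gauss(0.0, 1.0)
        data.append((a1, a2, y))
    return data


def mean_difference(rows, j, cut):
    """Mean outcome where exposure j is above `cut` minus the mean where it is not."""
    high = [r[2] for r in rows if r[j] > cut]
    low = [r[2] for r in rows if r[j] <= cut]
    if len(high) < 5 or len(low) < 5:
        return None  # too few observations on one side to compare
    return statistics.fmean(high) - statistics.fmean(low)


def find_rule(rows, grid):
    """Data-adaptive step: pick the (exposure, cut) with the largest absolute difference."""
    best = None
    for j in (0, 1):
        for cut in grid:
            d = mean_difference(rows, j, cut)
            if d is not None and (best is None or abs(d) > abs(best[2])):
                best = (j, cut, d)
    return best


def cross_validated_effect(data, k=5, grid=None):
    """Select the rule on k-1 folds, estimate its effect on the held-out fold, then average."""
    grid = grid or [i / 10 for i in range(1, 10)]
    folds = [data[i::k] for i in range(k)]
    estimates = []
    for i, validation in enumerate(folds):
        training = [r for m, fold in enumerate(folds) if m != i for r in fold]
        j, cut, _ = find_rule(training, grid)
        est = mean_difference(validation, j, cut)
        if est is not None:
            estimates.append(est)
    return statistics.fmean(estimates)


if __name__ == "__main__":
    data = simulate(500, seed=1)
    grid = [i / 10 for i in range(1, 10)]
    print("rule and estimate from the full data:", find_rule(data, grid))
    print("cross-validated estimate:", cross_validated_effect(data))
```

Calling `simulate` with `effect=0.0` gives a setting with no true effect, where the full-data estimate is the largest of many noisy differences while the cross-validated estimate is computed on observations that played no part in choosing the rule.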