Liu Yahang, Gao Qian, Wei Kecheng, Huang Chen, Wang Ce, Yu Yongfu, Qin Guoyou, Wang Tong
Department of Biostatistics, School of Public Health, Fudan University, Shanghai, China.
Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China.
Brief Bioinform. 2024 Jan 22;25(2). doi: 10.1093/bib/bbae059.
Variable selection for causal inference in high-dimensional data has attracted growing interest. However, when the outcome is skewed, accurate variable selection and causal effect estimation can be difficult to achieve. Here, we introduce the generalized median adaptive lasso (GMAL), a covariate selection method that yields accurate causal effect estimates even when the outcome follows a skewed distribution. Its distinctive feature is that the penalty weights are constructed from a linear median regression model, which preserves the accuracy of variable selection and causal effect estimation even under extremely skewed outcome distributions. In simulations, GMAL performed comparably to existing methods in variable selection when the outcome was symmetrically distributed and clearly outperformed them when the outcome was skewed. It also consistently outperformed existing methods in causal effect estimation, as indicated by a smaller root-mean-square error. Finally, we applied GMAL to a deoxyribonucleic acid (DNA) methylation dataset from the Alzheimer's Disease Neuroimaging Initiative database to investigate the association between cerebrospinal fluid tau protein levels and the severity of Alzheimer's disease (AD).
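The idea of building penalty weights from a median regression can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the GMAL objective, so the code assumes an outcome-adaptive-lasso style setup in which covariate coefficients from a median regression of the outcome define the weights of an L1-penalized logistic propensity score model. The simulated design, the helper names `median_regression` and `weighted_l1_logistic`, and the values of `gamma` and `lam` are all hypothetical choices for illustration. The median regression is approximated by iteratively reweighted least squares, and the tuning parameters are fixed rather than selected from the data.

```python
import numpy as np


def median_regression(X, y, n_iter=200, eps=1e-6):
    """Approximate least-absolute-deviation (median) regression via IRLS."""
    Xd = np.column_stack([np.ones(len(y)), X])      # prepend an intercept
    beta = np.linalg.lstsq(Xd, y, rcond=None)[0]    # ordinary least squares start
    for _ in range(n_iter):
        resid = np.abs(y - Xd @ beta)
        w = 1.0 / np.maximum(resid, eps)            # IRLS weights for the L1 loss
        XtW = Xd.T * w
        beta_new = np.linalg.solve(XtW @ Xd, XtW @ y)
        converged = np.max(np.abs(beta_new - beta)) < 1e-8
        beta = beta_new
        if converged:
            break
    return beta                                     # beta[0] is the intercept


def weighted_l1_logistic(X, a, penalty_weights, lam, n_iter=2000):
    """Proximal gradient for logistic loss + lam * sum_j w_j * |alpha_j|."""
    n, p = X.shape
    Xd = np.column_stack([np.ones(n), X])
    # Step size 1/L, with L a Lipschitz bound for the mean logistic loss gradient.
    step = 4.0 * n / (np.linalg.norm(Xd, 2) ** 2)
    # The intercept is left unpenalized.
    thresh = step * lam * np.concatenate([[0.0], penalty_weights])
    alpha = np.zeros(p + 1)
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(Xd @ alpha)))
        z = alpha - step * (Xd.T @ (prob - a) / n)
        alpha = np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)  # soft-threshold
    return alpha


# Hypothetical simulated design (not from the paper):
# columns 0-1 affect treatment and outcome, 2-3 outcome only,
# 4-5 treatment only, 6-7 neither.
rng = np.random.default_rng(0)
n, p = 1000, 8
X = rng.standard_normal((n, p))
logit = X[:, 0] + X[:, 1] + X[:, 4] + X[:, 5]
A = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))
# Right-skewed (log-normal) noise in the outcome.
Y = 2.0 * A + X[:, :4].sum(axis=1) + rng.lognormal(0.0, 1.0, n)

# Step 1: median regression of the outcome on treatment and covariates.
beta = median_regression(np.column_stack([A, X]), Y)
beta_cov = beta[2:]                                 # drop intercept and treatment

# Step 2: adaptive penalty weights; gamma = 2 is an arbitrary illustrative value.
gamma = 2.0
pen_w = np.maximum(np.abs(beta_cov), 1e-8) ** (-gamma)

# Step 3: weighted L1-penalized propensity score model; lam is fixed by hand here.
alpha = weighted_l1_logistic(X, A, pen_w, lam=0.02)
selected = np.flatnonzero(alpha[1:] != 0)
print("penalty weights:", np.round(pen_w, 2))
print("selected covariate indices:", selected)
```

Covariates with a large median regression coefficient receive a small penalty and tend to stay in the propensity score model, while covariates unrelated to the outcome are penalized heavily. A real analysis would tune `lam` and `gamma` from the data and would typically use a dedicated quantile regression solver instead of the IRLS approximation.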