Department of Statistics, Kansas State University, Manhattan, Kansas, USA.
Department of Biostatistics and Health Data Sciences, Indiana University School of Medicine, Indianapolis, Indiana, USA.
Stat Med. 2024 Nov 20;43(26):4928-4983. doi: 10.1002/sim.10196. Epub 2024 Sep 11.
Data irregularity, in the form of outliers and heavy-tailed distributions in complex traits, has been widely observed in cancer genomics studies. In the past decade, robust variable selection methods have emerged as powerful alternatives to nonrobust ones for identifying important genes associated with heterogeneous disease traits and building superior predictive models. In this study, to retain the remarkable features of the quantile LASSO and fully Bayesian regularized quantile regression while overcoming their disadvantage in the analysis of high-dimensional genomics data, we propose the spike-and-slab quantile LASSO through a fully Bayesian spike-and-slab formulation under a robust likelihood based on the asymmetric Laplace distribution (ALD). The proposed robust method inherits the prominent properties of selective shrinkage and self-adaptivity to the sparsity pattern from the spike-and-slab LASSO (Ročková and George, J Am Stat Assoc, 2018, 113(521): 431-444). Furthermore, the spike-and-slab quantile LASSO has the computational advantage of locating posterior modes via Expectation-Maximization (EM) steps guided by a soft-thresholding rule in the coordinate descent framework, a property rarely observed for robust regularization with nondifferentiable loss functions. We have conducted comprehensive simulation studies with a variety of heavy-tailed errors in both homogeneous and heterogeneous model settings to demonstrate the superiority of the spike-and-slab quantile LASSO over competing methods. The advantage of the proposed method is further demonstrated in case studies of lung adenocarcinoma (LUAD) and skin cutaneous melanoma (SKCM) data from The Cancer Genome Atlas (TCGA).
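For readers unfamiliar with the building blocks named in the abstract, the following is a minimal sketch of three standard ingredients: the quantile check loss, the negative log-density of the ALD (whose kernel is the check loss, which is why an ALD likelihood yields quantile regression), and the soft-thresholding operator used in coordinate descent for LASSO-type penalties. It is not the authors' algorithm; the function names and the parameters `tau`, `mu`, `sigma`, and `lam` are hypothetical choices made for illustration.

```python
import math


def check_loss(u, tau):
    """Quantile check loss: rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def ald_neg_log_density(y, mu, sigma, tau):
    """Negative log-density of the asymmetric Laplace distribution.

    Uses the standard form
        f(y) = tau * (1 - tau) / sigma * exp(-rho_tau((y - mu) / sigma)),
    so that, for fixed sigma, maximizing the ALD likelihood in mu is
    equivalent to minimizing the check loss.
    """
    return check_loss((y - mu) / sigma, tau) - math.log(tau * (1.0 - tau) / sigma)


def soft_threshold(z, lam):
    """Soft-thresholding operator: sign(z) * max(|z| - lam, 0)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


if __name__ == "__main__":
    # Positive and negative residuals are penalized asymmetrically when tau != 0.5.
    print(check_loss(2.0, 0.25), check_loss(-2.0, 0.25))
    # Small coefficients are set exactly to zero; larger ones are shrunk toward zero.
    print(soft_threshold(3.0, 1.0), soft_threshold(-0.5, 1.0))
```

With `tau = 0.5` the check loss reduces to half the absolute error, which corresponds to median regression; other values of `tau` target other conditional quantiles.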