Hong Sungyeon, Kim Yongkang, Park Taesung
Department of Statistics, Seoul National University, Seoul, South Korea.
Department of Statistics, Seoul National University, Seoul, South Korea; Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, South Korea.
Cancer Inform. 2015 Jan 14;13(Suppl 7):55-65. doi: 10.4137/CIN.S16350. eCollection 2014.
Variable selection methods play an important role in high-dimensional statistical modeling and analysis. Computational cost and estimation accuracy are the two main concerns in statistical inference from ultrahigh-dimensional data. Genome-wide association studies (GWAS), which aim to identify single nucleotide polymorphisms (SNPs) associated with a disease of interest, are a prominent source of such data, and numerous methods have been proposed to handle them. Most of these methods adopt a two-stage approach: pre-screening for dimension reduction, followed by variable selection to identify causal SNPs. The pre-screening step ranks SNPs by their P-values or by the absolute values of their regression coefficients in single-SNP analysis and keeps the top-ranked ones. Penalized regressions, such as the ridge, lasso, adaptive lasso, and elastic-net regressions, are commonly used for the variable selection step. In this paper, we investigate which combination of pre-screening method and penalized regression performs best on a quantitative phenotype, using two real GWAS datasets.
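The two-stage approach described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the data are simulated rather than taken from the GWAS datasets, and the function names (`marginal_coefs`, `prescreen`, `lasso_cd`), the number of SNPs kept, and the penalty `lam` are all assumptions chosen for illustration. Stage one ranks SNPs by the absolute value of the single-SNP regression coefficient; stage two fits a lasso to the retained SNPs by cyclic coordinate descent.

```python
import random

def marginal_coefs(X, y):
    """Slope of y regressed on each single column of X (single-SNP analysis)."""
    n, p = len(X), len(X[0])
    y_mean = sum(y) / n
    coefs = []
    for j in range(p):
        col = [row[j] for row in X]
        x_mean = sum(col) / n
        sxx = sum((v - x_mean) ** 2 for v in col)
        sxy = sum((v - x_mean) * (t - y_mean) for v, t in zip(col, y))
        coefs.append(sxy / sxx if sxx > 0 else 0.0)
    return coefs

def prescreen(X, y, keep):
    """Indices of the `keep` columns with the largest |marginal coefficient|."""
    coefs = marginal_coefs(X, y)
    ranked = sorted(range(len(coefs)), key=lambda j: -abs(coefs[j]))
    return sorted(ranked[:keep])

def soft_threshold(z, t):
    """Soft-thresholding operator used by the lasso coordinate update."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_cd(X, y, lam, n_iter=200):
    """Lasso via cyclic coordinate descent on centered data.

    Minimizes (1 / (2n)) * ||y - X b||^2 + lam * ||b||_1 (intercept handled
    by centering). Returns the coefficient list b.
    """
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - means[j] for j in range(p)] for row in X]
    y_mean = sum(y) / n
    resid = [t - y_mean for t in y]          # residual for b = 0
    col_ss = [sum(Xc[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            if col_ss[j] == 0:
                continue
            # Covariance of column j with the partial residual (column j excluded).
            rho = sum(Xc[i][j] * resid[i] for i in range(n)) / n + col_ss[j] * beta[j]
            new = soft_threshold(rho, lam) / col_ss[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * Xc[i][j]
                beta[j] = new
    return beta

# Simulated example (hypothetical data): genotypes coded 0/1/2, three causal SNPs.
rng = random.Random(0)
n, p = 200, 200
causal = {3: 2.0, 50: -2.0, 120: 2.0}       # SNP index -> assumed effect size
X = [[rng.choice([0, 1, 2]) for _ in range(p)] for _ in range(n)]
y = [sum(b * row[j] for j, b in causal.items()) + rng.gauss(0, 0.5) for row in X]

kept = prescreen(X, y, keep=20)              # stage 1: dimension reduction
X_kept = [[row[j] for j in kept] for row in X]
beta = lasso_cd(X_kept, y, lam=0.2)          # stage 2: penalized regression
selected = {j: b for j, b in zip(kept, beta) if b != 0.0}
print(sorted(selected))
```

Swapping the ranking criterion in `prescreen` for single-SNP P-values, or the lasso penalty for a ridge, adaptive lasso, or elastic-net penalty, would give the other combinations the abstract mentions; a real analysis would also choose `keep` and `lam` by cross-validation rather than fixing them.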