Department of Statistics, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA.
Genet Epidemiol. 2010 Apr;34(3):275-85. doi: 10.1002/gepi.20459.
Epistasis could be an important source of risk for disease. How interacting loci might be discovered is an open question for genome-wide association studies (GWAS). Most researchers limit their statistical analyses to testing individual pairwise interactions (i.e., marginal tests for association). A more effective means of identifying important predictors is to fit models that include many predictors simultaneously (i.e., higher-dimensional models). We explore a procedure called screen and clean (SC), which identifies liability loci, including interactions, using the lasso, a model selection tool for high-dimensional regression. We approach the problem with a varying dictionary of terms that are candidates for inclusion in the model. In the first step, the lasso dictionary includes only main effects, and the most promising single-nucleotide polymorphisms (SNPs) are identified using a screening procedure. Next, the lasso dictionary is adjusted to include these main effects and the corresponding interaction terms, and promising terms are again identified using lasso screening. Finally, significant terms are identified through the cleaning process. Implementation of SC for GWAS requires algorithms to explore the complex model space induced by the many SNPs genotyped and their interactions. We propose and explore a set of algorithms and find that SC successfully controls Type I error while yielding good power to identify risk loci and their interactions. When the method is applied to data obtained from the Wellcome Trust Case Control Consortium study of Type 1 Diabetes, it uncovers evidence supporting interaction within the HLA class II region as well as within chromosome 12q24.
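The two screening steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a quantitative phenotype, genotypes coded 0/1/2, and a fixed lasso penalty `lam` (the abstract does not say how the penalty is chosen), and it fits the lasso with a small hand-written coordinate descent routine. The names `lasso`, `screen`, and `lam` are hypothetical, and the cleaning step is not shown. The toy data are a full factorial design over four SNPs in which the phenotype depends only on the product of SNPs 0 and 1.

```python
from itertools import combinations, product


def center(col):
    """Subtract the mean so that no intercept term is needed."""
    mean = sum(col) / len(col)
    return [v - mean for v in col]


def soft_threshold(rho, lam):
    """Soft-thresholding operator used in the lasso coordinate update."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso(columns, y, lam, sweeps=200):
    """Cyclic coordinate descent for (1/2n)*||y - Xb||^2 + lam*||b||_1."""
    n = len(y)
    beta = [0.0] * len(columns)
    resid = list(y)
    for _ in range(sweeps):
        for j, x in enumerate(columns):
            z = sum(v * v for v in x) / n
            if z == 0.0:
                continue  # constant column carries no information
            # Covariance of column j with the residual that excludes term j.
            rho = sum(xi * (ri + xi * beta[j]) for xi, ri in zip(x, resid)) / n
            new = soft_threshold(rho, lam) / z
            resid = [ri - xi * (new - beta[j]) for xi, ri in zip(x, resid)]
            beta[j] = new
    return beta


def screen(genotypes, phenotype, lam):
    """Two-stage lasso screening with a varying dictionary (assumed form)."""
    n_snps = len(genotypes[0])
    y = center(phenotype)

    def snp_column(j):
        return [row[j] for row in genotypes]

    # Stage 1: the dictionary holds main effects only.
    beta = lasso([center(snp_column(j)) for j in range(n_snps)], y, lam)
    kept = [j for j, b in enumerate(beta) if b != 0.0]

    # Stage 2: retained main effects plus their pairwise interactions.
    terms = [(j,) for j in kept] + list(combinations(kept, 2))
    columns = []
    for term in terms:
        col = snp_column(term[0])
        if len(term) == 2:
            col = [a * b for a, b in zip(col, snp_column(term[1]))]
        columns.append(center(col))
    beta = lasso(columns, y, lam)
    selected = [t for t, b in zip(terms, beta) if b != 0.0]
    return kept, selected


# Toy data: every genotype combination of four SNPs, noise-free phenotype.
genotypes = list(product(range(3), repeat=4))
phenotype = [2.0 * g[0] * g[1] for g in genotypes]

kept, selected = screen(genotypes, phenotype, lam=0.1)
print(kept)      # SNPs that survive the main-effects screen
print(selected)  # terms that survive the second screen
```

In this toy design, SNPs 0 and 1 pass the first screen because their product induces marginal effects, and only the interaction term `(0, 1)` survives the second screen. Restricting the second dictionary to interactions among retained SNPs keeps the number of candidate terms far below the number of all SNP pairs, which is the point of the varying dictionary.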