Schwarz Daniel F, Szymczak Silke, Ziegler Andreas, König Inke R
Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Ratzeburger Allee 160, 23538 Lübeck, Germany.
BMC Proc. 2007;1 Suppl 1(Suppl 1):S59. doi: 10.1186/1753-6561-1-s1-s59. Epub 2007 Dec 18.
With the development of high-throughput single-nucleotide polymorphism (SNP) technologies, the vast number of SNPs in smaller samples poses a challenge to the application of classical statistical procedures. A possible solution is to use a two-stage approach for case-control data in which, in the first stage, a screening test selects a small number of SNPs for further analysis. The second stage then estimates the effects of the selected variables using logistic regression (logReg). Here, we introduce a novel approach in which the selection of SNPs is based on the permutation importance estimated by random forests (RFs). For this, we used the simulated data provided for the Genetic Analysis Workshop 15 without knowledge of the true model. The data set was randomly split into a first and a second data set. In the first stage, RFs were grown to pre-select the 37 most important variables, and these were reduced to 32 variables by haplotype tagging. In the second stage, we estimated parameters using logReg. The highest effect estimates were obtained for five simulated loci. We detected smoking, gender, and the parental DR alleles as covariates. After correction for multiple testing, we identified two out of four genes simulated with a direct effect on rheumatoid arthritis risk, as well as all covariates, without any false positives. We showed that a two-stage approach with screening of SNPs by RFs is suitable for detecting candidate SNPs in genome-wide association studies for complex diseases.
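The first-stage screening described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy data (15 SNPs coded 0/1/2, with only SNP 0 affecting risk), the shallow trees, and every name and parameter value (`grow_tree`, `permutation_importance`, `mtry`, `top_k`, and so on) are assumptions chosen for illustration. The data are split at random into two halves, a small forest is grown on bootstrap samples of the first half, and each SNP is scored by the mean drop in out-of-bag accuracy after its genotypes are permuted. Haplotype tagging and the second-stage logistic regression on the held-out half are not shown.

```python
import random

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def grow_tree(X, y, rows, mtry, depth, rng):
    """Grow a shallow classification tree; a leaf is the majority class (0 or 1)."""
    labels = [y[i] for i in rows]
    majority = int(2 * sum(labels) >= len(labels))
    if depth == 0 or len(set(labels)) < 2:
        return majority
    best = None
    for j in rng.sample(range(len(X[0])), mtry):   # random subset of SNPs
        for t in (0, 1):                           # genotype split: <= 0 or <= 1
            left = [i for i in rows if X[i][j] <= t]
            right = [i for i in rows if X[i][j] > t]
            if not left or not right:
                continue
            score = (len(left) * gini([y[i] for i in left])
                     + len(right) * gini([y[i] for i in right]))
            if best is None or score < best[0]:
                best = (score, j, t, left, right)
    if best is None:
        return majority
    _, j, t, left, right = best
    return (j, t,
            grow_tree(X, y, left, mtry, depth - 1, rng),
            grow_tree(X, y, right, mtry, depth - 1, rng))

def predict(tree, x):
    """Follow the splits until a leaf (an int) is reached."""
    while isinstance(tree, tuple):
        j, t, low, high = tree
        tree = low if x[j] <= t else high
    return tree

def used_features(tree):
    """Set of SNP indices that appear in any split of the tree."""
    if not isinstance(tree, tuple):
        return set()
    j, _, low, high = tree
    return {j} | used_features(low) | used_features(high)

def permutation_importance(X, y, n_trees=100, mtry=4, depth=2, seed=0):
    """Mean decrease in out-of-bag accuracy after permuting each SNP."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    importance = [0.0] * p
    for _ in range(n_trees):
        boot = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        oob = sorted(set(range(n)) - set(boot))       # out-of-bag rows
        if not oob:
            continue
        tree = grow_tree(X, y, boot, mtry, depth, rng)
        base = sum(predict(tree, X[i]) == y[i] for i in oob)
        # SNPs not used by this tree have zero accuracy change, so skip them.
        for j in sorted(used_features(tree)):
            perm = [X[i][j] for i in oob]
            rng.shuffle(perm)
            hits = 0
            for i, v in zip(oob, perm):
                x = list(X[i])
                x[j] = v
                hits += predict(tree, x) == y[i]
            importance[j] += (base - hits) / len(oob)
    return [v / n_trees for v in importance]

# Toy case-control data (assumption): only SNP 0 influences disease risk.
rng = random.Random(1)
n, p, maf = 600, 15, 0.4
X = [[(rng.random() < maf) + (rng.random() < maf) for _ in range(p)]
     for _ in range(n)]
y = [int(rng.random() < 0.1 + 0.35 * row[0]) for row in X]

# Random split: the first half is for screening, the second for estimation.
idx = list(range(n))
rng.shuffle(idx)
stage1, stage2 = idx[: n // 2], idx[n // 2:]
X1 = [X[i] for i in stage1]
y1 = [y[i] for i in stage1]

importance = permutation_importance(X1, y1)
top_k = 3
selected = sorted(range(p), key=lambda j: importance[j], reverse=True)[:top_k]
print("SNPs passed to stage 2:", selected)
```

In a real analysis the rows indexed by `stage2` would be used only for the logistic regression on the selected SNPs, so that effect estimates and multiple-testing corrections are not biased by the selection step.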