Glaser Beate, Nikolov Ivan, Chubb Daniel, Hamshere Marian L, Segurado Ricardo, Moskvina Valentina, Holmans Peter
Biostatistics and Bioinformatics Unit, and Department of Psychological Medicine, Cardiff University, School of Medicine, Heath Park, Cardiff, Wales, CF14 4XN, UK.
BMC Proc. 2007;1 Suppl 1(Suppl 1):S54. doi: 10.1186/1753-6561-1-s1-s54. Epub 2007 Dec 18.
Using parametric and nonparametric techniques, we investigated single-locus and pairwise effects among 20 markers of the Genetic Analysis Workshop 15 (GAW15) North American Rheumatoid Arthritis Consortium (NARAC) candidate gene data set (Problem 2), analyzing 463 independent patients and 855 controls. Specifically, we examined the correspondence between logistic regression (LR) analysis of single-locus and pairwise interaction effects and random forest (RF) single and joint importance measures. For this comparison, we selected small but stable RFs (500 trees), whose importance measures correlated strongly (r~0.98) with those of RFs grown with 5000 trees. Both RF importance measures captured most of the LR single-locus and pairwise interaction effects, while joint importance measures also corresponded to full LR models containing main and interaction effects. We furthermore showed that RF measures were particularly sensitive to data imputation. The most consistent pairwise effect on rheumatoid arthritis was found between two markers within MAP3K7IP2/SUMO4 on 6q25.1, although LR and RFs assigned different significance levels. Within a hypothetical two-stage design, pairwise LR analysis of all markers with significant RF single importance would have reduced the number of possible combinations in our small data set by 61%, whereas joint importance measures would have been less efficient for marker pair reduction. This suggests that RF single importance measures, which are able to detect a wide range of interaction effects and are computationally very efficient, might be exploited as a pre-screening tool for larger association studies. Follow-up analysis, such as by LR, is required since RFs do not indicate high-risk genotype combinations.
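The arithmetic behind a two-stage pair reduction can be sketched as follows. This is an illustration only, not the authors' procedure: the function name `pair_reduction` and the value of `n_retained` are hypothetical, since the abstract does not state how many markers passed the RF single-importance screen. The sketch assumes that stage two tests every pair among the retained markers.

```python
from math import comb


def pair_reduction(n_markers: int, n_retained: int) -> float:
    """Fraction of marker pairs that no longer need testing when only
    n_retained of n_markers go forward to pairwise analysis."""
    if not 0 <= n_retained <= n_markers:
        raise ValueError("n_retained must lie between 0 and n_markers")
    total_pairs = comb(n_markers, 2)
    if total_pairs == 0:
        raise ValueError("at least two markers are needed to form a pair")
    kept_pairs = comb(n_retained, 2)
    return 1 - kept_pairs / total_pairs


# All unordered pairs among the 20 markers named in the abstract.
total_pairs = comb(20, 2)  # 190

# Hypothetical screen outcome (assumption, not a value from the study).
n_retained = 10
reduction = pair_reduction(20, n_retained)
print(f"{total_pairs} pairs in total, {reduction:.0%} removed by the screen")
```

Because the number of pairs grows quadratically with the number of markers, even a modest single-marker screen removes a large share of pairwise tests, which is why such a pre-screen becomes more attractive as studies grow.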