Nicodemus Kristin K, Wang Wenyi, Shugart Yin Yao
Wellcome Trust Centre for Human Genetics, Roosevelt Drive, Oxford, OX3 7BN, UK.
BMC Proc. 2007;1 Suppl 1(Suppl 1):S58. doi: 10.1186/1753-6561-1-s1-s58. Epub 2007 Dec 18.
Risk of complex disorders is thought to be multifactorial, involving interactions between risk factors. However, because the search space of all possible interactions is high-dimensional, many genetic studies assess association between disease status and markers one single-nucleotide polymorphism (SNP) at a time. Three ensemble methods have recently been proposed for use with high-dimensional data: Monte Carlo logic regression, random forests, and generalized boosted regression. An intuitive way to detect an association between genetic markers and disease status is to use variable importance measures, although the stability of these measures in the context of a whole-genome association study is unknown. Using the simulated data of Problem 3 in the Genetic Analysis Workshop 15 (GAW15), we examined the variability of both the rankings and the magnitudes of variable importance measures for 10 variables simulated to participate in gene × gene and gene × environment interactions. We conducted 500 analyses per method on one randomly selected replicate, tallying the rankings and importance measures for each of the 10 variables of interest. When the simulated effect size was strong, all three methods showed stable rankings and estimates of variable importance. However, under conditions more commonly expected in complex diseases, random forests and generalized boosted regression showed more stable estimates of variable importance and variable rankings than Monte Carlo logic regression. Researchers applying statistical learning methods to detect interactions in complex disease studies should perform repeated analyses to ensure that variable importance measures and rankings do not vary greatly, even for statistical learning algorithms that are thought to be stable.
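The tallying step described above (repeat a stochastic analysis many times, then record each variable's rank and importance) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the variable names, the effect sizes, and the noise level are all hypothetical, and `toy_importance` is only a stand-in for one fit of a real ensemble method such as a random forest.

```python
import random
import statistics
from collections import Counter


def rank_variables(importances):
    """Map each variable to its rank within one run (1 = most important)."""
    ordered = sorted(importances, key=importances.get, reverse=True)
    return {name: pos for pos, name in enumerate(ordered, start=1)}


def summarize_stability(runs):
    """Tally ranks and importance values across repeated analyses.

    runs: list of {variable: importance} dicts, one per repeated analysis.
    """
    ranks_per_run = [rank_variables(run) for run in runs]
    summary = {}
    for name in runs[0]:
        values = [run[name] for run in runs]
        ranks = [ranks[name] for ranks in ranks_per_run]
        summary[name] = {
            "mean_importance": statistics.mean(values),
            "sd_importance": statistics.pstdev(values),
            "rank_counts": Counter(ranks),
            "rank_range": (min(ranks), max(ranks)),
        }
    return summary


def toy_importance(seed, effects, noise=0.05):
    """Hypothetical stand-in for one stochastic fit: true effect plus Gaussian noise."""
    rng = random.Random(seed)
    return {name: effect + rng.gauss(0.0, noise) for name, effect in effects.items()}


# Assumed toy effect sizes: one strong variable and two weak, similar ones.
effects = {"strong_snp": 1.0, "weak_snp_a": 0.10, "weak_snp_b": 0.08}

# 500 repeated analyses, mirroring the number of repeats used in the study.
runs = [toy_importance(seed, effects) for seed in range(500)]
summary = summarize_stability(runs)

for name, stats in summary.items():
    print(name, stats["rank_range"], dict(stats["rank_counts"]))
```

In this toy setting the strongly associated variable keeps the same rank in every run, while the two weakly associated variables swap ranks from run to run. With a real method, `toy_importance` would be replaced by a call that refits the model under a new random seed and returns its variable importance measures.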