Department of Health Sciences Research, Mayo Clinic, 200 First Street Southwest, Rochester, MN 55905, USA.
BMC Bioinformatics. 2012 Jul 15;13:164. doi: 10.1186/1471-2105-13-164.
Identifying variants associated with complex human traits in high-dimensional data is a central goal of genome-wide association studies. However, the univariate analyses usually applied in these studies ignore complicated etiologies such as gene-gene interactions. Random Forests (RF) are a popular data-mining technique that can accommodate a large number of predictor variables and allow for complex models with interactions. RF analysis produces measures of variable importance that can be used to rank the predictor variables. Thus, single nucleotide polymorphism (SNP) analysis using RF is gaining popularity as a potential filter approach that considers interactions in high-dimensional data. However, the impact of data dimensionality on the power of RF to identify interactions has not been thoroughly explored. We investigate the ability of rankings from variable importance measures to detect gene-gene interaction effects, and their potential effectiveness as filters compared to p-values from univariate logistic regression, particularly as the data become increasingly high-dimensional.
RF effectively identifies interactions in low-dimensional data. As the total number of predictor variables increases, the probability of detection declines more rapidly for interacting SNPs than for non-interacting SNPs, indicating that in high-dimensional data the RF variable importance measures capture marginal effects rather than the effects of interactions.
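As a rough illustration of how a "probability of detection" under a ranking filter could be computed, the sketch below counts how often a SNP lands among the top k predictors across simulated replicates. The top-k definition of detection, the function names, and the scores are all assumptions made for this example; the abstract does not state the exact criterion the authors used.

```python
from typing import Dict, List


def top_k_filter(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k predictors with the highest scores (higher = more important)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


def detection_probability(
    replicates: List[Dict[str, float]], snp: str, k: int
) -> float:
    """Fraction of replicates in which `snp` passes the top-k filter."""
    hits = sum(snp in top_k_filter(scores, k) for scores in replicates)
    return hits / len(replicates)


# Hypothetical importance scores from three simulated replicates.
# These numbers are invented for illustration, not taken from the paper.
replicates = [
    {"snp_a": 0.9, "snp_b": 0.2, "snp_c": 0.5},
    {"snp_a": 0.8, "snp_b": 0.6, "snp_c": 0.1},
    {"snp_a": 0.3, "snp_b": 0.7, "snp_c": 0.4},
]

for name in ("snp_a", "snp_b", "snp_c"):
    print(name, detection_probability(replicates, name, k=1))
```

The same functions could rank by univariate p-values instead by passing negated p-values as scores, so that smaller p-values rank higher.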
While RF remains a promising data-mining technique that extends univariate methods to condition on multiple variables simultaneously, RF variable importance measures fail to detect interaction effects in high-dimensional data in the absence of a strong marginal component, and therefore may not be useful as a filter technique that allows for interaction effects in genome-wide data.
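To make "interaction effects in the absence of a strong marginal component" concrete, the sketch below evaluates a two-SNP penetrance table in which disease risk depends on the genotype combination while each SNP, viewed alone, shows the same risk for every genotype. The table, the minor allele frequency of 0.5, and the Hardy-Weinberg assumption are hypothetical choices for illustration only and are not the simulation models used in the paper.

```python
from typing import List

# Hypothetical penetrance table: P(disease | genotype at SNP 1, genotype at SNP 2).
# Rows index SNP 1 genotypes (0, 1, 2 minor alleles); columns index SNP 2 genotypes.
PENETRANCE = [
    [0.0, 0.1, 0.0],
    [0.1, 0.0, 0.1],
    [0.0, 0.1, 0.0],
]


def genotype_freqs(maf: float) -> List[float]:
    """Genotype frequencies for 0, 1, 2 minor alleles under Hardy-Weinberg equilibrium."""
    q = maf
    p = 1.0 - q
    return [p * p, 2.0 * p * q, q * q]


def marginal_penetrance(table: List[List[float]], maf_other: float) -> List[float]:
    """Disease risk per SNP 1 genotype, averaged over the other SNP's genotypes."""
    freqs = genotype_freqs(maf_other)
    return [sum(f * pen for f, pen in zip(freqs, row)) for row in table]


marginals = marginal_penetrance(PENETRANCE, maf_other=0.5)
print(marginals)  # all three values are equal up to floating-point rounding
```

Because the marginal risks are flat, a univariate test on either SNP has nothing to detect under this model, even though the joint genotype strongly determines risk.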