Statistical Genetics, Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK.
BMC Bioinformatics. 2010 Feb 27;11:110. doi: 10.1186/1471-2105-11-110.
Random forests (RF) are increasingly used in applications such as genome-wide association and microarray studies, where predictor correlation is frequently observed. Recent work on permutation-based variable importance measures (VIMs) used in RF has come to apparently contradictory conclusions. We present an extended simulation study to synthesize these results.
When predictors were both correlated and associated with the outcome (HA), the unconditional RF VIM attributed a higher share of importance to correlated predictors, while under the null hypothesis that no predictors are associated with the outcome (H0) the unconditional RF VIM was unbiased. Conditional VIMs showed lower values for correlated predictors than the unconditional VIMs under HA and were unbiased under H0. Scaled VIMs were clearly biased under both HA and H0.
Unconditional unscaled VIMs are a computationally tractable choice for large datasets and are unbiased under the null hypothesis. Whether the observed increase in VIMs for correlated predictors should be considered a "bias" (because the VIMs do not directly reflect the coefficients in the generating model) or a beneficial attribute of these VIMs depends on the application. For example, in genetic association studies, where correlation between markers may help to localize the functionally relevant variant, the increased importance of correlated predictors may be an advantage. On the other hand, we show examples where this increased importance may result in spurious signals.
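As a rough illustration of the unconditional, unscaled permutation VIM discussed above, the following Python sketch permutes one predictor at a time and records the mean increase in prediction error. It is not the study's simulation code. The function names (`mse`, `permutation_vim`, `toy_predict`), the toy data, and the fixed linear predictor that stands in for a fitted random forest are all assumptions made for this example. The stand-in model deliberately splits its weight between two correlated predictors, mimicking how a forest may split on either one.

```python
import random
from statistics import mean


def mse(predict, X, y):
    """Mean squared error of a prediction function on rows X and targets y."""
    return mean((predict(row) - target) ** 2 for row, target in zip(X, y))


def permutation_vim(predict, X, y, column, n_repeats=20, seed=0):
    """Unconditional, unscaled permutation importance (illustrative sketch).

    Shuffles one predictor column across all observations, which breaks its
    link with the outcome and with the other predictors, and returns the
    mean increase in error over n_repeats shuffles.
    """
    rng = random.Random(seed)
    baseline = mse(predict, X, y)
    increases = []
    for _ in range(n_repeats):
        shuffled = [row[column] for row in X]
        rng.shuffle(shuffled)
        X_perm = [
            row[:column] + [value] + row[column + 1:]
            for row, value in zip(X, shuffled)
        ]
        increases.append(mse(predict, X_perm, y) - baseline)
    return mean(increases)


# Toy data (assumed for illustration): x1 is correlated with x0,
# x2 is independent noise, and only x0 enters the generating model.
data_rng = random.Random(1)
X = []
for _ in range(500):
    x0 = data_rng.gauss(0, 1)
    x1 = x0 + data_rng.gauss(0, 0.3)
    x2 = data_rng.gauss(0, 1)
    X.append([x0, x1, x2])
y = [row[0] for row in X]


def toy_predict(row):
    """Hypothetical stand-in for a fitted forest: uses x0 and x1, ignores x2."""
    return 0.5 * row[0] + 0.5 * row[1]


vims = [permutation_vim(toy_predict, X, y, column) for column in range(3)]
print(vims)
```

In this toy setting the correlated predictor x1 receives a clearly positive importance even though it has no coefficient in the generating model, while the unused predictor x2 receives none. Whether that counts as a bias or as a useful signal depends on the application, as noted above.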