Zhang Heping, Wang Minghui
Department of Epidemiology and Public Health, Yale University School of Medicine, New Haven, CT 06520-8034, USA
Stat Interface. 2009 Jan 1;2(3):381. doi: 10.4310/sii.2009.v2.n3.a11.
Random forests have emerged as one of the most commonly used nonparametric statistical methods in many scientific areas, particularly in the analysis of high-throughput genomic data. A general practice in using random forests is to generate a sufficiently large number of trees, although how large is sufficient remains subjective. Furthermore, random forests are viewed as "black boxes" because of their sheer size. In this work, we address a fundamental issue in the use of random forests: how large does a random forest have to be? To this end, we propose a specific method to find a sub-forest (e.g., with a single-digit number of trees) that can achieve the prediction accuracy of a large random forest (on the order of thousands of trees). We tested it in extensive simulation studies and in a real study on the prognosis of breast cancer. The results show that such sub-forests usually exist and that most of them are very small, suggesting that they are in effect "representatives" of the whole random forest. We conclude that the sub-forests are indeed the core of a random forest. Thus, it is not necessary to use the whole forest to achieve satisfactory prediction performance. Also, once a random forest is reduced to a manageable size, it is no longer a black box.
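The abstract does not spell out the selection procedure, so the following is only a minimal sketch of one plausible way to search for a small sub-forest: greedy backward elimination on majority-vote accuracy. It should not be read as the authors' method. The function names `vote_accuracy` and `shrink_forest`, the `tolerance` parameter, and the simulated "trees" (independent binary classifiers that are each correct with probability 0.7) are all assumptions made for illustration.

```python
import random


def vote_accuracy(tree_preds, y, keep):
    """Accuracy of the majority vote over the trees indexed by `keep`.

    tree_preds[t][i] is the 0/1 prediction of tree t on sample i.
    Ties are resolved in favor of class 0.
    """
    correct = 0
    for i, label in enumerate(y):
        ones = sum(tree_preds[t][i] for t in keep)
        vote = 1 if 2 * ones > len(keep) else 0
        correct += int(vote == label)
    return correct / len(y)


def shrink_forest(tree_preds, y, tolerance=0.0):
    """Greedy backward elimination (an assumed heuristic, not the paper's).

    At each step, drop the tree whose removal leaves the highest accuracy.
    Return the smallest sub-forest seen along the way whose accuracy is
    within `tolerance` of the full forest, plus the full-forest accuracy.
    """
    keep = list(range(len(tree_preds)))
    full_acc = vote_accuracy(tree_preds, y, keep)
    best = list(keep)
    while len(keep) > 1:
        # Score every candidate removal and apply the least harmful one.
        scored = [
            (vote_accuracy(tree_preds, y, [t for t in keep if t != drop]), drop)
            for drop in keep
        ]
        acc, drop = max(scored)
        keep.remove(drop)
        if acc >= full_acc - tolerance:
            best = list(keep)
    return best, full_acc


# Simulated data: hypothetical stand-ins for real tree predictions.
rng = random.Random(0)
n_samples, n_trees = 200, 25
y = [rng.randint(0, 1) for _ in range(n_samples)]
tree_preds = [
    [label if rng.random() < 0.7 else 1 - label for label in y]
    for _ in range(n_trees)
]

sub_forest, full_acc = shrink_forest(tree_preds, y, tolerance=0.0)
sub_acc = vote_accuracy(tree_preds, y, sub_forest)
print(f"full forest: {n_trees} trees, accuracy {full_acc:.3f}")
print(f"sub-forest:  {len(sub_forest)} trees, accuracy {sub_acc:.3f}")
```

In this sketch the same samples are used both to choose the trees and to report accuracy, which makes the sub-forest look better than it would on new data. A fairer comparison would select the sub-forest on one held-out set and evaluate it on another.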