Search for the smallest random forest.

Authors

Zhang Heping, Wang Minghui

Affiliation

Department of Epidemiology and Public Health, Yale University School of Medicine, New Haven, CT 06520-8034, USA.

Publication information

Stat Interface. 2009 Jan 1;2(3):381. doi: 10.4310/sii.2009.v2.n3.a11.

Abstract

Random forests have emerged as one of the most commonly used nonparametric statistical methods in many scientific areas, particularly in the analysis of high-throughput genomic data. A general practice in using random forests is to generate a sufficiently large number of trees, although how large is sufficient is a subjective judgment. Furthermore, random forests are viewed as "black boxes" because of their sheer size. In this work, we address a fundamental issue in the use of random forests: how large does a random forest have to be? To this end, we propose a specific method to find a sub-forest (e.g., with a single-digit number of trees) that can achieve the prediction accuracy of a large random forest (on the order of thousands of trees). We tested it in extensive simulation studies and in a real study on the prognosis of breast cancer. The results show that such sub-forests usually exist and that most of them are very small, suggesting that they are in fact the "representatives" of the whole random forest. We conclude that the sub-forests are indeed the core of a random forest. Thus, it is not necessary to use the whole forest to obtain satisfactory prediction performance. Also, once a random forest is reduced to a manageable size, it is no longer a black box.
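The abstract does not spell out the search procedure, so the following is only a minimal sketch of one plausible way to look for a small sub-forest: greedy backward elimination, dropping at each step the tree whose removal hurts majority-vote accuracy the least. Everything in it is an assumption made for illustration, including the function names `vote_accuracy` and `shrink_forest`, the `tolerance` parameter, and the simulated "trees" (each tree is just a vector of 0/1 predictions that is correct about 70% of the time). It is not the authors' implementation.

```python
import random


def vote_accuracy(tree_preds, labels):
    """Accuracy of the majority vote over per-tree 0/1 prediction vectors."""
    n_trees = len(tree_preds)
    correct = 0
    for i, y in enumerate(labels):
        votes = sum(p[i] for p in tree_preds)
        pred = 1 if 2 * votes > n_trees else 0  # ties go to class 0
        correct += (pred == y)
    return correct / len(labels)


def shrink_forest(tree_preds, labels, tolerance=0.0):
    """Greedy backward elimination (an illustrative assumption, not the paper's method).

    Repeatedly drops the tree whose removal leaves the highest vote accuracy,
    and returns the indices of the smallest sub-forest seen whose accuracy is
    within `tolerance` of the full forest, together with the full-forest accuracy.
    """
    full_acc = vote_accuracy(tree_preds, labels)
    kept = list(range(len(tree_preds)))
    best = list(kept)
    while len(kept) > 1:
        # Score every candidate removal on the remaining trees.
        scored = []
        for t in kept:
            rest = [tree_preds[k] for k in kept if k != t]
            scored.append((vote_accuracy(rest, labels), t))
        acc, drop = max(scored, key=lambda s: s[0])
        kept.remove(drop)
        if acc >= full_acc - tolerance:
            best = list(kept)
    return best, full_acc


# Simulated stand-in for a trained forest: 30 "trees", 200 samples.
rng = random.Random(0)
labels = [rng.randint(0, 1) for _ in range(200)]
forest = [[y if rng.random() < 0.7 else 1 - y for y in labels] for _ in range(30)]

sub, full_acc = shrink_forest(forest, labels)
sub_acc = vote_accuracy([forest[k] for k in sub], labels)
print(f"full forest: {len(forest)} trees, accuracy {full_acc:.3f}")
print(f"sub-forest:  {len(sub)} trees, accuracy {sub_acc:.3f}")
```

In this sketch, accuracy is measured on the same samples that guide the elimination, which makes the result optimistic. A more careful version would score the candidate sub-forests on out-of-bag or held-out data.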
