Cutler D Richard, Edwards Thomas C, Beard Karen H, Cutler Adele, Hess Kyle T, Gibson Jacob, Lawler Joshua J
Department of Mathematics and Statistics, Utah State University, Logan, Utah 84322-3900, USA.
Ecology. 2007 Nov;88(11):2783-92. doi: 10.1890/07-0539.1.
Classification procedures are among the most widely used statistical methods in ecology. Random forests (RF) is a new and powerful statistical classifier that is well established in other disciplines but relatively unknown in ecology. Advantages of RF over other statistical classifiers include (1) very high classification accuracy; (2) a novel method of determining variable importance; (3) the ability to model complex interactions among predictor variables; (4) the flexibility to perform several types of statistical data analysis, including regression, classification, survival analysis, and unsupervised learning; and (5) an algorithm for imputing missing values. We compared the accuracies of RF and four other commonly used statistical classifiers on three data sets: invasive plant species presence in Lava Beds National Monument, California, USA; rare lichen species presence in the Pacific Northwest, USA; and nest sites of cavity-nesting birds in the Uinta Mountains, Utah, USA. In all three applications, we observed high classification accuracy for RF compared with the other common classification methods, as measured by cross-validation and, for the lichen data, by independent test data. The variables that RF identified as most important for classifying invasive plant species also coincided with expectations based on the literature.
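The accuracy comparison described in the abstract rests on cross-validation: each classifier is fitted on part of the data and scored on the held-out remainder. The following is a minimal sketch of k-fold cross-validated accuracy in plain Python, not the authors' code. The data are synthetic presence/absence records, and the two classifiers (a majority-class baseline and a 1-nearest-neighbour rule) are hypothetical stand-ins chosen to keep the example short; they are not RF or any of the four methods used in the study. All function and variable names are assumptions made for illustration.

```python
import random
from collections import Counter


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cv_accuracy(fit, X, y, k=5, seed=0):
    """Mean held-out accuracy over k folds.

    `fit(X_train, y_train)` must return a function predict(x) -> label.
    """
    scores = []
    for test in k_fold_indices(len(X), k, seed):
        held_out = set(test)
        train = [i for i in range(len(X)) if i not in held_out]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(X[i]) == y[i] for i in test)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)


def fit_majority(X_train, y_train):
    """Baseline: always predict the most common training label."""
    label = Counter(y_train).most_common(1)[0][0]
    return lambda x: label


def fit_nearest_neighbour(X_train, y_train):
    """1-nearest-neighbour rule using squared Euclidean distance."""
    def predict(x):
        dists = [sum((a - b) ** 2 for a, b in zip(x, row)) for row in X_train]
        return y_train[dists.index(min(dists))]
    return predict


# Synthetic presence (1) / absence (0) data with two predictors.
# Assumption for illustration only: the classes are balanced and well separated.
rng = random.Random(42)
X, y = [], []
for i in range(100):
    label = i % 2
    centre = 3.0 if label else 0.0
    X.append([rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)])
    y.append(label)

acc_majority = cv_accuracy(fit_majority, X, y)
acc_nn = cv_accuracy(fit_nearest_neighbour, X, y)
print(f"majority baseline: {acc_majority:.2f}")
print(f"1-nearest neighbour: {acc_nn:.2f}")
```

Because every classifier is scored on the same folds, the resulting accuracies are directly comparable. Swapping in another method only requires a new `fit` function with the same signature.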