Manilich Elena A, Özsoyoğlu Z Meral, Trubachev Valeriy, Radivoyevitch Tomas
Computer Science Department, Case Western Reserve University, Cleveland, Ohio 44106, USA.
J Bioinform Comput Biol. 2011 Apr;9(2):251-67. doi: 10.1142/s021972001100546x.
Random forest is an ensemble classification algorithm. It performs well when most predictor variables are noisy, and it can be used when the number of variables is much larger than the number of observations. Its use of bootstrap samples and restricted subsets of attributes makes it more powerful than simple ensembles of trees. The main advantage of a random forest classifier is its explanatory power: it measures variable importance, that is, the impact of each factor on the predicted class label. These characteristics make the algorithm well suited to microarray data, and it has been shown to build highly accurate models when tested on high-dimensional microarray datasets. Current implementations of random forest in the machine learning and statistics communities, however, limit its usability for mining large datasets because they require that the entire dataset remain in memory. We propose a new framework, an optimized implementation of a random forest classifier, that addresses specific properties of microarray data, takes the computational complexity of the decision tree algorithm into consideration, and shows excellent computing performance while preserving predictive accuracy. The implementation is based on reducing overlapping computations and eliminating the dependency on the size of main memory. Its computational performance makes the algorithm useful for interactive data analysis and data mining.
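The ingredients named in the abstract (bootstrap samples, a restricted random subset of attributes at each split, and a variable importance measure) can be illustrated with a minimal sketch. This is a generic, in-memory toy version of the standard algorithm, not the optimized implementation the paper proposes. The function names, the Gini-based importance measure, the depth limit, and the synthetic data (20 observations, 30 variables, only variable 0 informative) are all assumptions made for illustration.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y, features):
    """Best (feature, threshold, gain) among the candidate features, or None."""
    best, parent = None, gini(y)
    for f in features:
        for t in sorted({row[f] for row in X}):
            left = [y[i] for i, row in enumerate(X) if row[f] <= t]
            right = [y[i] for i, row in enumerate(X) if row[f] > t]
            if not left or not right:
                continue
            child = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or parent - child > best[2]:
                best = (f, t, parent - child)
    return best

def build_tree(X, y, mtry, rng, importance, depth=0, max_depth=5):
    """Grow one tree; each node only considers `mtry` randomly chosen features."""
    if len(set(y)) == 1 or depth == max_depth:
        return Counter(y).most_common(1)[0][0]          # leaf: majority label
    split = best_split(X, y, rng.sample(range(len(X[0])), mtry))
    if split is None:
        return Counter(y).most_common(1)[0][0]
    f, t, gain = split
    importance[f] += gain * len(y)                       # weighted impurity decrease
    li = [i for i, row in enumerate(X) if row[f] <= t]
    ri = [i for i, row in enumerate(X) if row[f] > t]
    left = build_tree([X[i] for i in li], [y[i] for i in li],
                      mtry, rng, importance, depth + 1, max_depth)
    right = build_tree([X[i] for i in ri], [y[i] for i in ri],
                       mtry, rng, importance, depth + 1, max_depth)
    return (f, t, left, right)

def predict_tree(node, row):
    """Follow splits until a leaf (a plain label, not a tuple) is reached."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node

def fit_forest(X, y, n_trees=50, mtry=10, seed=0):
    """Train each tree on a bootstrap sample; accumulate feature importance."""
    rng = random.Random(seed)
    importance = [0.0] * len(X[0])
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]  # sample with replacement
        trees.append(build_tree([X[i] for i in idx], [y[i] for i in idx],
                                mtry, rng, importance))
    return trees, importance

def predict_forest(trees, row):
    """Majority vote over all trees."""
    return Counter(predict_tree(t, row) for t in trees).most_common(1)[0][0]

def make_toy_data(n=20, p=30, seed=1):
    """Hypothetical data: more variables than observations, only variable 0 informative."""
    rng = random.Random(seed)
    y = [i % 2 for i in range(n)]
    X = [[label + rng.uniform(-0.3, 0.3)] + [rng.random() for _ in range(p - 1)]
         for label in y]
    return X, y

X, y = make_toy_data()
trees, importance = fit_forest(X, y)
top = sorted(range(len(importance)), key=lambda f: -importance[f])[:3]
print("highest-ranked variables:", top)
```

Because every tree here holds its bootstrap sample as in-memory lists, the sketch has exactly the memory dependence the abstract describes as a limitation of existing implementations. It is meant only to show how the sampling and importance bookkeeping fit together.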