Classification of large microarray datasets using fast random forest construction.

Author information

Manilich Elena A, Özsoyoğlu Z Meral, Trubachev Valeriy, Radivoyevitch Tomas

Affiliation

Computer Science Department, Case Western Reserve University, Cleveland, Ohio 44106, USA.

Publication information

J Bioinform Comput Biol. 2011 Apr;9(2):251-67. doi: 10.1142/s021972001100546x.

Abstract

Random forest is an ensemble classification algorithm. It performs well when most predictive variables are noisy and can be used when the number of variables is much larger than the number of observations. The use of bootstrap samples and restricted subsets of attributes makes it more powerful than simple ensembles of trees. The main advantage of a random forest classifier is its explanatory power: it measures variable importance or impact of each factor on a predicted class label. These characteristics make the algorithm ideal for microarray data. It was shown to build models with high accuracy when tested on high-dimensional microarray datasets. Current implementations of random forest in the machine learning and statistics community, however, limit its usability for mining over large datasets, as they require that the entire dataset remains permanently in memory. We propose a new framework, an optimized implementation of a random forest classifier, which addresses specific properties of microarray data, takes computational complexity of a decision tree algorithm into consideration, and shows excellent computing performance while preserving predictive accuracy. The implementation is based on reducing overlapping computations and eliminating dependency on the size of main memory. The implementation's excellent computational performance makes the algorithm useful for interactive data analyses and data mining.
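The two ingredients the abstract names, bootstrap samples and restricted attribute subsets, together with the variable-importance measure, can be illustrated with a toy sketch. This is not the optimized implementation the authors propose. It assumes depth-1 trees (stumps), Gini-decrease importance, and synthetic data with one informative variable among 100. All names (`make_data`, `best_stump`, `fit_forest`, `mtry`) are hypothetical and chosen for illustration only.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def make_data(n=40, p=100, seed=1):
    """Synthetic 'microarray-like' data: p >> n, only variable 0 is informative (assumption)."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        row = [rng.gauss(0.0, 1.0) for _ in range(p)]
        row[0] = rng.gauss(4.0 * label, 1.0)  # class means 0 and 4 on variable 0
        X.append(row)
        y.append(label)
    return X, y


def best_stump(X, y, rows, features):
    """Return (gini_decrease, feature, threshold, left_label, right_label) or None."""
    parent = gini([y[i] for i in rows])
    best = None
    for f in features:
        values = sorted({X[i][f] for i in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [y[i] for i in rows if X[i][f] <= t]
            right = [y[i] for i in rows if X[i][f] > t]
            if not left or not right:
                continue
            child = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            decrease = parent - child
            if best is None or decrease > best[0]:
                best = (decrease, f, t,
                        Counter(left).most_common(1)[0][0],
                        Counter(right).most_common(1)[0][0])
    return best


def fit_forest(X, y, n_trees=200, mtry=None, seed=0):
    """Grow stumps on bootstrap samples, each restricted to mtry random variables."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    mtry = mtry or max(1, int(p ** 0.5))
    forest, importance = [], [0.0] * p
    for _ in range(n_trees):
        rows = [rng.randrange(n) for _ in range(n)]   # bootstrap sample (with replacement)
        features = rng.sample(range(p), mtry)         # restricted attribute subset
        stump = best_stump(X, y, rows, features)
        if stump is None:
            continue
        decrease, f, t, left_label, right_label = stump
        forest.append((f, t, left_label, right_label))
        importance[f] += decrease                     # accumulate Gini decrease per variable
    return forest, importance


def predict(forest, x):
    """Majority vote over all stumps."""
    votes = Counter(l if x[f] <= t else r for f, t, l, r in forest)
    return votes.most_common(1)[0][0]


X, y = make_data()
forest, importance = fit_forest(X, y)
top = max(range(len(importance)), key=importance.__getitem__)
print("most important variable:", top)
```

On this synthetic data the informative variable should rank first by accumulated Gini decrease, even though it is outnumbered 99 to 1 by noise variables. Note that the sketch holds the whole dataset in memory, which is exactly the limitation of existing implementations that the abstract describes.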
