Classification of large microarray datasets using fast random forest construction.

Author information

Manilich Elena A, Özsoyoğlu Z Meral, Trubachev Valeriy, Radivoyevitch Tomas

Affiliation

Computer Science Department, Case Western Reserve University, Cleveland, Ohio 44106, USA.

Publication information

J Bioinform Comput Biol. 2011 Apr;9(2):251-67. doi: 10.1142/s021972001100546x.

DOI:10.1142/s021972001100546x

PMID:21523931

Abstract

Random forest is an ensemble classification algorithm. It performs well when most predictive variables are noisy and can be used when the number of variables is much larger than the number of observations. The use of bootstrap samples and restricted subsets of attributes makes it more powerful than simple ensembles of trees. The main advantage of a random forest classifier is its explanatory power: it measures variable importance or impact of each factor on a predicted class label. These characteristics make the algorithm ideal for microarray data. It was shown to build models with high accuracy when tested on high-dimensional microarray datasets. Current implementations of random forest in the machine learning and statistics community, however, limit its usability for mining over large datasets, as they require that the entire dataset remains permanently in memory. We propose a new framework, an optimized implementation of a random forest classifier, which addresses specific properties of microarray data, takes computational complexity of a decision tree algorithm into consideration, and shows excellent computing performance while preserving predictive accuracy. The implementation is based on reducing overlapping computations and eliminating dependency on the size of main memory. The implementation's excellent computational performance makes the algorithm useful for interactive data analyses and data mining.
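The two ingredients named in the abstract, bootstrap samples and restricted subsets of attributes, can be illustrated with a minimal sketch. It is not the authors' optimized implementation: each tree is reduced to a single-split stump, all function names are hypothetical, and counting how often a variable is chosen for a split is only a simple stand-in for the variable importance measure the abstract mentions.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_stump(X, y, features):
    """Return (impurity, feature, threshold) with the lowest weighted Gini, or None."""
    best = None
    n = len(y)
    for f in features:
        values = sorted(set(row[f] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0  # candidate threshold between adjacent values
            left = [y[i] for i in range(n) if X[i][f] <= t]
            right = [y[i] for i in range(n) if X[i][f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, f, t)
    return best


def majority(labels, default):
    """Most common label, or `default` for an empty list."""
    return Counter(labels).most_common(1)[0][0] if labels else default


def train_forest(X, y, n_trees=50, n_features=None, seed=0):
    """Train stumps, each on a bootstrap sample and a random attribute subset."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    m = n_features or max(1, int(p ** 0.5))  # assumed default: sqrt(p) attributes
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        Xb = [X[i] for i in idx]
        yb = [y[i] for i in idx]
        feats = rng.sample(range(p), m)  # restricted subset of attributes
        default = majority(yb, None)
        split = best_stump(Xb, yb, feats)
        if split is None:  # every sampled attribute is constant in this sample
            forest.append((None, None, default, default))
            continue
        _, f, t = split
        left = majority([yb[i] for i in range(n) if Xb[i][f] <= t], default)
        right = majority([yb[i] for i in range(n) if Xb[i][f] > t], default)
        forest.append((f, t, left, right))
    return forest


def predict(forest, row):
    """Majority vote over all stumps."""
    votes = []
    for f, t, left, right in forest:
        votes.append(left if f is None or row[f] <= t else right)
    return Counter(votes).most_common(1)[0][0]


def split_counts(forest, p):
    """How often each variable was chosen for a split (a crude importance proxy)."""
    counts = [0] * p
    for f, _, _, _ in forest:
        if f is not None:
            counts[f] += 1
    return counts


if __name__ == "__main__":
    # Synthetic data with far more variables (200) than observations (40);
    # only variable 0 carries signal, the rest are noise.
    rng = random.Random(1)
    X, y = [], []
    for i in range(40):
        label = i % 2
        row = [rng.gauss(0, 1) for _ in range(200)]
        row[0] = 3.0 * label + rng.gauss(0, 0.3)
        X.append(row)
        y.append(label)
    forest = train_forest(X, y, n_trees=200)
    counts = split_counts(forest, 200)
    print("most frequently split variable:", counts.index(max(counts)))
```

Because every stump has one node, drawing the attribute subset once per tree here coincides with drawing it once per node, which is how full random forests restrict attributes. Note that this sketch keeps the whole dataset in memory, which is exactly the limitation of existing implementations that the paper sets out to remove.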

Similar articles

Classification of large microarray datasets using fast random forest construction.

J Bioinform Comput Biol. 2011 Apr;9(2):251-67. doi: 10.1142/s021972001100546x.

Rotation of random forests for genomic and proteomic classification problems.

Adv Exp Med Biol. 2011;696:211-21. doi: 10.1007/978-1-4419-7046-6_21.

Empirical study of seven data mining algorithms on different characteristics of datasets for biomedical classification applications.

Biomed Eng Online. 2017 Nov 2;16(1):125. doi: 10.1186/s12938-017-0416-x.

An efficient approach for feature construction of high-dimensional microarray data by random projections.

PLoS One. 2018 Apr 27;13(4):e0196385. doi: 10.1371/journal.pone.0196385. eCollection 2018.

Multi-test decision tree and its application to microarray data classification.

Artif Intell Med. 2014 May;61(1):35-44. doi: 10.1016/j.artmed.2014.01.005. Epub 2014 Feb 10.

Random forests ensemble classifier trained with data resampling strategy to improve cardiac arrhythmia diagnosis.

Comput Biol Med. 2011 May;41(5):265-71. doi: 10.1016/j.compbiomed.2011.03.001. Epub 2011 Mar 17.

Robustness of Random Forest-based gene selection methods.

BMC Bioinformatics. 2014 Jan 13;15:8. doi: 10.1186/1471-2105-15-8.

Comprehensive decision tree models in bioinformatics.

PLoS One. 2012;7(3):e33812. doi: 10.1371/journal.pone.0033812. Epub 2012 Mar 30.

Ensemble Feature Learning of Genomic Data Using Support Vector Machine.

PLoS One. 2016 Jun 15;11(6):e0157330. doi: 10.1371/journal.pone.0157330. eCollection 2016.

Ensemble of sparse classifiers for high-dimensional biological data.

Int J Data Min Bioinform. 2015;12(2):167-83. doi: 10.1504/ijdmb.2015.069416.

Cited by

A protease activity-based machine-learning approach as a complementary tool for conventional diagnosis of diarrhea-predominant irritable bowel syndrome.

Front Microbiol. 2023 Jul 7;14:1179534. doi: 10.3389/fmicb.2023.1179534. eCollection 2023.

Rapid, High-Throughput Single-Cell Multiplex In Situ Tagging (MIST) Analysis of Immunological Disease with Machine Learning.

Anal Chem. 2023 May 16;95(19):7779-7787. doi: 10.1021/acs.analchem.3c01157. Epub 2023 May 4.

Use of random forest to estimate population attributable fractions from a case-control study of Salmonella enterica serotype Enteritidis infections.

Epidemiol Infect. 2015 Oct;143(13):2786-94. doi: 10.1017/S095026881500014X. Epub 2015 Feb 12.

Parallel classification and feature selection in microarray data using SPRINT.

Concurr Comput. 2014 Mar 25;26(4):854-865. doi: 10.1002/cpe.2928.

Isoform-level gene signature improves prognostic stratification and accurately classifies glioblastoma subtypes.

Nucleic Acids Res. 2014 Apr;42(8):e64. doi: 10.1093/nar/gku121. Epub 2014 Feb 6.
