CloudForest: A Scalable and Efficient Random Forest Implementation for Biological Data.

Authors

Bressler Ryan, Kreisberg Richard B, Bernard Brady, Niederhuber John E, Vockley Joseph G, Shmulevich Ilya, Knijnenburg Theo A

Affiliations

Institute for Systems Biology, Seattle, WA, United States of America.

Inova Translational Medicine Institute, Inova Health System and Inova Fairfax Medical Center, Falls Church, VA, United States of America.

Publication details

PLoS One. 2015 Dec 17;10(12):e0144820. doi: 10.1371/journal.pone.0144820. eCollection 2015.

DOI:10.1371/journal.pone.0144820

PMID:26679347

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4692062/

Abstract

Random Forest has become a standard data analysis tool in computational biology. However, extensions to existing implementations are often necessary to handle the complexity of biological datasets and their associated research questions. The growing size of these datasets requires high-performance implementations. We describe CloudForest, a Random Forest package written in Go, which is particularly well suited for large, heterogeneous genetic and biomedical datasets. CloudForest includes several extensions, such as handling of unbalanced classes and missing values. Its flexible design enables users to easily implement additional extensions. CloudForest achieves fast running times through effective use of the CPU cache, optimization for different classes of features, and efficient multi-threading. CloudForest is available at https://github.com/ilyalab/CloudForest.
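The multi-threading mentioned in the abstract relies on a general property of Random Forest: each tree is grown on its own bootstrap sample, so trees can be built independently and in parallel. The sketch below illustrates only that structure and is not CloudForest's code or API. All names (`fit_stump`, `fit_forest`, `predict`) are hypothetical, the trees are reduced to single-split stumps for brevity, and Python threads are used purely to show the shape of the computation (CPython's global interpreter lock limits the real speedup, which is one reason a language with native threads such as Go is attractive here).

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def majority(values):
    """Return the most common label in a non-empty list."""
    return Counter(values).most_common(1)[0][0]


def fit_stump(rows, labels, rng):
    """Fit a one-split tree (a stump) on a bootstrap sample of the data."""
    n = len(rows)
    idx = [rng.randrange(n) for _ in range(n)]  # bootstrap: sample with replacement
    feature = rng.randrange(len(rows[0]))       # one randomly chosen candidate feature
    best = None
    for t in sorted({rows[i][feature] for i in idx}):
        left = [labels[i] for i in idx if rows[i][feature] <= t]
        right = [labels[i] for i in idx if rows[i][feature] > t]
        if not left or not right:
            continue
        # Errors made if each side predicts its own majority label.
        errors = (len(left) - Counter(left).most_common(1)[0][1]
                  + len(right) - Counter(right).most_common(1)[0][1])
        if best is None or errors < best[0]:
            best = (errors, t, majority(left), majority(right))
    if best is None:
        # No valid split in this sample: always predict its majority label.
        m = majority([labels[i] for i in idx])
        return (feature, float("inf"), m, m)
    return (feature, best[1], best[2], best[3])


def fit_forest(rows, labels, n_trees=25, n_workers=4, seed=0):
    """Grow n_trees stumps independently, distributing them over worker threads."""
    # One RNG per tree keeps the result independent of thread scheduling.
    rngs = [random.Random(seed + k) for k in range(n_trees)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(lambda r: fit_stump(rows, labels, r), rngs))


def predict(forest, row):
    """Classify a row by majority vote over all stumps."""
    votes = Counter(left if row[f] <= t else right for f, t, left, right in forest)
    return votes.most_common(1)[0][0]
```

Because every tree receives its own seeded random generator and `pool.map` preserves input order, the forest is the same whether it is grown on one worker or several, which makes the parallel version easy to check against a sequential run.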

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0bf9/4692062/f59b95f64b47/pone.0144820.g001.jpg
