pyRforest: a comprehensive R package for genomic data analysis featuring scikit-learn Random Forests in R.

Author information

Kolisnik Tyler, Keshavarz-Rahaghi Faeze, Purcell Rachel V, Smith Adam N H, Silander Olin K

Affiliations

School of Mathematical and Computational Sciences, Massey University, Auckland, 0632, New Zealand.

Canada's Michael Smith Genome Sciences Centre at BC Cancer, Vancouver, British Columbia, V5Z 4S6, Canada.

Publication information

Brief Funct Genomics. 2025 Jan 15;24. doi: 10.1093/bfgp/elae038.

DOI:10.1093/bfgp/elae038

PMID:39373492

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11735746/

Abstract

Random Forest models are widely used in genomic data analysis and can offer insights into complex biological mechanisms, particularly when features influence the target in interactive, nonlinear, or nonadditive ways. Currently, some of the most efficient Random Forest methods in terms of computational speed are implemented in Python. However, many biologists use R for genomic data analysis, as R offers a unified platform for performing additional statistical analysis and visualization. Here, we present an R package, pyRforest, which integrates Python scikit-learn "RandomForestClassifier" algorithms into the R environment. pyRforest inherits the efficient memory management and parallelization of Python, and is optimized for classification tasks on large genomic datasets, such as those from RNA-seq. pyRforest offers several additional capabilities, including a novel rank-based permutation method for biomarker identification. This method can be used to estimate and visualize P-values for individual features, allowing the researcher to identify a subset of features for which there is robust statistical evidence of an effect. In addition, pyRforest includes methods for the calculation and visualization of SHapley Additive exPlanations values. Finally, pyRforest includes support for comprehensive downstream analysis for gene ontology and pathway enrichment. pyRforest thus improves the implementation and interpretability of Random Forest models for genomic data analysis by merging the strengths of Python with R. pyRforest can be downloaded at: https://www.github.com/tkolisnik/pyRforest with an associated vignette at https://github.com/tkolisnik/pyRforest/blob/main/vignettes/pyRforest-vignette.pdf.
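The rank-based permutation idea mentioned in the abstract can be illustrated with a minimal Python sketch built directly on scikit-learn's `RandomForestClassifier`. This is not pyRforest's API or the authors' exact procedure: the function names `sorted_importances` and `rank_permutation_pvalues`, the synthetic dataset, and all parameter values are assumptions chosen for illustration. The sketch fits a forest on the real labels, refits on shuffled labels to build a null distribution of importances at each rank, and reports an empirical P-value per rank.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def sorted_importances(X, y, seed):
    """Fit a forest and return its feature importances sorted high to low."""
    forest = RandomForestClassifier(n_estimators=50, n_jobs=-1, random_state=seed)
    forest.fit(X, y)
    return np.sort(forest.feature_importances_)[::-1]


def rank_permutation_pvalues(X, y, n_permutations=20, seed=0):
    """Empirical P-value for the importance observed at each rank (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    observed = sorted_importances(X, y, seed)

    # Null distribution: shuffling the labels breaks any feature-label association.
    null = np.empty((n_permutations, X.shape[1]))
    for i in range(n_permutations):
        y_permuted = rng.permutation(y)
        null[i] = sorted_importances(X, y_permuted, seed + i + 1)

    # Compare rank k of the real model with rank k of every permuted model.
    # The +1 terms keep the estimate away from an exact zero.
    exceed = (null >= observed).sum(axis=0)
    return observed, (exceed + 1) / (n_permutations + 1)


# Synthetic stand-in for an expression matrix: 200 samples, 30 features,
# of which 3 are informative (illustrative values only).
X, y = make_classification(
    n_samples=200, n_features=30, n_informative=3, n_redundant=0, random_state=0
)
observed, pvalues = rank_permutation_pvalues(X, y)
for rank, (importance, p) in enumerate(zip(observed[:5], pvalues[:5]), start=1):
    print(f"rank {rank}: importance={importance:.3f}, p={p:.3f}")
```

In this sketch the P-values attach to ranks rather than to named features, and the smallest attainable value is 1 / (n_permutations + 1), so a real analysis would use many more permutations than the 20 used here.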
