Integrating biological knowledge and gene expression data using pathway-guided random forests: a benchmarking study.

Authors

Seifert Stephan, Gundlach Sven, Junge Olaf, Szymczak Silke

Affiliation

Institute of Medical Informatics and Statistics, Kiel University, University Hospital Schleswig-Holstein, Kiel 24105, Germany.

Publication information

Bioinformatics. 2020 Aug 1;36(15):4301-4308. doi: 10.1093/bioinformatics/btaa483.

DOI:10.1093/bioinformatics/btaa483

PMID:32399562

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7520048/

Abstract

MOTIVATION

High-throughput technologies allow comprehensive characterization of individuals on many molecular levels. However, training computational models to predict disease status based on omics data is challenging. A promising solution is the integration of external knowledge about structural and functional relationships into the modeling process. We compared four published random forest-based approaches using two simulation studies and nine experimental datasets.

RESULTS

The self-sufficient prediction error approach should be applied when large numbers of relevant pathways are expected. The competing methods hunting and learner of functional enrichment should be used when low numbers of relevant pathways are expected or the most strongly associated pathways are of interest. The hybrid approach synthetic features is not recommended because of its high false discovery rate.

AVAILABILITY AND IMPLEMENTATION

An R package providing functions for data analysis and simulation is available on GitHub (https://github.com/szymczak-lab/PathwayGuidedRF). An accompanying R data package (https://github.com/szymczak-lab/DataPathwayGuidedRF) stores the processed and quality-controlled experimental datasets downloaded from Gene Expression Omnibus (GEO).

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
