Integrating biological knowledge and gene expression data using pathway-guided random forests: a benchmarking study.

Authors

Seifert Stephan, Gundlach Sven, Junge Olaf, Szymczak Silke

Affiliation

Institute of Medical Informatics and Statistics, Kiel University, University Hospital Schleswig-Holstein, Kiel 24105, Germany.

Publication

Bioinformatics. 2020 Aug 1;36(15):4301-4308. doi: 10.1093/bioinformatics/btaa483.

Abstract

MOTIVATION

High-throughput technologies allow comprehensive characterization of individuals on many molecular levels. However, training computational models to predict disease status based on omics data is challenging. A promising solution is the integration of external knowledge about structural and functional relationships into the modeling process. We compared four published random forest-based approaches using two simulation studies and nine experimental datasets.

RESULTS

The self-sufficient "prediction error" approach should be applied when large numbers of relevant pathways are expected. The competing methods "hunting" and "learner of functional enrichment" should be used when low numbers of relevant pathways are expected or when the most strongly associated pathways are of interest. The hybrid approach "synthetic features" is not recommended because of its high false discovery rate.

AVAILABILITY AND IMPLEMENTATION

An R package providing functions for data analysis and simulation is available at GitHub (https://github.com/szymczak-lab/PathwayGuidedRF). An accompanying R data package (https://github.com/szymczak-lab/DataPathwayGuidedRF) stores the processed and quality controlled experimental datasets downloaded from Gene Expression Omnibus (GEO).

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
