Seifert Stephan, Gundlach Sven, Junge Olaf, Szymczak Silke
Institute of Medical Informatics and Statistics, Kiel University, University Hospital Schleswig-Holstein, Kiel 24105, Germany.
Bioinformatics. 2020 Aug 1;36(15):4301-4308. doi: 10.1093/bioinformatics/btaa483.
High-throughput technologies allow comprehensive characterization of individuals on many molecular levels. However, training computational models to predict disease status based on omics data is challenging. A promising solution is the integration of external knowledge about structural and functional relationships into the modeling process. We compared four published random forest-based approaches using two simulation studies and nine experimental datasets.
The self-sufficient prediction error approach should be applied when large numbers of relevant pathways are expected. The two competing methods, hunting and learner of functional enrichment, should be used when low numbers of relevant pathways are expected or when only the most strongly associated pathways are of interest. The hybrid approach, synthetic features, is not recommended because of its high false discovery rate.
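The false discovery rate referred to above can be illustrated for a single simulation replicate, where the truly relevant pathways are known by construction. The following is a minimal sketch, not the evaluation code of the study: the function name `selection_metrics`, the pathway identifiers, and the choice to report sensitivity alongside the false discovery rate are assumptions made for illustration.

```python
def selection_metrics(selected, truly_relevant):
    """Return (empirical FDR, sensitivity) for one simulation replicate.

    selected       -- pathways a method reported as relevant
    truly_relevant -- pathways simulated to be associated with the outcome
    """
    selected = set(selected)
    truly_relevant = set(truly_relevant)

    true_pos = len(selected & truly_relevant)   # correctly reported pathways
    false_pos = len(selected - truly_relevant)  # reported but not relevant

    # Convention assumed here: an empty selection has an FDR of 0.
    fdr = false_pos / len(selected) if selected else 0.0
    sensitivity = true_pos / len(truly_relevant) if truly_relevant else 0.0
    return fdr, sensitivity


# Hypothetical pathway identifiers, for illustration only.
truth = {"pw01", "pw02", "pw03"}
picked = ["pw01", "pw02", "pw07", "pw09"]

fdr, sens = selection_metrics(picked, truth)
print(f"FDR={fdr:.2f}, sensitivity={sens:.2f}")
```

In this toy replicate, two of the four reported pathways are false positives, so the empirical false discovery rate is 0.5. A method with a consistently high value across replicates would be flagged in the same way as the synthetic features approach above.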
An R package providing functions for data analysis and simulation is available on GitHub (https://github.com/szymczak-lab/PathwayGuidedRF). An accompanying R data package (https://github.com/szymczak-lab/DataPathwayGuidedRF) stores the processed and quality-controlled experimental datasets downloaded from Gene Expression Omnibus (GEO).
Supplementary data are available at Bioinformatics online.