Department of Biostatistics and Bioinformatics, Duke University School of Medicine, Durham, NC 27710, USA.
Bioinformatics. 2010 Jan 15;26(2):250-8. doi: 10.1093/bioinformatics/btp640. Epub 2009 Nov 18.
Pathway-based methods for genomic data analysis have attracted great interest in the research community. Although machine learning methods such as random forests have been developed to correlate survival outcomes with a set of genes, no study has assessed how well these methods incorporate pathway information when analyzing microarray data. In general, genes identified without incorporating biological knowledge are more difficult to interpret. Correlating pathway-based gene expression with survival outcomes may yield more biologically meaningful prognostic biomarkers. A comprehensive study of how these methods perform in a pathway-based setting is therefore warranted.
In this article, we describe a pathway-based method that uses random forests to correlate gene expression data with survival outcomes, and we introduce a novel bivariate node-splitting random survival forest. The proposed method allows researchers to identify pathways that are important for predicting patient prognosis and time to disease progression, and to discover important genes within those pathways. We compared different implementations of random forests with different split criteria and found that the bivariate node-splitting random survival forest with the log-rank test is among the best. We also performed simulation studies showing that random forests outperform several other machine learning algorithms and give results comparable to those of a newly developed component-wise Cox boosting model. Thus, pathway-based survival analysis using machine learning tools is a promising approach for dissecting pathways and for generating new biological hypotheses from microarray studies.
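As an illustration of the log-rank split criterion mentioned above, the following minimal Python sketch scores candidate splits of a single gene's expression values by the standardized two-sample log-rank statistic. It is not the Pwayrfsurvival implementation and does not reproduce the bivariate node-splitting rule; the function names `logrank_statistic` and `best_split` and the toy data are assumptions made for this example only.

```python
from math import sqrt


def logrank_statistic(times, events, in_left):
    """Standardized two-sample log-rank statistic for a left/right split.

    times:   follow-up time of each sample
    events:  1 if the event was observed, 0 if the sample was censored
    in_left: True if the sample is assigned to the left daughter node
    """
    observed_minus_expected = 0.0
    variance = 0.0
    # Only distinct times at which at least one event occurred contribute.
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    for t in event_times:
        at_risk = [i for i, ti in enumerate(times) if ti >= t]
        n = len(at_risk)
        n_left = sum(1 for i in at_risk if in_left[i])
        died = [i for i in at_risk if times[i] == t and events[i] == 1]
        d = len(died)
        d_left = sum(1 for i in died if in_left[i])
        # Observed minus expected events in the left node at time t.
        observed_minus_expected += d_left - d * n_left / n
        if n > 1:
            # Hypergeometric variance of d_left under the null hypothesis.
            p = n_left / n
            variance += d * p * (1 - p) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0
    return observed_minus_expected / sqrt(variance)


def best_split(values, times, events):
    """Return (threshold, |statistic|) for the split `value <= threshold`
    that maximizes the absolute log-rank statistic."""
    best = (None, 0.0)
    # The largest value is skipped: it would leave the right node empty.
    for threshold in sorted(set(values))[:-1]:
        in_left = [v <= threshold for v in values]
        stat = abs(logrank_statistic(times, events, in_left))
        if stat > best[1]:
            best = (threshold, stat)
    return best


# Toy data (hypothetical): low expression goes with short survival.
expression = [0.1, 0.2, 0.3, 0.9, 1.0, 1.1]
survival_time = [2, 3, 4, 10, 12, 15]
event_observed = [1, 1, 1, 1, 1, 1]
threshold, score = best_split(expression, survival_time, event_observed)
```

In a survival tree, a search of this kind would be repeated over a random subset of the genes in a pathway at each node, and the gene and threshold with the largest statistic would define the split.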
The R package Pwayrfsurvival is available at http://www.duke.edu/~hp44/pwayrfsurvival.htm.
Supplementary data are available at Bioinformatics online.