Tian Xin, Xin Mingyuan, Luo Jian, Liu Mingyao, Jiang Zhenran
1 Shanghai Key Laboratory of Regulatory Biology, Institute of Biomedical Sciences and School of Life Sciences, East China Normal University, Shanghai, China.
2 Shanghai Key Laboratory of Multidimensional Information Processing, Department of Computer Science and Technology, East China Normal University, Shanghai, China.
J Comput Biol. 2017 Feb;24(2):172-182. doi: 10.1089/cmb.2015.0206. Epub 2016 Aug 10.
Selecting genes relevant to breast cancer metastasis is critical for the treatment and prognosis of cancer patients. Although much effort has been devoted to gene selection procedures using different statistical analysis methods or computational techniques, the variables in the resulting survival models have so far remained difficult to interpret. This article proposes a new Random Forest (RF)-based algorithm to identify important variables closely related to breast cancer metastasis. The algorithm is based on the importance scores of two variable selection algorithms: the mean decrease Gini (MDG) criterion of Random Forest and the GeneRank algorithm with protein-protein interaction (PPI) information. We call the new gene selection algorithm PPIRF. The improved prediction accuracy supports the reliability and high interpretability of the gene list selected by the PPIRF approach.
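The abstract names the two importance scores that PPIRF draws on but does not say how they are combined. The following is a minimal sketch of one plausible combination, not the authors' implementation. It assumes the MDG scores have already been computed by a random forest (the `mdg` dictionary is a hypothetical input), uses the standard GeneRank fixed-point iteration r = (1 - d) * ex + d * W * D^-1 * r on an undirected PPI graph, and merges the two min-max normalized scores with a weight `alpha`. The weighted sum, `alpha`, the function names, and the toy gene labels are all assumptions made for illustration.

```python
def generank(expr, edges, d=0.85, tol=1e-10, max_iter=1000):
    """GeneRank by fixed-point iteration on an undirected PPI graph.

    expr  : dict gene -> non-negative expression-based score (ex)
    edges : iterable of (gene_a, gene_b) interaction pairs
    d     : weight given to the network relative to expression
    """
    genes = list(expr)
    neighbors = {g: set() for g in genes}
    for a, b in edges:
        # Ignore self-loops and genes without an expression score.
        if a in neighbors and b in neighbors and a != b:
            neighbors[a].add(b)
            neighbors[b].add(a)

    rank = dict(expr)
    for _ in range(max_iter):
        new = {}
        for g in genes:
            # Each neighbor passes on its rank divided by its own degree.
            spread = sum(rank[n] / len(neighbors[n]) for n in neighbors[g])
            new[g] = (1 - d) * expr[g] + d * spread
        delta = max(abs(new[g] - rank[g]) for g in genes)
        rank = new
        if delta < tol:
            break
    return rank


def min_max(scores):
    """Rescale a dict of scores to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {g: (s - lo) / span if span else 0.0 for g, s in scores.items()}


def combined_ranking(mdg, expr, edges, alpha=0.5, d=0.85):
    """Rank genes by a weighted sum of normalized MDG and GeneRank scores.

    The weighted sum is an assumption of this sketch; every gene in
    `mdg` must also have an entry in `expr`.
    """
    gr = min_max(generank(expr, edges, d=d))
    md = min_max(mdg)
    score = {g: alpha * md[g] + (1 - alpha) * gr[g] for g in mdg}
    return sorted(score, key=score.get, reverse=True)


if __name__ == "__main__":
    # Toy data with hypothetical gene labels, for illustration only.
    expr = {"G1": 1.0, "G2": 0.2, "G3": 0.1, "G4": 0.5}
    edges = [("G1", "G2"), ("G2", "G3"), ("G3", "G4")]
    mdg = {"G1": 0.4, "G2": 0.3, "G3": 0.1, "G4": 0.2}
    print(combined_ranking(mdg, expr, edges))
```

Setting `alpha=1.0` reduces the ranking to the MDG order alone, and `alpha=0.0` to the GeneRank order alone, which makes it easy to see how much the PPI information changes the selected gene list.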