Chen Xi, Wang Lily
Department of Quantitative Health Sciences, The Cleveland Clinic, Cleveland, OH 44195, USA.
J Comput Biol. 2009 Feb;16(2):265-78. doi: 10.1089/cmb.2008.12TT.
Because survival times vary widely between cancer patients and most genes on a microarray are unrelated to outcome, building prediction models that are both accurate and easy to interpret remains a challenge. In this paper, we propose a general strategy for improving the performance and interpretability of prediction models by integrating gene expression data with prior biological knowledge. First, we link gene identifiers in the expression dataset to gene annotation databases such as Gene Ontology (GO). Then we construct a "supergene" for each gene category by summarizing information from the genes related to outcome, using a modified principal component analysis (PCA) method. Finally, instead of using individual genes as predictors, we use these supergenes, each representing the information in one gene category, to predict survival outcome. In addition to identifying gene categories associated with outcome, the proposed approach also performs within-category selection to identify the important genes within each gene set. Using two real breast cancer microarray datasets, we show that prediction models built on gene set (or pathway) information outperform models built on the expression values of single genes, with improved prediction accuracy and interpretability.
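The supergene construction for a single gene category can be illustrated with a minimal sketch. It is not the authors' modified PCA method: it assumes a plain first principal component, and it uses absolute Pearson correlation with a continuous outcome as a stand-in for a survival-based gene score (censoring is ignored). The function name `supergene`, its parameters, and the `threshold` value are all hypothetical.

```python
import numpy as np

def supergene(expr, outcome, gene_idx, threshold=0.4):
    """Summarize one gene category as a single 'supergene' (illustrative only).

    expr     : (n_samples, n_genes) expression matrix
    outcome  : (n_samples,) continuous outcome (assumption: no censoring)
    gene_idx : column indices of the genes annotated to this category
    Returns (supergene scores or None, indices of the genes selected).
    """
    gene_idx = np.asarray(gene_idx)
    sub = expr[:, gene_idx]
    sub = sub - sub.mean(axis=0)           # center each gene
    y = outcome - outcome.mean()           # center the outcome

    # Within-category selection: keep genes whose absolute Pearson
    # correlation with the outcome reaches the threshold.
    denom = np.linalg.norm(sub, axis=0) * np.linalg.norm(y)
    score = np.abs(sub.T @ y) / np.where(denom > 0, denom, np.inf)
    keep = score >= threshold
    if not keep.any():
        return None, np.array([], dtype=int)

    # First principal component of the selected genes, via SVD of the
    # centered submatrix; its sample scores serve as the supergene.
    u, s, _ = np.linalg.svd(sub[:, keep], full_matrices=False)
    return u[:, 0] * s[0], gene_idx[keep]
```

Applying such a function to every category yields one predictor per category, which could then be passed to a survival model in place of individual genes. The sign of a principal component is arbitrary, so only the magnitude of its association with outcome is meaningful.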