Wei Zhi, Li Hongzhe
Genomics and Computational Biology Graduate Group, University of Pennsylvania School of Medicine, Philadelphia, PA 19104, USA.
Biostatistics. 2007 Apr;8(2):265-84. doi: 10.1093/biostatistics/kxl007. Epub 2006 Jun 13.
High-throughput genomic data provide an opportunity for identifying pathways and genes that are related to various clinical phenotypes. Besides these genomic data, another valuable source of data is the biological knowledge about genes and pathways that might be related to the phenotypes of many complex diseases. Databases of such knowledge are often called metadata. In microarray data analysis, such metadata are currently explored post hoc through gene set enrichment analysis but have hardly been utilized in the modeling step. We propose and evaluate a pathway-based gradient descent boosting procedure for nonparametric pathway-based regression (NPR) analysis that efficiently integrates genomic data and metadata. Such NPR models consider multiple pathways simultaneously, allow complex interactions among genes within each pathway, and can be applied to identify pathways and genes that are related to variations in the phenotypes. These methods also offer an alternative way of addressing the problem of a large number of potential interactions, by limiting the analysis to biologically plausible interactions between genes in related pathways. Our simulation studies indicate that the proposed boosting procedure can indeed identify relevant pathways. Application to a gene expression data set on breast cancer distant metastasis indicated that the Wnt, apoptosis, and cell cycle-regulated pathways are more likely to be related to the risk of distant metastasis among lymph-node-negative breast cancer patients. Results from the analysis of two other breast cancer gene expression data sets indicate that the pathways of metalloendopeptidases (MMPs) and MMP inhibitors, as well as cell proliferation, cell growth, and maintenance, are important to breast cancer relapse and survival. We also observed that incorporating the pathway information yielded better prediction of cancer recurrence.
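The idea of pathway-based boosting described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a squared-error loss, small regression trees as base learners, and a rule that each boosting step keeps the single pathway whose tree best fits the current residuals. All names here (`fit_tree`, `predict_tree`, `pathway_boost`, the toy pathways "A", "B", "C") are hypothetical and chosen only for illustration.

```python
import random

def fit_tree(X, r, genes, depth):
    """Least-squares regression tree on residuals r, splitting only on `genes`."""
    mean = sum(r) / len(r)
    if depth == 0 or len(r) < 4:
        return mean  # leaf: predict the mean residual
    best = None
    for g in genes:
        # Skipping the smallest value keeps both sides of every split non-empty.
        for t in sorted({row[g] for row in X})[1:]:
            left = [ri for row, ri in zip(X, r) if row[g] < t]
            right = [ri for row, ri in zip(X, r) if row[g] >= t]
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, g, t)
    if best is None:
        return mean  # no usable split among this pathway's genes
    _, g, t = best
    lo = [i for i, row in enumerate(X) if row[g] < t]
    hi = [i for i, row in enumerate(X) if row[g] >= t]
    return (g, t,
            fit_tree([X[i] for i in lo], [r[i] for i in lo], genes, depth - 1),
            fit_tree([X[i] for i in hi], [r[i] for i in hi], genes, depth - 1))

def predict_tree(tree, row):
    """Walk down the nested (gene, threshold, left, right) tuples to a leaf value."""
    while isinstance(tree, tuple):
        g, t, left, right = tree
        tree = left if row[g] < t else right
    return tree

def pathway_boost(X, y, pathways, n_iter=20, shrinkage=0.1, depth=2):
    """Gradient boosting in which each step selects one pathway (assumed scheme)."""
    f0 = sum(y) / len(y)
    pred = [f0] * len(y)
    model = []
    for _ in range(n_iter):
        # Negative gradient of the squared-error loss is the residual.
        resid = [yi - pi for yi, pi in zip(y, pred)]
        best = None
        for name, genes in pathways.items():
            tree = fit_tree(X, resid, genes, depth)
            fit = [predict_tree(tree, row) for row in X]
            sse = sum((ri - fi) ** 2 for ri, fi in zip(resid, fit))
            if best is None or sse < best[0]:
                best = (sse, name, tree, fit)
        _, name, tree, fit = best
        pred = [p + shrinkage * f for p, f in zip(pred, fit)]
        model.append((name, tree))
    return f0, model

# Toy data (simulated, not from the paper): six genes in three pathways;
# the response depends on an interaction between the two genes of pathway "A".
random.seed(0)
X = [[random.random() for _ in range(6)] for _ in range(80)]
y = [4 * row[0] * row[1] + random.gauss(0, 0.1) for row in X]
pathways = {"A": [0, 1], "B": [2, 3], "C": [4, 5]}

f0, model = pathway_boost(X, y, pathways, n_iter=10)
selected = [name for name, _ in model]
print({name: selected.count(name) for name in pathways})
```

Because each tree may split only on genes from a single pathway, any interaction the model captures is an interaction within a pathway, which is how this kind of procedure restricts the search to biologically plausible interactions. Counting how often each pathway is selected gives a rough indication of its relevance. Binary or censored outcomes, such as relapse or survival, would need a different loss function than the squared error assumed here.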