Department of Computer Science and Software Engineering, The University of Melbourne, Parkville 3010, VIC, Australia.
BMC Bioinformatics. 2010 May 25;11:277. doi: 10.1186/1471-2105-11-277.
Different microarray studies have compiled gene lists for predicting outcomes of a range of treatments and diseases. These lists show little overlap, indicating that the results from any one study are unstable. It has been suggested that the underlying pathways are essentially identical, and that the expression of gene sets, rather than that of individual genes, may be more informative with respect to prognosis and understanding of the underlying biological process.
We sought to examine the stability of prognostic signatures based on gene sets rather than individual genes. We classified breast cancer cases from five microarray studies according to the risk of metastasis, using features derived from predefined gene sets. The expression levels of the genes in each set were aggregated using what we call a set statistic. The resulting prognostic gene sets were as predictive as the lists of individual genes, but displayed more consistent rankings across bootstrap replications within datasets, produced more stable classifiers across different datasets, and are potentially more interpretable in the biological context, since they examine each gene's expression in the context of its neighbouring genes in the pathway. In addition, we performed this analysis in each breast cancer molecular subtype, based on ER/HER2 status. The prognostic gene sets found in each subtype were consistent with the biology established by previous analyses of individual genes.
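As an illustration of the aggregation step, the following is a minimal sketch of a set statistic that collapses gene-level expression into one feature per gene set and sample. The abstract does not say which statistics were used, so the per-sample mean (with the median as an alternative) is an assumption, and the function name `set_statistic`, the gene names, and the toy values are all hypothetical.

```python
from statistics import mean, median


def set_statistic(expression, gene_sets, stat=mean):
    """Collapse a genes-by-samples table into a gene-sets-by-samples table.

    expression: dict mapping gene name -> list of expression values (one per sample)
    gene_sets:  dict mapping set name -> iterable of gene names
    stat:       function reducing a list of numbers to one number (assumed: mean)
    """
    features = {}
    for set_name, genes in gene_sets.items():
        # Keep only the genes that were actually measured.
        rows = [expression[g] for g in genes if g in expression]
        if not rows:
            continue  # no measured genes, so no feature for this set
        # zip(*rows) walks over samples; apply the statistic per sample.
        features[set_name] = [stat(values) for values in zip(*rows)]
    return features


# Toy data, invented for illustration: 4 genes measured in 3 samples.
expression = {
    "GENE_A": [1.0, 2.0, 3.0],
    "GENE_B": [3.0, 2.0, 1.0],
    "GENE_C": [0.0, 4.0, 8.0],
    "GENE_D": [5.0, 5.0, 5.0],
}
gene_sets = {
    "SET_1": ["GENE_A", "GENE_B"],
    "SET_2": ["GENE_C", "GENE_D", "GENE_MISSING"],
}

features = set_statistic(expression, gene_sets)
median_features = set_statistic(expression, gene_sets, stat=median)
```

Under these assumptions, each row of `features` could replace the individual gene rows as classifier input, so that a noisy measurement of one gene is averaged with the other members of its set.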
To date, most analyses of gene expression data have focused on individual genes. We show that a complementary approach of examining the data using predefined gene sets can reduce the noise and could provide increased insight into the underlying biological pathways.