Chen Xi, Wang Lily, Smith Jonathan D, Zhang Bing
Department of Quantitative Health Sciences, The Cleveland Clinic, 9500 Euclid Ave. Cleveland, OH 44195, USA.
Bioinformatics. 2008 Nov 1;24(21):2474-81. doi: 10.1093/bioinformatics/btn458. Epub 2008 Aug 27.
Gene set analysis allows formal testing of subtle but coordinated changes in a group of genes, such as those defined by the Gene Ontology (GO) or KEGG Pathway databases. We propose a new method for gene set analysis that is based on principal component analysis (PCA) of the gene expression values in the gene set. PCA is an effective method for reducing high dimensionality and capturing variation in gene expression values. However, one limitation of PCA is that the latent variable identified by the first PC may be unrelated to the outcome.
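The limitation noted above can be illustrated with a minimal sketch. It is not the authors' code: the simulated data, the function name `first_pc_scores`, and the effect sizes are assumptions chosen only to show a first PC that follows a strong nuisance factor rather than the outcome.

```python
import numpy as np

def first_pc_scores(expr):
    """Return first-PC sample scores and the fraction of variance PC1 explains.

    expr is a samples x genes matrix (hypothetical layout for this sketch).
    """
    centered = expr - expr.mean(axis=0)          # center each gene
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, 0] * s[0]                      # projection of samples on PC1
    explained = s[0] ** 2 / np.sum(s ** 2)
    return scores, explained

rng = np.random.default_rng(0)
n_samples, n_genes = 40, 20
outcome = rng.normal(size=n_samples)
nuisance = rng.normal(size=n_samples)            # variation unrelated to outcome

expr = rng.normal(size=(n_samples, n_genes))
expr[:, :15] += 3.0 * nuisance[:, None]          # strong nuisance signal in 15 genes
expr[:, 15:] += 0.5 * outcome[:, None]           # weak outcome signal in 5 genes

scores, explained = first_pc_scores(expr)
corr_nuisance = abs(np.corrcoef(scores, nuisance)[0, 1])
corr_outcome = abs(np.corrcoef(scores, outcome)[0, 1])
print(f"PC1 explains {explained:.2f} of the variance")
print(f"|corr(PC1, nuisance)| = {corr_nuisance:.2f}, |corr(PC1, outcome)| = {corr_outcome:.2f}")
```

In this simulated setting the first PC is dominated by the nuisance factor, so a test based on it would have little power to detect the outcome-related genes.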
In the proposed supervised PCA (SPCA) model for gene set analysis, the PCs are estimated from a selected subset of genes that are associated with the outcome. Because outcome information is used in the gene selection step, the method is supervised, hence the name Supervised PCA. Because of the gene selection step, the test statistic in the SPCA model can no longer be approximated well by a t-distribution. We propose a two-component mixture distribution based on Gumbel extreme value distributions to account for the gene selection step. We show that the proposed method compares favorably with currently available gene set analysis methods on simulated and real microarray data.
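A rough sketch of the supervised step may help make it concrete. It is an illustration under stated assumptions, not the authors' implementation: the names `corr_t` and `spca_statistic`, the selection threshold of 2.0, the fallback to the single top gene, and the simulated data are all hypothetical. The permutation loop stands in for a null distribution only to show the inflation caused by gene selection; it is not the Gumbel-based mixture distribution proposed in the article.

```python
import numpy as np

def corr_t(x, y):
    """t-statistic for the slope in a simple linear regression of y on x."""
    n = len(y)
    r = np.corrcoef(x, y)[0, 1]
    return r * np.sqrt((n - 2) / (1.0 - r ** 2))

def spca_statistic(expr, outcome, threshold=2.0):
    """Select outcome-associated genes, then test PC1 of the selected genes.

    expr is samples x genes; threshold is an assumed cutoff on |t| per gene.
    """
    t_genes = np.array([corr_t(expr[:, j], outcome) for j in range(expr.shape[1])])
    keep = np.abs(t_genes) > threshold
    if not keep.any():
        # Assumed fallback: keep the single most associated gene.
        keep = np.abs(t_genes) == np.abs(t_genes).max()
    sub = expr[:, keep]
    centered = sub - sub.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    pc1 = u[:, 0] * s[0]                         # first PC of the selected genes
    return corr_t(pc1, outcome), keep

rng = np.random.default_rng(1)
n_samples, n_genes = 40, 20
outcome = rng.normal(size=n_samples)
expr = rng.normal(size=(n_samples, n_genes))
expr[:, :5] += outcome[:, None]                  # 5 genes carry outcome signal

stat, keep = spca_statistic(expr, outcome)

# Permuting the outcome breaks any association, yet selection still inflates
# the statistic relative to a t-distribution with n - 2 degrees of freedom.
null_stats = np.array([
    spca_statistic(expr, rng.permutation(outcome))[0] for _ in range(200)
])
inflated = np.mean(np.abs(null_stats) > 2.02)    # 2.02 is roughly the two-sided 5% t cutoff
print(f"selected genes: {keep.sum()}, |statistic| = {abs(stat):.2f}")
print(f"fraction of permuted statistics beyond the t cutoff: {inflated:.2f}")
```

If the t-distribution were an adequate reference, about 5% of the permuted statistics would exceed the cutoff. A larger fraction in this sketch reflects the selection effect that motivates a different null distribution.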
The R code for the analysis used in this article is available upon request; we are currently working on implementing the proposed method in an R package.