Kong Sek Won, Pu William T, Park Peter J
Department of Cardiology, 300 Longwood Avenue, Boston, MA 02115, USA.
Bioinformatics. 2006 Oct 1;22(19):2373-80. doi: 10.1093/bioinformatics/btl401. Epub 2006 Jul 28.
Several statistical methods that combine analysis of differential gene expression with biological knowledge databases have been proposed for a more rapid interpretation of expression data. However, most such methods are based on a series of univariate statistical tests and do not properly account for the complex structure of gene interactions.
We present a simple yet effective multivariate statistical procedure for assessing the correlation between a subspace defined by a group of genes and a binary phenotype. A subspace is deemed significant if the samples corresponding to different phenotypes are well separated in that subspace. The separation is measured using Hotelling's T² statistic, which captures the covariance structure of the subspace. When the dimension of the subspace is larger than that of the sample space, we project the original data to a smaller orthonormal subspace. We use this method to search through functional pathway subspaces defined by Reactome, KEGG, BioCarta and Gene Ontology. To demonstrate its performance, we apply this method to the data from two published studies, and visualize the results in the principal component space.
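As an illustration of the procedure described above, the following is a minimal sketch of a two-sample Hotelling's T² computed on one gene set after projection onto a smaller orthonormal basis. It is not the authors' implementation. The function name `hotelling_t2_pathway`, the use of SVD to obtain the basis, and the `var_explained` rule for choosing the number of retained components are assumptions made for this example; the abstract states only that the data are projected to a smaller orthonormal subspace.

```python
import numpy as np

def hotelling_t2_pathway(X, y, var_explained=0.95):
    """Two-sample Hotelling's T^2 for one gene set (illustrative sketch).

    X: (samples x genes) expression matrix restricted to the gene set.
    y: binary phenotype labels, one per sample.
    var_explained: assumed rule for choosing the projected dimension.
    Returns (T^2, F statistic, number of retained components k).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    labels = np.unique(y)
    if labels.size != 2:
        raise ValueError("y must contain exactly two phenotype labels")
    n = X.shape[0]

    # Orthonormal basis from the SVD of the centered data (labels not used).
    Xc = X - X.mean(axis=0)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    rank = int(np.sum(s > s[0] * 1e-10))

    # Keep the leading components, capped at n - 2 so that the pooled
    # covariance can be inverted and the F degrees of freedom stay positive.
    cum = np.cumsum(s**2) / np.sum(s**2)
    k = int(np.searchsorted(cum, var_explained)) + 1
    k = max(1, min(k, rank, n - 2))
    Z = Xc @ Vt[:k].T  # samples expressed in the k-dimensional subspace

    # Group means and pooled within-group covariance in the subspace.
    Z0, Z1 = Z[y == labels[0]], Z[y == labels[1]]
    n0, n1 = len(Z0), len(Z1)
    m0, m1 = Z0.mean(axis=0), Z1.mean(axis=0)
    R0, R1 = Z0 - m0, Z1 - m1
    S = (R0.T @ R0 + R1.T @ R1) / (n - 2)

    # T^2 = (n0 * n1 / n) * d' S^{-1} d, with d the difference of group means.
    d = m0 - m1
    t2 = (n0 * n1 / n) * float(d @ np.linalg.solve(S, d))

    # Standard scaling of T^2 to an F statistic with (k, n - k - 1) d.o.f.
    f_stat = (n - k - 1) / (k * (n - 2)) * t2
    return t2, f_stat, k
```

With a single gene the statistic reduces to the squared pooled-variance t statistic, which gives a quick sanity check. Under multivariate normality the scaled statistic can be referred to an F distribution with (k, n - k - 1) degrees of freedom; because the projection here is chosen from the data, that reference is approximate, and a permutation of the phenotype labels is one alternative way to obtain a p-value.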