Department of Biostatistics and Computational Biology, University of Rochester Medical Center, Rochester, NY 14642, USA.
Bioinformatics. 2009 Sep 15;25(18):2348-54. doi: 10.1093/bioinformatics/btp406. Epub 2009 Jul 2.
Recently, many univariate and several multivariate approaches have been suggested for testing differential expression of gene sets between different phenotypes. However, despite a wealth of literature studying their performance on simulated and real biological data, there is still a need to quantify their relative performance when they test different null hypotheses.
In this article, we compare the performance of univariate and multivariate tests on both simulated and biological data. In the simulation study, we demonstrate that high correlations affect the power of univariate and multivariate tests equally. In addition, for most of the tests, power is similarly affected by the dimensionality of the gene set and by the percentage of genes in the set whose expression changes between the two phenotypes. Applying different test statistics to biological data reveals that three statistics (sum of squared t-tests, Hotelling's T², N-statistic), which test different null hypotheses, find some common but also some complementary differentially expressed gene sets under specific settings. This demonstrates that, because their null hypotheses are complementary, each test focuses on different aspects of the data, and that for the analysis of biological data it is beneficial to use all three tests together rather than relying exclusively on one.
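To make two of the three statistics named above concrete, here is a minimal sketch in Python with NumPy. It uses the standard textbook forms of the pooled-variance two-sample t-statistic and of the two-sample Hotelling's T², which may differ in detail from the exact variants used in the article. The function names, the permutation scheme for the p-value, and the default number of permutations are assumptions made for illustration. The N-statistic is not covered here. Rows are samples and columns are the genes of one gene set.

```python
import numpy as np

def sum_squared_t(x, y):
    """Sum over genes of squared pooled-variance two-sample t-statistics."""
    n1, n2 = x.shape[0], y.shape[0]
    diff = x.mean(axis=0) - y.mean(axis=0)
    pooled_var = ((n1 - 1) * x.var(axis=0, ddof=1)
                  + (n2 - 1) * y.var(axis=0, ddof=1)) / (n1 + n2 - 2)
    t = diff / np.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return float(np.sum(t ** 2))

def hotelling_t2(x, y):
    """Two-sample Hotelling's T^2 with a pooled covariance matrix.

    Assumes the pooled covariance is invertible, which in practice
    requires more samples than genes in the set (n1 + n2 - 2 >= p).
    """
    n1, n2 = x.shape[0], y.shape[0]
    diff = x.mean(axis=0) - y.mean(axis=0)
    s_pooled = ((n1 - 1) * np.cov(x, rowvar=False)
                + (n2 - 1) * np.cov(y, rowvar=False)) / (n1 + n2 - 2)
    s_pooled = np.atleast_2d(s_pooled)  # handles a single-gene set
    return float(n1 * n2 / (n1 + n2) * (diff @ np.linalg.solve(s_pooled, diff)))

def permutation_pvalue(stat, x, y, n_perm=500, seed=0):
    """Permutation p-value obtained by shuffling phenotype labels (assumed scheme)."""
    rng = np.random.default_rng(seed)
    pooled = np.vstack([x, y])
    n1 = x.shape[0]
    observed = stat(x, y)
    hits = 0
    for _ in range(n_perm):
        idx = rng.permutation(pooled.shape[0])
        if stat(pooled[idx[:n1]], pooled[idx[n1:]]) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)
```

The two statistics coincide when the pooled covariance matrix is diagonal: the sum of squared t-statistics ignores between-gene covariances, while Hotelling's T² uses them. This is one way to see why the two tests can flag different gene sets. Calling `permutation_pvalue(hotelling_t2, x, y)` and `permutation_pvalue(sum_squared_t, x, y)` on the same gene set gives a side-by-side comparison.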