Tamayo Pablo, Steinhardt George, Liberzon Arthur, Mesirov Jill P
The Eli and Edythe L. Broad Institute of Massachusetts Institute of Technology and Harvard University, Cambridge, MA, USA
Boston University Bioinformatics Program, Boston University, Boston, MA, USA.
Stat Methods Med Res. 2016 Feb;25(1):472-87. doi: 10.1177/0962280212460441. Epub 2012 Oct 14.
Since its first publication in 2003, the Gene Set Enrichment Analysis method, based on the Kolmogorov-Smirnov statistic, has been heavily used, modified, and also questioned. Recently, Irizarry et al. (2009) proposed a simplified approach, which uses a one-sample t-test score to assess enrichment and ignores gene-gene correlations, as a serious contender. Their argument criticizes Gene Set Enrichment Analysis's nonparametric nature and its use of an empirical null distribution as unnecessary and hard to compute. We refute these claims by careful consideration of the assumptions of the simplified method and its results, including a comparison with Gene Set Enrichment Analysis on a large benchmark set of 50 datasets. Our results provide strong empirical evidence that gene-gene correlations cannot be ignored, because of the significant variance inflation they produce in the enrichment scores, and that they should be taken into account when estimating gene set enrichment significance. In addition, we discuss the challenges that the complex correlation structure and multi-modality of gene sets pose more generally for gene set enrichment methods.
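The variance inflation argument can be illustrated with a small simulation. The sketch below is not the authors' benchmark. It assumes a hypothetical equicorrelated model in which every pair of gene-level scores in a set has correlation `rho`, and it assumes a set-level score of the form sqrt(n) times the mean gene-level score, which is one reading of a one-sample-test style score. Under those assumptions the null variance of the set score is 1 + (n - 1) * rho rather than 1, so comparing it against a standard normal overstates significance. The function names and parameter values are illustrative only.

```python
import numpy as np


def simulate_set_scores(n_genes, rho, n_sets, rng):
    """Simulate null set-level scores for equicorrelated gene-level scores.

    Assumed model (illustrative): each gene score is N(0, 1) and every pair
    of genes in a set has correlation rho, built from one shared factor.
    Returns sqrt(n_genes) * mean(gene scores) for each simulated set.
    """
    shared = rng.standard_normal((n_sets, 1))        # factor common to the set
    own = rng.standard_normal((n_sets, n_genes))     # gene-specific noise
    z = np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * own
    return np.sqrt(n_genes) * z.mean(axis=1)


def variance_inflation_factor(n_genes, rho):
    """Null variance of the set score under the equicorrelated model."""
    return 1.0 + (n_genes - 1) * rho


rng = np.random.default_rng(0)
n_genes, rho = 50, 0.1                               # assumed values

scores = simulate_set_scores(n_genes, rho, 20000, rng)
empirical_var = scores.var()
expected_var = variance_inflation_factor(n_genes, rho)   # 1 + 49 * 0.1 = 5.9

# Fraction of null sets called significant at a nominal two-sided 5% level
# when the correlation is ignored and N(0, 1) is used as the reference.
false_positive_rate = np.mean(np.abs(scores) > 1.96)

print(f"empirical variance: {empirical_var:.2f}, expected: {expected_var:.2f}")
print(f"nominal 5% test rejects {false_positive_rate:.1%} of null sets")
```

Even a modest pairwise correlation is multiplied by the set size in this model, which is why larger gene sets are affected more strongly. A reference distribution that preserves the correlation structure, such as an empirical null, avoids relying on the independence assumption.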