Larson Jessica L, Owen Art B
Department of Bioinformatics and Computational Biology, Genentech, Inc., South San Francisco, USA.
Currently at GenePeeks, Inc., Cambridge, USA.
BMC Bioinformatics. 2015 Apr 28;16:132. doi: 10.1186/s12859-015-0571-7.
Permutation-based gene set tests are standard approaches for testing relationships between collections of related genes and an outcome of interest in high-throughput expression analyses. Using M random permutations, one can attain p-values as small as 1/(M+1). When many gene sets are tested, we need smaller p-values, and hence larger M, to achieve significance while accounting for the number of simultaneous tests being made. As a result, the number of permutations to be done rises along with the cost per permutation. To reduce this cost, we seek parametric approximations to the permutation distributions for gene set tests.
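To make the 1/(M+1) floor concrete, the following is a minimal sketch of a plain permutation test for a linear gene set statistic. The function name `permutation_pvalue`, the genes-by-samples layout of `X`, and the choice of a sum of centered inner products as the statistic are assumptions made for illustration; this is not the npGSEA implementation.

```python
import numpy as np

def permutation_pvalue(X, y, M=999, seed=0):
    """Two-sided permutation p-value for a linear gene set statistic.

    X : array of shape (genes in the set, samples)   [assumed layout]
    y : array of shape (samples,), the outcome
    M : number of random permutations
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Center each gene so the statistic is a sum of covariance-like terms.
    Xc = X - X.mean(axis=1, keepdims=True)

    def stat(labels):
        # Sum over genes of the inner product with the centered outcome.
        return float((Xc @ (labels - labels.mean())).sum())

    t_obs = abs(stat(y))
    # Count permuted statistics at least as extreme as the observed one.
    exceed = sum(abs(stat(rng.permutation(y))) >= t_obs for _ in range(M))
    # The "+1" counts the observed labeling itself, so p >= 1/(M+1).
    return (1 + exceed) / (M + 1)
```

With M = 999 the smallest attainable p-value is 0.001, so a Bonferroni-style correction over thousands of gene sets would already require M in the millions, which is the cost the abstract refers to.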
We study two gene set methods based on sums and sums of squared correlations. The statistics we study are among the best performers in the extensive simulation of 261 gene set methods by Ackermann and Strimmer in 2009. Our approach calculates exact relevant moments of these statistics and uses them to fit parametric distributions. The computational cost of our algorithm for the linear case is on the order of doing |G| permutations, where |G| is the number of genes in set G. For the quadratic statistics, the cost is on the order of |G|^2 permutations, which can still be orders of magnitude faster than plain permutation sampling. We applied the permutation approximation method to three public Parkinson's Disease expression datasets and discovered enriched gene sets not previously discussed. We found that the moment-based gene set enrichment p-values closely approximate the permutation method p-values at a tiny fraction of their cost. They also gave nearly identical rankings to the gene sets being compared.
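The idea of replacing sampling with exact moments can be sketched for the linear case. For a statistic of the form T = sum_i z_i y_pi(i), where z_i is the per-sample sum over the genes in the set and pi is a uniformly random permutation, the permutation mean and variance have standard closed forms. The sketch below computes them and plugs them into a normal approximation. The function names, the data layout, and the use of a normal family are assumptions for illustration; the abstract does not specify which parametric distributions or which higher moments the authors fit.

```python
import math
import numpy as np

def linear_perm_moments(z, y):
    """Exact mean and variance of T = sum_i z_i * y_pi(i) over all permutations pi."""
    z = np.asarray(z, dtype=float)
    y = np.asarray(y, dtype=float)
    n = z.size
    mean = n * z.mean() * y.mean()
    var = ((z - z.mean()) ** 2).sum() * ((y - y.mean()) ** 2).sum() / (n - 1)
    return mean, var

def linear_gene_set_pvalue(X, y):
    """Two-sided p-value from a normal fit to the exact permutation moments.

    X : array of shape (genes in the set, samples)   [assumed layout]
    Assumes neither the collapsed scores nor y is constant (variance > 0).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    z = X.sum(axis=0)            # collapse the gene set to one score per sample
    t_obs = float(z @ y)
    mean, var = linear_perm_moments(z, y)
    zscore = (t_obs - mean) / math.sqrt(var)
    # 2 * (1 - Phi(|zscore|)) written with the complementary error function.
    return math.erfc(abs(zscore) / math.sqrt(2))
```

No permutations are drawn here, so the resulting p-value is not bounded below by 1/(M+1); its accuracy depends on how well the chosen parametric family matches the true permutation distribution.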
We have developed a moment-based approximation to the permutation distributions of linear and quadratic gene set test statistics. This allows approximate testing to be done orders of magnitude faster than one could do by sampling permutations. We have implemented our method as a publicly available Bioconductor package, npGSEA (www.bioconductor.org).