Knijnenburg Theo A, Wessels Lodewyk F A, Reinders Marcel J T, Shmulevich Ilya
Institute for Systems Biology, Seattle, WA, USA.
Bioinformatics. 2009 Jun 15;25(12):i161-8. doi: 10.1093/bioinformatics/btp211.
Permutation tests have become a standard tool for assessing the statistical significance of an event under investigation. The significance, expressed as a P-value, is calculated as the fraction of permutation values that are at least as extreme as the original statistic, which is derived from the non-permuted data. This empirical method ties both the smallest obtainable P-value and the resolution of the P-value directly to the number of permutations: with N permutations, the smallest nonzero P-value and the step between attainable P-values are both 1/N. Accurately estimating small P-values therefore requires a very large number of permutations, which is computationally expensive and often infeasible.
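As an illustration of the empirical calculation described above, the following is a minimal sketch in Python rather than the authors' Matlab code. The test statistic (a one-sided difference in group means) and the function names `permutation_statistics` and `empirical_p_value` are assumptions chosen for the example; the abstract does not prescribe them.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_statistics(x, y, n_perm=1000, seed=0):
    """Return the observed statistic and a list of permutation values.

    Assumed statistic: mean(x) - mean(y), recomputed after shuffling the
    pooled observations and splitting them back into two groups.
    """
    rng = random.Random(seed)
    observed = mean(x) - mean(y)
    pooled = list(x) + list(y)
    n_x = len(x)
    perm_stats = []
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabelling of the group membership
        perm_stats.append(mean(pooled[:n_x]) - mean(pooled[n_x:]))
    return observed, perm_stats


def empirical_p_value(observed, perm_stats):
    """Fraction of permutation values at least as extreme as the observed one."""
    exceed = sum(1 for s in perm_stats if s >= observed)
    return exceed / len(perm_stats)


# Two clearly separated groups: the estimate can only be a multiple of 1/n_perm.
x = [float(v) for v in range(100, 120)]
y = [float(v) for v in range(20)]
observed, perm_stats = permutation_statistics(x, y, n_perm=1000)
p = empirical_p_value(observed, perm_stats)
```

Because the estimate is a count divided by the number of permutations, any true P-value well below 1/1000 is reported here as either 0 or a multiple of 0.001, which is the resolution limit the abstract refers to.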
A method of computing P-values based on tail approximation is presented. The tail of the distribution of permutation values is approximated by a generalized Pareto distribution. A good fit and thus accurate P-value estimates can be obtained with a drastically reduced number of permutations when compared with the standard empirical way of computing P-values.
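The tail approximation can be sketched as follows, again in Python and under stated assumptions rather than as the authors' implementation. The number of exceedances (250), the threshold rule (midpoint between the last exceedance and the next value), the method-of-moments fit of the generalized Pareto distribution, and the fallback to the empirical fraction below the threshold are all assumptions made for the example; the function name `gpd_tail_p_value` is hypothetical.

```python
import math
import random
import statistics


def gpd_tail_p_value(perm_stats, observed, n_exceed=250):
    """Estimate a P-value by fitting a GPD to the largest permutation values."""
    n = len(perm_stats)
    if not 1 < n_exceed < n:
        raise ValueError("n_exceed must be between 2 and len(perm_stats) - 1")
    ordered = sorted(perm_stats, reverse=True)
    # Assumed threshold rule: halfway between the last exceedance and the next value.
    threshold = 0.5 * (ordered[n_exceed - 1] + ordered[n_exceed])
    if observed <= threshold:
        # Observed value is not in the modelled tail: use the empirical fraction.
        return sum(1 for s in perm_stats if s >= observed) / n
    excesses = [s - threshold for s in ordered[:n_exceed]]
    m = statistics.mean(excesses)
    v = statistics.variance(excesses)
    if v == 0:
        raise ValueError("exceedances are all identical; cannot fit a tail")
    # Method-of-moments estimates of the GPD shape and scale (assumed estimator).
    ratio = m * m / v
    shape = 0.5 * (1.0 - ratio)
    scale = 0.5 * m * (ratio + 1.0)
    z = observed - threshold
    if abs(shape) < 1e-12:
        survival = math.exp(-z / scale)  # exponential limit of the GPD
    else:
        base = 1.0 + shape * z / scale
        # A non-positive base means z lies beyond the fitted upper endpoint.
        survival = base ** (-1.0 / shape) if base > 0 else 0.0
    # P(X >= observed) = P(X > threshold) * P(excess >= z | X > threshold)
    return (n_exceed / n) * survival


# Simulated stand-in for permutation values (not real permutation output).
rng = random.Random(1)
perm_stats = [rng.expovariate(1.0) for _ in range(2000)]
p = gpd_tail_p_value(perm_stats, observed=10.0)
```

With 2000 values the empirical fraction cannot resolve anything between 0 and 1/2000, whereas the fitted tail returns a continuous estimate for an observed statistic far beyond the largest permutation value. The quality of that estimate depends on how well the generalized Pareto distribution fits the exceedances, so a goodness-of-fit check would be advisable in practice.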
The Matlab code can be obtained from the corresponding author on request.
Supplementary data are available at Bioinformatics online.