Phipson Belinda, Smyth Gordon K
The Walter and Eliza Hall Institute of Medical Research.
Stat Appl Genet Mol Biol. 2010;9:Article39. doi: 10.2202/1544-6115.1585. Epub 2010 Oct 31.
Permutation tests are among the most commonly used statistical tools in modern genomic research. They attach a p-value to a test statistic by randomly permuting the sample or gene labels. Yet permutation p-values published in the genomic literature are often computed incorrectly and are understated by about 1/m, where m is the number of permutations. The same is often true in the more general situation in which Monte Carlo simulation is used to assign p-values. Although the understatement is usually small in absolute terms, the implications can be serious in a multiple testing context. The understatement arises from the intuitive but mistaken idea of using permutation to estimate the tail probability of the test statistic. We argue instead that permutation should be viewed as generating an exact discrete null distribution. The relevant literature, some of which is likely to have been relatively inaccessible to the genomic community, is reviewed and summarized. A computation strategy is developed for exact p-values when permutations are randomly drawn. The strategy is valid for any number of permutations and samples. Some simple recommendations are made for the implementation of permutation tests in practice.
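The size of the understatement can be illustrated with a minimal Python sketch. It is not the exact computation strategy developed in the paper. It only contrasts the naive tail estimate b/m with the widely used adjustment (b + 1)/(m + 1), where b is the number of random permutations whose statistic is at least as extreme as the observed one. The function name `permutation_p_values`, the choice of statistic (absolute difference in group means), and the example data are all assumptions made for illustration.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_p_values(group_a, group_b, m=1000, seed=0):
    """Return (naive, adjusted) two-sided p-values from m random label permutations.

    Hypothetical helper for illustration only; the statistic is the absolute
    difference in group means.
    """
    rng = random.Random(seed)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    observed = abs(mean(group_a) - mean(group_b))

    b = 0  # permutations with a statistic at least as extreme as the observed one
    for _ in range(m):
        rng.shuffle(pooled)  # randomly reassign the sample labels
        stat = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if stat >= observed:
            b += 1

    naive = b / m                 # tail-probability estimate; equals 0 when b == 0
    adjusted = (b + 1) / (m + 1)  # counts the observed labelling too; never 0
    return naive, adjusted


if __name__ == "__main__":
    # Made-up, well-separated groups (assumed data, not from the paper).
    treated = [10.1, 10.4, 9.8, 10.6, 10.2]
    control = [1.2, 0.9, 1.1, 1.3, 0.8]
    naive, adjusted = permutation_p_values(treated, control, m=100)
    print(f"naive = {naive:.4f}, adjusted = {adjusted:.4f}")
```

The two values differ by (m - b)/(m(m + 1)), which is close to 1/m when b is small, in line with the understatement described above. A naive value of exactly 0 would pass any multiple testing threshold, whereas the adjusted value cannot fall below 1/(m + 1).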