Science for Life Laboratory, School of Engineering Sciences in Chemistry, Biotechnology and Health, KTH - Royal Institute of Technology, 171 21 Solna, Sweden.
Department of Mathematics, Stockholm University, 106 91 Stockholm, Sweden.
Bioinformatics. 2021 Apr 1;36(22-23):5392-5397. doi: 10.1093/bioinformatics/btaa1007.
Permutation tests offer a straightforward framework to assess the significance of differences in sample statistics. A significant advantage of permutation tests are the relatively few assumptions about the distribution of the test statistic are needed, as they rely on the assumption of exchangeability of the group labels. They have great value, as they allow a sensitivity analysis to determine the extent to which the assumed broad sample distribution of the test statistic applies. However, in this situation, permutation tests are rarely applied because the running time of naïve implementations is too slow and grows exponentially with the sample size. Nevertheless, continued development in the 1980s introduced dynamic programming algorithms that compute exact permutation tests in polynomial time. Albeit this significant running time reduction, the exact test has not yet become one of the predominant statistical tests for medium sample size. Here, we propose a computational parallelization of one such dynamic programming-based permutation test, the Green algorithm, which makes the permutation test more attractive.
Parallelization of the Green algorithm was found possible by non-trivial rearrangement of the structure of the algorithm. A speed-up-by orders of magnitude-is achievable by executing the parallelized algorithm on a GPU. We demonstrate that the execution time essentially becomes a non-issue for sample sizes, even as high as hundreds of samples. This improvement makes our method an attractive alternative to, e.g. the widely used asymptotic Mann-Whitney U-test.
In Python 3 code from the GitHub repository https://github.com/statisticalbiotechnology/parallelPermutationTest under an Apache 2.0 license.
Supplementary data are available at Bioinformatics online.
排列检验为评估样本统计数据差异的显著性提供了一个直接的框架。排列检验的一个显著优势是,它们对检验统计量分布的假设相对较少,因为它们依赖于组标签可交换性的假设。它们具有很大的价值,因为它们允许进行敏感性分析,以确定所假设的检验统计量的广泛样本分布在多大程度上适用。然而,在这种情况下,很少应用排列检验,因为天真实现的运行时间太慢,并且随样本量呈指数增长。尽管在 20 世纪 80 年代继续发展,引入了计算精确排列检验的动态规划算法,但多项式时间。尽管运行时间显著减少,但精确检验尚未成为中等样本量的主要统计检验方法之一。在这里,我们提出了一种基于动态规划的排列检验,即 Green 算法的计算并行化,这使得排列检验更具吸引力。
通过对算法结构进行非平凡的重新排列,发现 Green 算法可以进行并行化。通过在 GPU 上执行并行化算法,可以实现数量级的加速。我们证明,即使对于高达数百个样本的样本大小,执行时间基本上也不再是一个问题。这种改进使得我们的方法成为一种有吸引力的替代方法,例如广泛使用的渐近 Mann-Whitney U 检验。
在 Python 3 代码中,可从 GitHub 存储库 https://github.com/statisticalbiotechnology/parallelPermutationTest 获得,许可证为 Apache 2.0。
补充数据可在 Bioinformatics 在线获得。