Department of Population Medicine, Harvard Medical School and Harvard Pilgrim Health Care Institute, Boston, USA.
Int J Health Geogr. 2010 Dec 17;9:61. doi: 10.1186/1476-072X-9-61.
The spatial and space-time scan statistics are commonly applied for the detection of geographical disease clusters. Monte Carlo hypothesis testing is typically used to test whether the geographical clusters are statistically significant as there is no known way to calculate the null distribution analytically. In Monte Carlo hypothesis testing, simulated random data are generated multiple times under the null hypothesis, and the p-value is r/(R + 1), where R is the number of simulated random replicates of the data and r is the rank of the test statistic from the real data compared to the same test statistics calculated from each of the random data sets. A drawback to this powerful technique is that each additional digit of p-value precision requires ten times as many replicated datasets, and the additional processing can lead to excessive run times.
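The rank-based p-value described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `monte_carlo_p_value` and the convention of counting ties against the observed value are assumptions made for the example.

```python
def monte_carlo_p_value(observed, simulated):
    """Rank-based Monte Carlo p-value r / (R + 1).

    `simulated` holds the test statistic from each of the R random
    replicates generated under the null hypothesis. The rank r is the
    position of the observed statistic among all R + 1 values, counted
    from the largest, with ties counted against the observed value
    (an assumed convention).
    """
    R = len(simulated)
    r = 1 + sum(1 for s in simulated if s >= observed)
    return r / (R + 1)


def smallest_attainable_p(R):
    """Smallest p-value the test can report with R replicates."""
    return 1 / (R + 1)


if __name__ == "__main__":
    # Each extra decimal digit of precision needs about ten times
    # as many replicates.
    for R in (99, 999, 9999):
        print(R, smallest_attainable_p(R))
```

However extreme the observed statistic is, the reported p-value can never fall below 1/(R + 1), which is why extra precision costs so many additional replicates.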
We propose a new method for obtaining more precise p-values with a given number of replicates. The collection of test statistics from the random replicates is used to estimate the true distribution of the test statistic under the null hypothesis by fitting a continuous distribution to these observations. The choice of distribution is critical, and for the spatial and space-time scan statistics, the extreme value Gumbel distribution performs very well while the gamma, normal and lognormal distributions perform poorly. From the fitted Gumbel distribution, we show that it is possible to estimate the analytical p-value with great precision even when the test statistic is far out in the tail beyond any of the test statistics observed in the simulated replicates. In addition, Gumbel-based rejection probabilities have smaller variability than Monte Carlo-based rejection probabilities, suggesting that the proposed approach may result in greater power than the true Monte Carlo hypothesis test for a given number of replicates.
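The Gumbel-based approach can be sketched as follows. The abstract does not state how the distribution is fitted, so the method-of-moments fit used here is an assumption, and the names `fit_gumbel_moments`, `gumbel_sf` and `gumbel_p_value` are hypothetical.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_gumbel_moments(samples):
    """Method-of-moments fit of a Gumbel (maximum) distribution.

    Assumed fitting procedure: match the sample mean and standard
    deviation to the Gumbel mean (mu + gamma * beta) and standard
    deviation (beta * pi / sqrt(6)).
    """
    mean = statistics.fmean(samples)
    sd = statistics.stdev(samples)
    beta = sd * math.sqrt(6) / math.pi
    mu = mean - EULER_GAMMA * beta
    return mu, beta


def gumbel_sf(x, mu, beta):
    """Upper-tail probability P(X > x) = 1 - exp(-exp(-(x - mu) / beta))."""
    z = (x - mu) / beta
    if z < -700:  # exp(-z) would overflow; the probability is 1 here
        return 1.0
    # -expm1(-t) equals 1 - exp(-t) but stays accurate for tiny t,
    # i.e. far out in the upper tail.
    return -math.expm1(-math.exp(-z))


def gumbel_p_value(observed, simulated):
    """P-value of `observed` under a Gumbel fitted to the replicates."""
    mu, beta = fit_gumbel_moments(simulated)
    return gumbel_sf(observed, mu, beta)
```

Unlike the rank-based value r/(R + 1), this estimate is not bounded below by 1/(R + 1), so an observed statistic beyond every simulated value still receives a distinct p-value.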
For large data sets, it is often advantageous to replace computer intensive Monte Carlo hypothesis testing with this new method of fitting a Gumbel distribution to random data sets generated under the null, in order to reduce computation time and obtain much more precise p-values and slightly higher statistical power.