Theoretical Biology and Biophysics, Los Alamos National Laboratory, Los Alamos, NM, USA.
Michigan Institute for Data Science, University of Michigan, Ann Arbor, MI 48109-1382, USA.
J R Soc Interface. 2023 Aug;20(205):20230310. doi: 10.1098/rsif.2023.0310. Epub 2023 Aug 30.
Despite widespread claims of power laws across the natural and social sciences, evidence in data is often equivocal. Modern data and statistical methods reject even classic power laws such as Pareto's law of wealth and the Gutenberg-Richter law for earthquake magnitudes. We show that the maximum-likelihood estimators and Kolmogorov-Smirnov (K-S) statistics in widespread use are unexpectedly sensitive to ubiquitous errors in data such as measurement noise, quantization noise, heaping and censoring of small values. This sensitivity causes spurious rejection of power laws and biases parameter estimates even in arbitrarily large samples, which explains inconsistencies between theory and data. We show that logarithmic binning by powers of λ > 1 attenuates these errors in a manner analogous to noise averaging in normal statistics and that λ thereby tunes a trade-off between accuracy and precision in estimation. Binning also removes potentially misleading within-scale information while preserving information about the shape of a distribution over powers of λ, and we show that some amount of binning can improve the sensitivity and specificity of K-S tests without any cost, while more extreme binning tunes a trade-off between sensitivity and specificity. We therefore advocate logarithmic binning as a simple, essential step in power-law inference.
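To make the binning idea concrete, the following Python sketch (ours, for illustration; the function names, the choice λ = 10 and the integer-quantization noise model are assumptions, not taken from the paper) draws samples from a continuous power law, corrupts them by rounding to integers, and compares the standard continuous maximum-likelihood estimator of the exponent α with an estimator computed from logarithmically binned data. The binned estimator uses the fact that, for a pure power law with density proportional to x^(-α), the bin index under binning by powers of λ is geometrically distributed with ratio λ^(-(α-1)).

    import numpy as np

    # Minimal sketch (our illustration, not the authors' code): logarithmic
    # binning by powers of lam > 1 attenuates quantization noise in
    # power-law exponent estimation.

    rng = np.random.default_rng(0)

    def pareto_samples(n, alpha=2.5, xmin=1.0):
        # Inverse-CDF sampling from p(x) ~ x**-alpha for x >= xmin.
        u = 1.0 - rng.uniform(size=n)  # u in (0, 1]
        return xmin * u ** (-1.0 / (alpha - 1.0))

    def mle_continuous(x, xmin=1.0):
        # Standard continuous maximum-likelihood estimator of alpha.
        return 1.0 + x.size / np.sum(np.log(x / xmin))

    def mle_log_binned(x, xmin=1.0, lam=10.0):
        # Bin k covers [xmin*lam**k, xmin*lam**(k+1)). For a pure power law
        # the bin index is geometric with ratio lam**-(alpha - 1), so the
        # geometric-distribution MLE recovers alpha from the mean bin index.
        k = np.floor(np.log(x / xmin) / np.log(lam))
        q = k.mean() / (1.0 + k.mean())
        return 1.0 - np.log(q) / np.log(lam)

    x = pareto_samples(100_000, alpha=2.5)
    x_noisy = np.maximum(np.round(x), 1.0)  # quantization noise: round to integers

    print(mle_continuous(x))        # close to the true alpha = 2.5
    print(mle_continuous(x_noisy))  # noticeably biased by quantization
    print(mle_log_binned(x_noisy))  # binning attenuates the bias

Only samples whose rounding crosses a bin edge change bins, so with wide bins (λ = 10 here) quantization perturbs the binned counts far less than it perturbs the individual logarithms entering the continuous estimator. A smaller λ keeps more within-scale information but attenuates noise less, which is the accuracy-precision trade-off the abstract describes.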