Corral Álvaro, Serra Isabel, Ferrer-i-Cancho Ramon
Centre de Recerca Matemàtica, Edifici C, Campus Bellaterra, E-08193 Barcelona, Spain.
Departament de Matemàtiques, Facultat de Ciències, Universitat Autònoma de Barcelona, E-08193 Barcelona, Spain.
Phys Rev E. 2020 Nov;102(5-1):052113. doi: 10.1103/PhysRevE.102.052113.
In recent years, researchers have realized the difficulties of fitting power-law distributions properly. These difficulties are greater in Zipfian systems, due to the discreteness of the variables and to the existence of two representations for these systems, i.e., two versions depending on the random variable to fit: rank or size. The discreteness implies that a power law in one of the representations is not a power law in the other, and vice versa. We generate synthetic power laws in both representations and apply a state-of-the-art fitting method to each of the two random variables. The method (based on maximum likelihood plus a goodness-of-fit test) does not fit the whole distribution but only the tail, understood as the part of a distribution above a cutoff that separates non-power-law behavior from power-law behavior. We find that, no matter which random variable is power-law distributed, using the rank as the random variable is, in general, problematic for fitting (although it may work in some limiting cases). One of the difficulties comes from recovering the "hidden" true ranks from the empirical ranks. In contrast, the representation in terms of the distribution of sizes allows one to recover the true exponent (with some small bias when the underlying size distribution is a power law only asymptotically).
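As a rough illustration of the size representation described above, the following Python sketch draws synthetic sizes from a discrete power law and recovers the exponent by maximum likelihood above a fixed cutoff. It is not the authors' procedure: the goodness-of-fit test and the automatic selection of the cutoff are omitted, and the names `hurwitz_zeta`, `sample_sizes`, `fit_tail_exponent`, the symbol `gamma` for the exponent, the upper truncation `s_max` used for sampling, and the search interval for the exponent are all assumptions made for this example.

```python
import math
import random


def hurwitz_zeta(gamma, a, n_terms=1000):
    """Approximate sum_{k >= a} k**(-gamma) for gamma > 1 and integer a >= 1."""
    head = sum(k ** -gamma for k in range(a, a + n_terms))
    m = a + n_terms  # first index not covered by the explicit sum
    # Euler-Maclaurin estimate of the remaining tail sum_{k >= m} k**(-gamma)
    tail = (m ** (1 - gamma) / (gamma - 1)
            + 0.5 * m ** -gamma
            + gamma * m ** (-gamma - 1) / 12)
    return head + tail


def sample_sizes(gamma, n, s_max=100_000, seed=0):
    """Draw n sizes with P(s) proportional to s**(-gamma) on 1..s_max.

    Assumption: the distribution is truncated at s_max for convenience;
    for gamma = 2 the probability mass lost beyond s_max is about 1e-5.
    """
    rng = random.Random(seed)
    cum, total = [], 0.0
    for s in range(1, s_max + 1):
        total += s ** -gamma
        cum.append(total)
    return rng.choices(range(1, s_max + 1), cum_weights=cum, k=n)


def fit_tail_exponent(sizes, s_min, lo=1.01, hi=6.0, tol=1e-6):
    """Maximum-likelihood exponent for the sizes s >= s_min.

    Model: P(s) = s**(-gamma) / zeta(gamma, s_min) for s >= s_min.
    Returns (estimated gamma, number of observations in the tail).
    """
    tail = [s for s in sizes if s >= s_min]
    n = len(tail)
    mean_log = sum(math.log(s) for s in tail) / n

    def neg_loglik(gamma):
        # Negative log-likelihood per observation (convex in gamma)
        return gamma * mean_log + math.log(hurwitz_zeta(gamma, s_min))

    # Golden-section search for the minimum on [lo, hi]
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = neg_loglik(c), neg_loglik(d)
    while b - a > tol:
        if fc < fd:
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = neg_loglik(c)
        else:
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = neg_loglik(d)
    return (a + b) / 2, n


# Usage: fit the whole sample (cutoff 1) and only the tail above a cutoff of 5.
sizes = sample_sizes(gamma=2.0, n=20_000)
for s_min in (1, 5):
    gamma_hat, n_tail = fit_tail_exponent(sizes, s_min)
    print(f"s_min={s_min}: gamma_hat={gamma_hat:.3f} from {n_tail} sizes")
```

Because the synthetic sizes here follow a power law over their whole range, both cutoffs should give estimates close to the true exponent, with a larger statistical error for the higher cutoff since fewer observations remain. A fuller treatment would scan candidate cutoffs and accept a fit only if it passes a goodness-of-fit test, as the abstract describes.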