Corral Álvaro, Serra Isabel
Centre de Recerca Matemàtica, Edifici C, Campus Bellaterra, E-08193 Barcelona, Spain.
Departament de Matemàtiques, Facultat de Ciències, Universitat Autònoma de Barcelona, E-08193 Barcelona, Spain.
Entropy (Basel). 2020 Feb 17;22(2):224. doi: 10.3390/e22020224.
An important body of quantitative linguistics is constituted by a series of statistical laws about language usage. Despite the importance of these linguistic laws, some of them are poorly formulated, and, more importantly, there is no unified framework that encompasses all of them. This paper presents a new perspective to establish a connection between different statistical linguistic laws. Characterizing each word type by two random variables, length (in number of characters) and absolute frequency, we show that the corresponding bivariate joint probability distribution exhibits a rich and precise phenomenology, with the type-length and the type-frequency distributions as its two marginals, and the conditional distribution of frequency at fixed length providing a clear formulation for the brevity-frequency phenomenon. The type-length distribution turns out to be well fitted by a gamma distribution (much better than by the previously proposed lognormal), and the conditional frequency distributions at fixed length display power-law-decay behavior with a fixed exponent α ≃ 1.4 and a characteristic-frequency crossover that scales as an inverse power δ ≃ 2.8 of length, which implies the fulfillment of a scaling law analogous to those found in the thermodynamics of critical phenomena. As a by-product, we find a possible model-free explanation for the origin of Zipf's law, which should arise as a mixture of conditional frequency distributions governed by the crossover length-dependent frequency.
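The bivariate construction described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names (`joint_length_frequency`, `marginals`, `conditional_frequency`) and the toy sentence are assumptions made only for illustration, and no fitting of gamma or power-law forms is attempted. It only shows how each word type is mapped to a (length, frequency) pair, and how the two marginals and the conditional distribution at fixed length follow from the joint counts.

```python
from collections import Counter


def joint_length_frequency(tokens):
    """Count word types per (length, absolute frequency) pair."""
    freq = Counter(tokens)  # absolute frequency of each word type
    # Each type contributes once, at (number of characters, frequency).
    return Counter((len(word), n) for word, n in freq.items())


def marginals(joint):
    """Return the type-length and type-frequency counts (the two marginals)."""
    length_counts, freq_counts = Counter(), Counter()
    for (length, n), n_types in joint.items():
        length_counts[length] += n_types
        freq_counts[n] += n_types
    return length_counts, freq_counts


def conditional_frequency(joint, length):
    """Empirical distribution of frequency among types of a given length."""
    counts = {n: t for (l, n), t in joint.items() if l == length}
    total = sum(counts.values())
    if total == 0:
        return {}  # no types of this length in the sample
    return {n: t / total for n, t in counts.items()}


# Toy input (hypothetical, far too small for any statistical conclusion).
tokens = "the cat and the dog and the bird saw a cat".split()
joint = joint_length_frequency(tokens)
length_counts, freq_counts = marginals(joint)
cond_3 = conditional_frequency(joint, 3)
```

On a real corpus, `length_counts` would be the empirical counterpart of the type-length distribution, `freq_counts` that of the type-frequency distribution, and `conditional_frequency(joint, length)` that of the conditional distribution whose length dependence expresses the brevity-frequency phenomenon.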