The Brevity Law as a Scaling Law, and a Possible Origin of Zipf's Law for Word Frequencies.

Authors

Corral Álvaro, Serra Isabel

Affiliations

Centre de Recerca Matemàtica, Edifici C, Campus Bellaterra, E-08193 Barcelona, Spain.

Departament de Matemàtiques, Facultat de Ciències, Universitat Autònoma de Barcelona, E-08193 Barcelona, Spain.

Publication information

Entropy (Basel). 2020 Feb 17;22(2):224. doi: 10.3390/e22020224.

Abstract

An important body of quantitative linguistics is constituted by a series of statistical laws about language usage. Despite the importance of these linguistic laws, some of them are poorly formulated and, more importantly, there is no unified framework that encompasses all of them. This paper presents a new perspective for connecting different statistical linguistic laws. Characterizing each word type by two random variables, length (in number of characters) and absolute frequency, we show that the corresponding bivariate joint probability distribution displays a rich and precise phenomenology: the type-length and type-frequency distributions are its two marginals, and the conditional distribution of frequency at fixed length provides a clear formulation of the brevity-frequency phenomenon. The type-length distribution turns out to be well fitted by a gamma distribution (much better than by the previously proposed lognormal), and the conditional frequency distributions at fixed length display power-law decay with a fixed exponent α ≃ 1.4 and a characteristic-frequency crossover that scales as an inverse power δ ≃ 2.8 of length. This implies the fulfillment of a scaling law analogous to those found in the thermodynamics of critical phenomena. As a by-product, we find a possible model-free explanation for the origin of Zipf's law, which should arise as a mixture of conditional frequency distributions governed by the length-dependent crossover frequency.
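
The scaling statement in the abstract can be written compactly. The form below is only a sketch inferred from the abstract's wording: the symbols n (frequency), ℓ (length), θ(ℓ) (crossover frequency) and the scaling function g are notational assumptions introduced here, and the behavior of g beyond the crossover is left unspecified because the abstract does not state it.

```latex
% Conditional frequency distribution at fixed length \ell (assumed notation):
% a single scaling function g, with the crossover frequency \theta(\ell)
% decreasing as an inverse power \delta of the length.
\begin{align}
  f(n \mid \ell) &\simeq \ell^{\delta\alpha}\, g\!\left(n\,\ell^{\delta}\right),
  & \theta(\ell) &\propto \ell^{-\delta}, \\
  g(x) &\propto x^{-\alpha} \quad \text{for } x \ll 1,
  & \alpha &\simeq 1.4, \quad \delta \simeq 2.8 .
\end{align}
% Check: for n \ll \theta(\ell) the length dependence cancels,
%   \ell^{\delta\alpha} (n\,\ell^{\delta})^{-\alpha} = n^{-\alpha},
% so every length shares the same power-law decay below its crossover.

% The type-frequency marginal is the mixture of the conditionals,
% weighted by the type-length distribution f(\ell) (law of total probability):
\begin{equation}
  f(n) = \sum_{\ell} f(\ell)\, f(n \mid \ell).
\end{equation}
```

Under these assumptions, plotting ℓ^{-δα} f(n | ℓ) against n ℓ^{δ} would collapse the curves for different lengths onto the single function g, which is the sense in which the brevity law acts as a scaling law.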

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/74a3/7516654/f05b9d1dc6c0/entropy-22-00224-g001.jpg
