Random texts do not exhibit the real Zipf's law-like rank distribution.

Author information

Departament de Llenguatges i Sistemes Informàtics, Universitat Politècnica de Catalunya, Barcelona, Catalonia, Spain.

Publication information

PLoS One. 2010 Mar 9;5(3):e9411. doi: 10.1371/journal.pone.0009411.

Abstract

BACKGROUND: Zipf's law states that the relationship between the frequency of a word in a text and its rank (the most frequent word has rank 1, the 2nd most frequent word has rank 2, ...) is approximately linear when plotted on a double logarithmic scale. It has been argued that the law is not a relevant or useful property of language because simple random texts - constructed by concatenating random characters, including blanks that act as word delimiters - exhibit a Zipf's law-like word rank distribution.
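To make the construction concrete, here is a minimal Python sketch of a random text with equally likely letters and a blank acting as the word delimiter, followed by its rank-frequency table. The alphabet, the blank probability, the text length, and the function names `random_text` and `rank_frequency` are illustrative assumptions, not values or code from the article.

```python
import random
from collections import Counter


def random_text(n_chars, alphabet="abcde", space_prob=0.2, seed=0):
    """Concatenate random characters; the blank acts as a word delimiter.

    The alphabet and space_prob are assumed values chosen for illustration.
    """
    rng = random.Random(seed)
    letter_weight = (1.0 - space_prob) / len(alphabet)
    symbols = list(alphabet) + [" "]
    weights = [letter_weight] * len(alphabet) + [space_prob]
    return "".join(rng.choices(symbols, weights=weights, k=n_chars))


def rank_frequency(text):
    """Return (rank, frequency) pairs; rank 1 is the most frequent word."""
    counts = Counter(text.split())  # split() ignores runs of blanks
    freqs = sorted(counts.values(), reverse=True)
    return list(enumerate(freqs, start=1))


text = random_text(100_000)
for rank, freq in rank_frequency(text)[:10]:
    print(rank, freq)
```

Plotting the logarithm of frequency against the logarithm of rank for these pairs gives the kind of curve that has been compared with real texts. Whether that curve is statistically consistent with a real text is the question the article examines; this sketch does not test it.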

METHODOLOGY/PRINCIPAL FINDINGS: In this article, we examine the flaws of such putative good fits of random texts. We demonstrate - by means of three different statistical tests - that ranks derived from random texts and ranks derived from real texts are statistically inconsistent with the parameters employed to argue for such a good fit, even when the parameters are inferred from the target real text. Our findings are valid both for the simplest random texts composed of equally likely characters and for more elaborate and realistic versions where character probabilities are borrowed from a real text.
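The more elaborate variant, where character probabilities are borrowed from a real text, can be sketched in Python as follows. The sample sentence, the restriction to letters and blanks, and the function names `char_probabilities` and `random_text_from` are assumptions made for illustration. The article's three statistical tests are not reproduced here.

```python
import random
from collections import Counter


def char_probabilities(real_text):
    """Estimate character probabilities (letters and blank) from a real text."""
    cleaned = " ".join(real_text.lower().split())  # collapse whitespace runs
    counts = Counter(ch for ch in cleaned if ch.isalpha() or ch == " ")
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}


def random_text_from(probs, n_chars, seed=0):
    """Draw characters independently according to the borrowed probabilities."""
    rng = random.Random(seed)
    symbols = list(probs)
    weights = [probs[s] for s in symbols]
    return "".join(rng.choices(symbols, weights=weights, k=n_chars))


# Hypothetical stand-in for a real text; any corpus string could be used.
sample = "the quick brown fox jumps over the lazy dog and the cat"
probs = char_probabilities(sample)
fake = random_text_from(probs, 200)
print(fake)
```

Each character is drawn independently, so the output keeps the character frequencies of the source but none of its word structure.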

CONCLUSIONS/SIGNIFICANCE: The good fit of random texts to real Zipf's law-like rank distributions has not yet been established. Therefore, we suggest that Zipf's law might in fact be a fundamental law in natural languages.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8ef7/2834740/a36b2e7b71b5/pone.0009411.g001.jpg
