Random texts do not exhibit the real Zipf's law-like rank distribution.

Affiliation

Departament de Llenguatges i Sistemes Informàtics, Universitat Politècnica de Catalunya, Barcelona, Catalonia, Spain.

Publication information

PLoS One. 2010 Mar 9;5(3):e9411. doi: 10.1371/journal.pone.0009411.

DOI:10.1371/journal.pone.0009411

PMID:20231884

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2834740/

Abstract

BACKGROUND: Zipf's law states that the relationship between the frequency of a word in a text and its rank (the most frequent word has rank 1, the 2nd most frequent word has rank 2, ...) is approximately linear when plotted on a double logarithmic scale. It has been argued that the law is not a relevant or useful property of language because simple random texts - constructed by concatenating random characters including blanks behaving as word delimiters - exhibit a Zipf's law-like word rank distribution.
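The random-text construction described above can be illustrated with a minimal Python sketch. It is not the authors' code: the alphabet, the blank probability `p_space`, the text length and the function names are all assumptions chosen for illustration. It concatenates equally likely letters plus a blank, splits on blanks, and lists word frequencies by rank.

```python
import random
from collections import Counter


def random_text(n_chars, alphabet="abcde", p_space=0.2, seed=0):
    """Concatenate random characters; a blank acts as the word delimiter.

    Assumption: all letters are equally likely and share the
    probability mass left over after the blank (1 - p_space).
    """
    rng = random.Random(seed)
    letter_weight = (1.0 - p_space) / len(alphabet)
    symbols = list(alphabet) + [" "]
    weights = [letter_weight] * len(alphabet) + [p_space]
    return "".join(rng.choices(symbols, weights=weights, k=n_chars))


def rank_frequency(text):
    """Return word frequencies in decreasing order (index 0 is rank 1)."""
    counts = Counter(text.split())  # split() ignores runs of blanks
    return sorted(counts.values(), reverse=True)


if __name__ == "__main__":
    freqs = rank_frequency(random_text(100_000))
    # Print the first few (rank, frequency) pairs; plotting them on a
    # double logarithmic scale gives the curve the abstract refers to.
    for rank, freq in enumerate(freqs[:10], start=1):
        print(rank, freq)
```

Plotting `freqs` against rank on log-log axes gives a roughly linear-looking curve, which is the visual impression the article goes on to test statistically.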

METHODOLOGY/PRINCIPAL FINDINGS: In this article, we examine the flaws of such putative good fits of random texts. We demonstrate - by means of three different statistical tests - that ranks derived from random texts and ranks derived from real texts are statistically inconsistent with the parameters employed to argue for such a good fit, even when the parameters are inferred from the target real text. Our findings are valid for both the simplest random texts composed of equally likely characters as well as more elaborate and realistic versions where character probabilities are borrowed from a real text.
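The "more elaborate" random texts mentioned above, where character probabilities are borrowed from a real text, could be sketched as follows. This is a hedged illustration only: the helper names and the sample sentence are hypothetical, and the article's actual procedure (for example its handling of case or punctuation) may differ.

```python
import random
from collections import Counter


def char_probabilities(real_text):
    """Estimate each character's probability (blank included) from a text."""
    counts = Counter(real_text)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}


def random_text_from(real_text, n_chars, seed=0):
    """Draw characters independently with the probabilities of real_text."""
    probs = char_probabilities(real_text)
    symbols = list(probs)
    weights = [probs[s] for s in symbols]
    rng = random.Random(seed)
    return "".join(rng.choices(symbols, weights=weights, k=n_chars))


if __name__ == "__main__":
    # Hypothetical stand-in for a real text; any long string would do.
    sample = "the quick brown fox jumps over the lazy dog"
    print(random_text_from(sample, 80))
```

Characters are drawn independently here, so the output keeps the real text's character frequencies but none of its word structure.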

CONCLUSIONS/SIGNIFICANCE: The good fit of random texts to real Zipf's law-like rank distributions has not yet been established. Therefore, we suggest that Zipf's law might in fact be a fundamental law in natural languages.

Similar articles

Zipf's Law for Word Frequencies: Word Forms versus Lemmas in Long Texts.

PLoS One. 2015 Jul 9;10(7):e0129031. doi: 10.1371/journal.pone.0129031. eCollection 2015.

Large-Scale Analysis of Zipf's Law in English Texts.

PLoS One. 2016 Jan 22;11(1):e0147073. doi: 10.1371/journal.pone.0147073. eCollection 2016.

Zipf's law revisited: Spoken dialog, linguistic units, parameters, and the principle of least effort.

Psychon Bull Rev. 2023 Feb;30(1):77-101. doi: 10.3758/s13423-022-02142-9. Epub 2022 Jul 15.

Zipf's Law and Avoidance of Excessive Synonymy.

Cogn Sci. 2008 Oct;32(7):1075-98. doi: 10.1080/03640210802020003.

Zipf's word frequency law in natural language: a critical review and future directions.

Psychon Bull Rev. 2014 Oct;21(5):1112-30. doi: 10.3758/s13423-014-0585-6.

Solvable null model for the distribution of word frequencies.

Phys Rev E Stat Nonlin Soft Matter Phys. 2004 Oct;70(4 Pt 1):042901. doi: 10.1103/PhysRevE.70.042901. Epub 2004 Oct 25.

Zipf's Law Arises Naturally When There Are Underlying, Unobserved Variables.

PLoS Comput Biol. 2016 Dec 20;12(12):e1005110. doi: 10.1371/journal.pcbi.1005110. eCollection 2016 Dec.

Zipf's law leads to Heaps' law: analyzing their relation in finite-size systems.

PLoS One. 2010 Dec 2;5(12):e14139. doi: 10.1371/journal.pone.0014139.

The languages of health in general practice electronic patient records: a Zipf's law analysis.

J Biomed Semantics. 2014 Jan 10;5(1):2. doi: 10.1186/2041-1480-5-2.

Cited by

Application of elementary probability models for text homogeneity and segmentation: A case study of Bible.

PLoS One. 2024 Jun 7;19(6):e0303432. doi: 10.1371/journal.pone.0303432. eCollection 2024.

Language-like efficiency and structure in house finch song.

Proc Biol Sci. 2024 Apr 10;291(2020):20240250. doi: 10.1098/rspb.2024.0250. Epub 2024 Apr 3.

On the fractal patterns of language structures.

PLoS One. 2023 May 18;18(5):e0285630. doi: 10.1371/journal.pone.0285630. eCollection 2023.

Efficiency in human languages: Corpus evidence for universal principles.

Linguist Vanguard. 2021 Apr 21;7(Suppl3):20200081. doi: 10.1515/lingvan-2020-0081. eCollection 2021 May 1.

Zipf's law revisited: Spoken dialog, linguistic units, parameters, and the principle of least effort.

Psychon Bull Rev. 2023 Feb;30(1):77-101. doi: 10.3758/s13423-022-02142-9. Epub 2022 Jul 15.

The Voynich manuscript: Symbol roles revisited.

PLoS One. 2022 Jan 27;17(1):e0260948. doi: 10.1371/journal.pone.0260948. eCollection 2022.

Linguistic laws in biology.

Trends Ecol Evol. 2022 Jan;37(1):53-66. doi: 10.1016/j.tree.2021.08.012. Epub 2021 Sep 28.

Entropy Estimation Using a Linguistic Zipf-Mandelbrot-Li Model for Natural Sequences.

Entropy (Basel). 2021 Aug 24;23(9):1100. doi: 10.3390/e23091100.

Can Menzerath's law be a criterion of complexity in communication?

PLoS One. 2021 Aug 20;16(8):e0256133. doi: 10.1371/journal.pone.0256133. eCollection 2021.

From Boltzmann to Zipf through Shannon and Jaynes.

Entropy (Basel). 2020 Feb 5;22(2):179. doi: 10.3390/e22020179.

References

Zipf's Law and Avoidance of Excessive Synonymy.

Cogn Sci. 2008 Oct;32(7):1075-98. doi: 10.1080/03640210802020003.

Some effects of intermittent silence.

Am J Psychol. 1957 Jun;70(2):311-4.

Least effort and the origins of scaling in human language.

Proc Natl Acad Sci U S A. 2003 Feb 4;100(3):788-91. doi: 10.1073/pnas.0335980100. Epub 2003 Jan 22.

Spoken word production: a theory of lexical access.

Proc Natl Acad Sci U S A. 2001 Nov 6;98(23):13464-71. doi: 10.1073/pnas.231459498.
