Statistical mechanics of letters in words.

Authors

Stephens Greg J, Bialek William

Affiliation

Joseph Henry Laboratories of Physics, Princeton University, Princeton, New Jersey 08544, USA.

Publication information

Phys Rev E Stat Nonlin Soft Matter Phys. 2010 Jun;81(6 Pt 2):066119. doi: 10.1103/PhysRevE.81.066119. Epub 2010 Jun 25.

DOI:10.1103/PhysRevE.81.066119

PMID:20866490

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3648583/

Abstract

We consider words as a network of interacting letters, and approximate the probability distribution of states taken on by this network. Despite the intuition that the rules of English spelling are highly combinatorial and arbitrary, we find that maximum entropy models consistent with pairwise correlations among letters provide a surprisingly good approximation to the full statistics of words, capturing ∼92% of the multi-information in four-letter words and even "discovering" words that were not represented in the data. These maximum entropy models incorporate letter interactions through a set of pairwise potentials and thus define an energy landscape on the space of possible words. Guided by the large letter redundancy we seek a lower-dimensional encoding of the letter distribution and show that distinctions between local minima in the landscape account for ∼68% of the four-letter entropy. We suggest that these states provide an effective vocabulary which is matched to the frequency of word use and much smaller than the full lexicon.
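The multi-information mentioned in the abstract is commonly defined as the gap between the summed entropies of the individual letter positions and the entropy of the full word distribution. The sketch below computes that quantity for fixed-length words. It is only an illustration under stated assumptions: the function names (`entropy`, `multi_information`) and the toy word counts are hypothetical and not taken from the paper, and it does not fit the pairwise maximum entropy model itself.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a distribution given as a mapping of counts."""
    total = sum(counts.values())
    return -sum(
        (c / total) * math.log2(c / total) for c in counts.values() if c > 0
    )


def multi_information(word_counts):
    """Sum of per-position letter entropies minus the joint word entropy (bits).

    Assumes every key in word_counts is a word of the same length.
    """
    length = len(next(iter(word_counts)))
    # One letter-count table per position in the word.
    marginals = [Counter() for _ in range(length)]
    for word, count in word_counts.items():
        for position, letter in enumerate(word):
            marginals[position][letter] += count
    independent_entropy = sum(entropy(m) for m in marginals)
    return independent_entropy - entropy(word_counts)


# Hypothetical toy counts, not data from the paper.
toy_counts = {"that": 4, "this": 4}
print(multi_information(toy_counts))  # 1.0 bit for this toy example
```

In the toy example the last two letters each carry one bit on their own, but they are perfectly correlated, so the two-word distribution has only one bit of entropy and the remaining bit is multi-information. The abstract's "∼92%" figure refers to the share of this quantity that a pairwise model reproduces for real four-letter words.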

Similar articles

From Boltzmann to Zipf through Shannon and Jaynes.

Entropy (Basel). 2020 Feb 5;22(2):179. doi: 10.3390/e22020179.

Minimum and Maximum Entropy Distributions for Binary Systems with Known Means and Pairwise Correlations.

Entropy (Basel). 2017 Aug 21;19(8):427. doi: 10.3390/e19080427.

Silex: A database for silent-letter endings in French words.

Behav Res Methods. 2017 Oct;49(5):1894-1904. doi: 10.3758/s13428-016-0832-z.

A normalization model for repeated letters in social media hate speech text based on rules and spelling correction.

PLoS One. 2024 Mar 21;19(3):e0299652. doi: 10.1371/journal.pone.0299652. eCollection 2024.

Orthographic complexity and word naming in Italian: some words are more transparent than others.

Psychon Bull Rev. 2006 Apr;13(2):346-52. doi: 10.3758/bf03193855.

The effects of stimulus attributes upon latency of word recognition.

Br J Psychol. 1976 Aug;67(3):315-25. doi: 10.1111/j.2044-8295.1976.tb01518.x.

The word-detection effect: sophisticated guessing or perceptual enhancement?

Mem Cognit. 1996 May;24(3):331-41. doi: 10.3758/bf03213297.

Dominant words rise to the top by positive frequency-dependent selection.

Proc Natl Acad Sci U S A. 2019 Apr 9;116(15):7397-7402. doi: 10.1073/pnas.1816994116. Epub 2019 Mar 21.

Maximum entropy, word-frequency, Chinese characters, and multiple meanings.

PLoS One. 2015 May 8;10(5):e0125592. doi: 10.1371/journal.pone.0125592. eCollection 2015.

Cited by

Inferring Cultural Landscapes with the Inverse Ising Model.

Entropy (Basel). 2023 Jan 31;25(2):264. doi: 10.3390/e25020264.

The Brevity Law as a Scaling Law, and a Possible Origin of Zipf's Law for Word Frequencies.

Entropy (Basel). 2020 Feb 17;22(2):224. doi: 10.3390/e22020224.

From Boltzmann to Zipf through Shannon and Jaynes.

Entropy (Basel). 2020 Feb 5;22(2):179. doi: 10.3390/e22020179.

Criticality in Pareto Optimal Grammars?

Entropy (Basel). 2020 Jan 31;22(2):165. doi: 10.3390/e22020165.

Turbulence through the Spyglass of Bilocal Kinetics.

Entropy (Basel). 2018 Jul 20;20(7):539. doi: 10.3390/e20070539.

Unsupervised inference approach to facial attractiveness.

PeerJ. 2020 Oct 28;8:e10210. doi: 10.7717/peerj.10210. eCollection 2020.

Hamiltonian modelling of macro-economic urban dynamics.

R Soc Open Sci. 2020 Sep 23;7(9):200667. doi: 10.1098/rsos.200667. eCollection 2020 Sep.

Strong evidence of an information-theoretical conservation principle linking all discrete systems.

R Soc Open Sci. 2019 Oct 23;6(10):191101. doi: 10.1098/rsos.191101. eCollection 2019 Oct.

An introduction to the maximum entropy approach and its application to inference problems in biology.

Heliyon. 2018 Apr 13;4(4):e00596. doi: 10.1016/j.heliyon.2018.e00596. eCollection 2018 Apr.

Maximum entropy models capture melodic styles.

Sci Rep. 2017 Aug 23;7(1):9172. doi: 10.1038/s41598-017-08028-4.

References

Using the principle of entropy maximization to infer genetic interaction networks from gene expression patterns.

Proc Natl Acad Sci U S A. 2006 Dec 12;103(50):19033-8. doi: 10.1073/pnas.0609152103. Epub 2006 Nov 30.

The structure of multi-neuron firing patterns in primate retina.

J Neurosci. 2006 Aug 9;26(32):8254-66. doi: 10.1523/JNEUROSCI.1282-06.2006.

Weak pairwise correlations imply strongly correlated network states in a neural population.

Nature. 2006 Apr 20;440(7087):1007-12. doi: 10.1038/nature04701. Epub 2006 Apr 9.

The neural code for written words: a proposal.

Trends Cogn Sci. 2005 Jul;9(7):335-41. doi: 10.1016/j.tics.2005.05.004.

Network information and connected correlations.

Phys Rev Lett. 2003 Dec 5;91(23):238701. doi: 10.1103/PhysRevLett.91.238701. Epub 2003 Dec 2.

Neural networks and physical systems with emergent collective computational abilities.

Proc Natl Acad Sci U S A. 1982 Apr;79(8):2554-8. doi: 10.1073/pnas.79.8.2554.
