Statistical mechanics of letters in words.

Author Information

Stephens Greg J, Bialek William

Affiliations

Joseph Henry Laboratories of Physics, Princeton University, Princeton, New Jersey 08544, USA.

Publication Information

Phys Rev E Stat Nonlin Soft Matter Phys. 2010 Jun;81(6 Pt 2):066119. doi: 10.1103/PhysRevE.81.066119. Epub 2010 Jun 25.

Abstract

We consider words as a network of interacting letters, and approximate the probability distribution of states taken on by this network. Despite the intuition that the rules of English spelling are highly combinatorial and arbitrary, we find that maximum entropy models consistent with pairwise correlations among letters provide a surprisingly good approximation to the full statistics of words, capturing ∼92% of the multi-information in four-letter words and even "discovering" words that were not represented in the data. These maximum entropy models incorporate letter interactions through a set of pairwise potentials and thus define an energy landscape on the space of possible words. Guided by the large letter redundancy we seek a lower-dimensional encoding of the letter distribution and show that distinctions between local minima in the landscape account for ∼68% of the four-letter entropy. We suggest that these states provide an effective vocabulary which is matched to the frequency of word use and much smaller than the full lexicon.

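The "multi-information" the abstract refers to is the total correlation among letter positions: the difference between the entropy of a hypothetical independent-letter model and the true joint entropy of words. A minimal sketch of that quantity on a toy corpus (the word list here is illustrative, not the paper's data; the paper uses frequency-weighted English four-letter words):

```python
import math
from collections import Counter

# Toy corpus of four-letter words (hypothetical stand-in for the
# frequency-weighted English lexicon used in the paper).
corpus = ["that", "this", "then", "them", "when", "what", "here", "hers"]
n = len(corpus[0])

def entropy(counts):
    """Shannon entropy in bits of an empirical distribution."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Independent-letter model: sum of per-position marginal entropies.
marginals = [Counter(w[i] for w in corpus) for i in range(n)]
H_ind = sum(entropy(m) for m in marginals)

# Joint entropy of whole words.
H_joint = entropy(Counter(corpus))

# Multi-information: correlation among positions that any interaction
# model (e.g. the pairwise maximum entropy model) tries to capture.
I_multi = H_ind - H_joint
print(f"H_independent = {H_ind:.3f} bits, H_joint = {H_joint:.3f} bits")
print(f"multi-information = {I_multi:.3f} bits")
```

The pairwise maximum entropy model itself assigns each word an energy built from single-letter and pairwise potentials, E(w) = -Σᵢ hᵢ(wᵢ) - Σᵢ<ⱼ Jᵢⱼ(wᵢ, wⱼ), fitted so the model reproduces the empirical pairwise marginals; the paper's "∼92%" figure measures how much of the multi-information above that pairwise model recovers.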
