Maximum entropy, word-frequency, Chinese characters, and multiple meanings.

Author information

Yan Xiaoyong, Minnhagen Petter

Affiliations

Systems Science Institute, Beijing Jiaotong University, Beijing 100044, China; Big Data Research Center, University of Electronic Science and Technology of China, Chengdu 611731, China.

IceLab, Department of Physics, Umeå University, 901 87 Umeå, Sweden.

Publication information

PLoS One. 2015 May 8;10(5):e0125592. doi: 10.1371/journal.pone.0125592. eCollection 2015.

DOI:10.1371/journal.pone.0125592

PMID:25955175

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4425542/

Abstract

The word-frequency distribution of a text written by an author is well accounted for by a maximum entropy distribution, the RGF (random group formation)-prediction. The RGF-distribution is completely determined by the a priori values of the total number of words in the text (M), the number of distinct words (N), and the number of repetitions of the most common word (k(max)). It is shown here that this maximum entropy prediction also describes a text written in Chinese characters. In particular, it is shown that although the same Chinese text counted in words and counted in Chinese characters gives quite differently shaped distributions, both are nevertheless well predicted by their respective three a priori characteristic values. It is pointed out that this is analogous to the change in the shape of the distribution when a given text is translated into another language. Another consequence of the RGF-prediction is that taking a part of a long text changes the input parameters (M, N, k(max)) and consequently also the shape of the frequency distribution. This is explicitly confirmed for texts written in Chinese characters. Since the RGF-prediction contains no system-specific information beyond the three a priori values (M, N, k(max)), any specific language characteristic has to be sought in systematic deviations between the RGF-prediction and the measured frequencies. One such systematic deviation is identified and, through a statistical information-theoretical argument and an extended RGF-model, it is proposed that this deviation is caused by multiple meanings of Chinese characters. The effect is stronger for Chinese characters than for Chinese words. The relation between Zipf's law, the Simon-model for texts, and the present results is discussed.
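To make the three input values in the abstract above concrete, the following minimal Python sketch counts M, N and k(max) for one toy text, once at the word level and once at the character level, and then for a part of the same text. The function names `rgf_inputs` and `frequency_distribution` and the four-word sample are illustrative assumptions, not taken from the paper, and the word list is assumed to be already segmented. The sketch only measures the inputs and the observed distribution; it does not compute the RGF-prediction itself.

```python
from collections import Counter


def rgf_inputs(tokens):
    """Return (M, N, k_max) for a sequence of tokens (words or characters).

    M     = total number of tokens
    N     = number of distinct tokens
    k_max = number of occurrences of the most common token
    """
    counts = Counter(tokens)
    M = sum(counts.values())
    N = len(counts)
    k_max = max(counts.values()) if counts else 0
    return M, N, k_max


def frequency_distribution(tokens):
    """Return {k: number of distinct tokens that occur exactly k times}."""
    counts = Counter(tokens)
    return dict(Counter(counts.values()))


# Hypothetical toy text, assumed to be pre-segmented into words.
words = ["学生", "学习", "学生", "生活"]
# The same text viewed as a sequence of Chinese characters.
chars = list("".join(words))

print(rgf_inputs(words))       # (4, 3, 2)
print(rgf_inputs(chars))       # (8, 4, 3)
print(rgf_inputs(chars[:4]))   # (4, 3, 2): a part of the text has different inputs
print(frequency_distribution(chars))
```

Even in this tiny example the word-level and character-level counts of the same text give different (M, N, k(max)) triples, and so does a truncated part of the text. These are the two situations in which the abstract says the shape of the distribution changes.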
