Doron Sivan, Misha Tsodyks
Department of Brain Sciences, Weizmann Institute of Science, Rehovot 76100, Israel.
School of Natural Sciences, Institute for Advanced Study, Princeton, NJ 08540.
Proc Natl Acad Sci U S A. 2025 Jun 24;122(25):e2502353122. doi: 10.1073/pnas.2502353122. Epub 2025 Jun 18.
In Shannon's seminal paper, the entropy of printed English, treated as a stationary stochastic process, was estimated to be roughly 1 bit per character. However, considered as a means of communication, language differs considerably from its printed form: i) the units of information are not characters or even words but clauses, i.e., the shortest meaningful segments of speech; and ii) what is transmitted is principally the meaning of what is said or written, while the precise phrasing used to communicate that meaning is typically ignored. In this study, we show that recently developed large language models can be leveraged to quantify the information communicated in meaningful narratives in terms of bits of meaning per clause.
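To make the idea of scoring clauses in bits concrete, the sketch below rates each clause of a short narrative by its surprisal, -log2 P(clause | preceding text), under an off-the-shelf causal language model. This is a minimal illustration, not the authors' method: the model choice ("gpt2"), the helper clause_surprisal_bits, and the example clauses are all assumptions introduced here.

import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model choice; the paper does not prescribe a specific LLM.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def clause_surprisal_bits(context: str, clause: str) -> float:
    """Hypothetical helper: -log2 P(clause | context) under the model."""
    ctx_len = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(context + clause, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Log-probability of each token given all preceding tokens (natural log).
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, 1:]
    token_lp = log_probs[torch.arange(targets.numel()), targets]
    # Sum only over the clause's tokens, then convert nats to bits.
    start = max(ctx_len - 1, 0)
    return -token_lp[start:].sum().item() / math.log(2)

# Example narrative split into clauses (assumed segmentation).
clauses = ["The storm hit at midnight,",
           " flooding the lower streets,",
           " and the town lost power until dawn."]
context = tokenizer.bos_token  # condition the first clause on a start token
for clause in clauses:
    print(f"{clause_surprisal_bits(context, clause):6.1f} bits | {clause.strip()}")
    context += clause

Note that this sketch counts bits of wording: a predictable clause scores low, a surprising one high. The quantity the abstract describes targets bits of meaning instead, discounting the particular phrasing used to express a clause, which token-level surprisal only approximates.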