Quantifying the information in the long-range order of words: semantic structures and universal linguistic constraints.

Authors

Montemurro Marcelo A

Affiliation

The University of Manchester, Faculty of Life Sciences, Manchester, United Kingdom.

Publication information

Cortex. 2014 Jun;55:5-16. doi: 10.1016/j.cortex.2013.08.008. Epub 2013 Aug 29.

DOI:10.1016/j.cortex.2013.08.008

PMID:24074456

Abstract

We review some recent progress on the characterisation of long-range patterns of word use in language using methods from information theory. In particular, two levels of structure in language are considered. The first level corresponds to the patterns of word usage over different contextual domains. A direct application of information theory to quantify the specificity of words across different sections of a linguistic sequence leads to a measure of semantic information. Moreover, a natural scale emerges that characterises the typical size of semantic structures. Since the information measure is made up of additive contributions from individual words, it is possible to rank the words according to their overall weight in the total information. This allows the extraction of keywords most relevant to the semantic content of the sequence without any prior knowledge of the language. The second level considered is the complex structure of correlations among words in linguistic sequences. The degree of order in language can be quantified by means of the entropy. Reliable estimates of the entropy were obtained from corpora of texts from several linguistic families by means of lossless compression algorithms. The value of the entropy fluctuates across different languages since it depends on linguistic organisation at various levels. However, when a measure of relative entropy that specifically quantifies the degree of word ordering in language is estimated, it presents an almost constant value over all the linguistic families studied. This suggests that the entropy of word ordering is a novel quantitative linguistic universal.
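
The two levels described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation, and all function names are hypothetical. `word_information` assumes the semantic measure can be approximated by the mutual information between words and equal-sized parts of the text, which splits into one additive term per word; it applies no correction for finite-sample bias, so rare words can look artificially specific. `word_order_relative_entropy` assumes the ordering contribution can be approximated as the compressed size per word of a word-shuffled copy minus that of the original, with `zlib` standing in for whatever lossless compressor the study used.

```python
import math
import random
import zlib
from collections import Counter


def word_information(words, n_parts):
    """Per-word contributions (bits) to I(word; part of the text).

    Assumption: the text is cut into n_parts contiguous parts of equal
    length. No finite-sample bias correction is applied.
    """
    total = len(words)
    part_len = math.ceil(total / n_parts)
    word_counts = Counter(words)
    part_sizes = Counter()
    joint = Counter()
    for i, w in enumerate(words):
        j = i // part_len          # index of the part containing word i
        part_sizes[j] += 1
        joint[(w, j)] += 1

    info = Counter()
    for (w, j), n_wj in joint.items():
        p_w = word_counts[w] / total
        p_j_given_w = n_wj / word_counts[w]
        p_j = part_sizes[j] / total
        # Additive term: p(w) * p(j|w) * log2(p(j|w) / p(j))
        info[w] += p_w * p_j_given_w * math.log2(p_j_given_w / p_j)
    return info


def bits_per_word(words):
    """Crude entropy-rate proxy: compressed size in bits per word."""
    data = " ".join(words).encode("utf-8")
    return 8 * len(zlib.compress(data, 9)) / len(words)


def word_order_relative_entropy(words, seed=0):
    """Compression cost of destroying word order (bits per word).

    Assumption: shuffled-minus-original compressed size approximates
    the information carried by word ordering.
    """
    shuffled = list(words)
    random.Random(seed).shuffle(shuffled)
    return bits_per_word(shuffled) - bits_per_word(words)


if __name__ == "__main__":
    # Toy text: "the" is spread evenly, "sea" and "sky" are topic-specific.
    toy = ["the", "sea"] * 50 + ["the", "sky"] * 50
    for word, bits in word_information(toy, n_parts=2).most_common():
        print(word, bits)
    print(word_order_relative_entropy(toy))
```

In the toy text, the evenly spread word "the" contributes 0 bits, while "sea" and "sky" each contribute 0.25 bits, so ranking by contribution puts the topic-specific words first without any knowledge of the language. The compression figures are only indicative on short inputs, since the compressor's fixed overhead dominates.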

Similar articles

Universal entropy of word ordering across linguistic families.

PLoS One. 2011;6(5):e19875. doi: 10.1371/journal.pone.0019875. Epub 2011 May 13.

Entropy, semantic relatedness and proximity.

Behav Res Methods. 2011 Sep;43(3):746-60. doi: 10.3758/s13428-011-0087-7.

A Mathematical Model for Universal Semantics.

IEEE Trans Pattern Anal Mach Intell. 2022 Mar;44(3):1124-1132. doi: 10.1109/TPAMI.2020.3022533. Epub 2022 Feb 3.

Quantifying Semantic Linguistic Maturity in Children.

J Psycholinguist Res. 2016 Oct;45(5):1183-99. doi: 10.1007/s10936-015-9398-7.

The interaction of domain knowledge and linguistic structure in natural language processing: interpreting hypernymic propositions in biomedical text.

J Biomed Inform. 2003 Dec;36(6):462-77. doi: 10.1016/j.jbi.2003.11.003.

Semantic diversity: a measure of semantic ambiguity based on variability in the contextual usage of words.

Behav Res Methods. 2013 Sep;45(3):718-30. doi: 10.3758/s13428-012-0278-x.

On the universal structure of human lexical semantics.

Proc Natl Acad Sci U S A. 2016 Feb 16;113(7):1766-71. doi: 10.1073/pnas.1520752113. Epub 2016 Feb 1.

The role of corpus size and syntax in deriving lexico-semantic representations for a wide range of concepts.

Q J Exp Psychol (Hove). 2015;68(8):1643-64. doi: 10.1080/17470218.2014.994098. Epub 2015 Feb 26.

Understanding the spatial dimension of natural language by measuring the spatial semantic similarity of words through a scalable geospatial context window.

PLoS One. 2020 Jul 23;15(7):e0236347. doi: 10.1371/journal.pone.0236347. eCollection 2020.

Cited by

Evaluation of Error Production in Animal Fluency and Its Relationship to Frontal Tracts in Normal Aging and Mild Alzheimer's Disease: A Combined LDA and Time-Course Analysis Investigation.

Front Aging Neurosci. 2022 Jan 12;13:710938. doi: 10.3389/fnagi.2021.710938. eCollection 2021.

Using information-theoretic measures to characterize the structure of the writing system: the case of orthographic-phonological regularities in English.

Behav Res Methods. 2020 Jun;52(3):1292-1312. doi: 10.3758/s13428-019-01317-y.

Do neural nets learn statistical laws behind natural language?

PLoS One. 2017 Dec 29;12(12):e0189326. doi: 10.1371/journal.pone.0189326. eCollection 2017.

Long-Range Memory in Literary Texts: On the Universal Clustering of the Rare Words.

PLoS One. 2016 Nov 28;11(11):e0164658. doi: 10.1371/journal.pone.0164658. eCollection 2016.
