Alvarez-Lacalle E, Dorow B, Eckmann J-P, Moses E
Department of Physics of Complex Systems and Albert Einstein Minerva Center for Theoretical Physics, The Weizmann Institute of Science, Rehovot 76100, Israel.
Proc Natl Acad Sci U S A. 2006 May 23;103(21):7956-61. doi: 10.1073/pnas.0510673103. Epub 2006 May 12.
Thoughts and ideas are multidimensional and often concurrent, yet they can be expressed surprisingly well sequentially by the translation into language. This reduction of dimensions occurs naturally but requires memory and necessitates the existence of correlations, e.g., in written text. However, correlations in word appearance decay quickly, while previous observations of long-range correlations using random walk approaches yield little insight on memory or on semantic context. Instead, we study combinations of words that a reader is exposed to within a "window of attention," spanning about 100 words. We define a vector space of such word combinations by looking at words that co-occur within the window of attention, and analyze its structure. Singular value decomposition of the co-occurrence matrix identifies a basis whose vectors correspond to specific topics, or "concepts" that are relevant to the text. As the reader follows a text, the "vector of attention" traces out a trajectory of directions in this "concept space." We find that memory of the direction is retained over long times, forming power-law correlations. The appearance of power laws hints at the existence of an underlying hierarchical network. Indeed, imposing a hierarchy similar to that defined by volumes, chapters, paragraphs, etc. succeeds in creating correlations in a surrogate random text that are identical to those of the original text. We conclude that hierarchical structures in text serve to create long-range correlations, and use the reader's memory in reenacting some of the multidimensionality of the thoughts being expressed.
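The pipeline the abstract describes can be sketched as follows. This is a minimal illustration, not the authors' code: it builds a word co-occurrence matrix within a sliding "window of attention," takes an SVD of that matrix so the leading singular vectors play the role of "concepts," projects each window onto that basis to get a unit "vector of attention," and measures how well the direction is remembered across a time lag. The toy text and all parameter values (window size, number of concepts, lag) are illustrative assumptions, not the paper's.

```python
import numpy as np

def cooccurrence_matrix(words, window):
    """Count co-occurrences of word pairs inside a sliding window."""
    vocab = sorted(set(words))
    index = {w: i for i, w in enumerate(vocab)}
    C = np.zeros((len(vocab), len(vocab)))
    for start in range(len(words) - window + 1):
        chunk = [index[w] for w in words[start:start + window]]
        for i in chunk:
            for j in chunk:
                if i != j:
                    C[i, j] += 1
    return C, index

def concept_basis(C, k):
    """SVD of the co-occurrence matrix; the leading left singular
    vectors serve as 'concept' directions."""
    U, s, Vt = np.linalg.svd(C)
    return U[:, :k]

def attention_trajectory(words, index, basis, window):
    """Project each window's bag of words onto the concept basis and
    normalize, giving a unit 'vector of attention' at each step."""
    traj = []
    for start in range(len(words) - window + 1):
        v = np.zeros(len(index))
        for w in words[start:start + window]:
            v[index[w]] += 1
        p = basis.T @ v
        norm = np.linalg.norm(p)
        if norm > 0:
            traj.append(p / norm)
    return np.array(traj)

def direction_correlation(traj, lag):
    """Mean cosine similarity between attention vectors lag steps apart:
    how strongly the direction in concept space is remembered."""
    dots = np.sum(traj[:-lag] * traj[lag:], axis=1)
    return float(np.mean(dots))

# Toy usage: two repeated "topics" in sequence (illustrative only).
words = ("the cat sat on the mat " * 30 + "the dog ran in the park " * 30).split()
window = 10  # the paper's window of attention is ~100 words
C, index = cooccurrence_matrix(words, window)
basis = concept_basis(C, k=2)
traj = attention_trajectory(words, index, basis, window)
print(direction_correlation(traj, lag=5))
```

In the paper this direction memory, measured on real texts, decays as a power law; the sketch only shows how such a correlation function would be computed.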