Institute of Physics of São Carlos, University of São Paulo, São Carlos, São Paulo, Brazil.
PLoS One. 2013 Jul 2;8(7):e67310. doi: 10.1371/journal.pone.0067310. Print 2013.
While statistical physics methods applied to large corpora have unveiled many patterns in texts, no comprehensive investigation has been performed on the interdependence between syntactic and semantic factors. In this study we propose a framework for determining whether a text (e.g., one written in an unknown alphabet) is compatible with a natural language and, if so, to which language it could belong. The approach relies on three types of statistical measurements: first-order statistics of word properties in a text, the topology of complex networks representing texts, and intermittency concepts in which the text is treated as a time series. Comparative experiments were performed with the New Testament in 15 different languages and with distinct books in English and Portuguese, in order to quantify how much each measurement depends on the language and on the story being told in the book. The metrics found to be informative in distinguishing real texts from their shuffled versions include assortativity, degree and selectivity of words. As an illustration, we analyze an undeciphered medieval manuscript known as the Voynich Manuscript. We show that it is mostly compatible with natural languages and incompatible with random texts. We also obtain candidate keywords for the Voynich Manuscript, which could be helpful in the effort to decipher it. Because we were able to identify statistical measurements that depend more on syntax than on semantics, the framework may also serve for text analysis in language-dependent applications.
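To make two of the measurement families above concrete, the following is a minimal sketch, not the authors' implementation. It assumes an undirected network in which consecutive words are linked, takes assortativity as the Pearson correlation between the degrees at the two ends of each edge, and takes intermittency as the coefficient of variation of the gaps between successive occurrences of a word. All function names (`adjacency_network`, `degree_assortativity`, `intermittency`, `shuffled`) and the toy sentence are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict
from statistics import mean, pstdev


def adjacency_network(words):
    """Undirected graph linking each pair of consecutive, distinct words (assumed model)."""
    neighbors = defaultdict(set)
    for a, b in zip(words, words[1:]):
        if a != b:
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors


def degree_assortativity(neighbors):
    """Pearson correlation between the degrees at the two ends of each edge."""
    degree = {w: len(ns) for w, ns in neighbors.items()}
    xs, ys = [], []
    # Each undirected edge is visited in both directions, so xs and ys
    # share the same mean and variance.
    for w, ns in neighbors.items():
        for n in ns:
            xs.append(degree[w])
            ys.append(degree[n])
    if not xs:
        return 0.0
    mx = mean(xs)
    var = mean((x - mx) ** 2 for x in xs)
    if var == 0:
        return 0.0  # all nodes have the same degree; correlation is undefined
    cov = mean((x - mx) * (y - mx) for x, y in zip(xs, ys))
    return cov / var


def intermittency(words, target):
    """Coefficient of variation of the gaps between occurrences of `target`."""
    positions = [i for i, w in enumerate(words) if w == target]
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    if len(gaps) < 2:
        return None  # too few occurrences to estimate a spread
    return pstdev(gaps) / mean(gaps)


def shuffled(words, seed=0):
    """Return a shuffled copy of the word sequence (same words, random order)."""
    rng = random.Random(seed)
    out = list(words)
    rng.shuffle(out)
    return out


if __name__ == "__main__":
    # Toy input only; a real comparison needs a full book and many shufflings.
    text = "the cat sat on the mat and the dog sat on the log".split()
    for label, seq in (("original", text), ("shuffled", shuffled(text))):
        net = adjacency_network(seq)
        print(label, degree_assortativity(net), intermittency(seq, "the"))
```

Comparing each value for the original sequence against the values obtained over many shuffled copies is one simple way to judge whether a measurement reflects word order rather than word frequency alone.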