Baeza-Blancas Edgar, Obregón-Quintana Bibiana, Hernández-Gómez Candelario, Gómez-Meléndez Domingo, Aguilar-Velázquez Daniel, Liebovitch Larry S, Guzmán-Vargas Lev
Departamento de Física, Escuela Superior de Física y Matemáticas, Ciudad de México 07738, Mexico.
Unidad Profesional Interdisciplinaria en Ingeniería y Tecnologías Avanzadas, Instituto Politécnico Nacional, Ciudad de México 07340, Mexico.
Entropy (Basel). 2019 May 23;21(5):517. doi: 10.3390/e21050517.
We present a study of natural language using the recurrence network method. In our approach, the repetition of character patterns is evaluated without considering the word structure of written texts from different natural languages. Our dataset comprises 85 eBooks written in 17 different European languages. The similarity between patterns of a fixed length is determined by the Hamming distance, and a threshold value defines a match between two patterns, i.e., a repetition occurs if the Hamming distance is less than or equal to the given threshold. In this way, we calculate the adjacency matrix, where a connection between two nodes exists when a match occurs. Next, the recurrence network is constructed for the texts and some representative network metrics are calculated. Our results show that the average values of network density, clustering, and assortativity are larger than those of their corresponding shuffled versions, while for metrics such as closeness, both original and random sequences exhibit similar values. Moreover, our calculations show similar average density values among languages that belong to the same linguistic family. In addition, a linear discriminant analysis leads to well-separated clusters of language families based on the network-density properties. Finally, we discuss our results in the context of the general characteristics of written texts.
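The construction described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the parameter names `length` and `threshold`, the use of overlapping character windows, and the exclusion of self-connections are all assumptions made for illustration.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def recurrence_network(text, length, threshold):
    """Build a symmetric 0/1 adjacency matrix from character patterns.

    Assumption: patterns are overlapping windows of `length` characters,
    and a node is never connected to itself.
    """
    patterns = [text[i:i + length] for i in range(len(text) - length + 1)]
    n = len(patterns)
    adj = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        # Two patterns match when their Hamming distance
        # is less than or equal to the threshold.
        if hamming(patterns[i], patterns[j]) <= threshold:
            adj[i][j] = adj[j][i] = 1
    return adj


def density(adj):
    """Fraction of possible node pairs that are connected."""
    n = len(adj)
    if n < 2:
        return 0.0
    edges = sum(sum(row) for row in adj) / 2
    return edges / (n * (n - 1) / 2)


if __name__ == "__main__":
    adj = recurrence_network("abab", length=2, threshold=0)
    print(adj)           # patterns are "ab", "ba", "ab"
    print(density(adj))  # one edge out of three possible pairs
```

The pairwise comparison is quadratic in the number of patterns, so this sketch suits short excerpts only; a book-length text would need a more economical approach.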