Silva Jorge M, Pinho Eduardo, Matos Sérgio, Pratas Diogo
Institute of Electronics and Informatics Engineering of Aveiro, University of Aveiro, 3810-193 Aveiro, Portugal.
Department of Electronics, Telecommunications and Informatics, University of Aveiro, 3810-193 Aveiro, Portugal.
Entropy (Basel). 2020 Jan 16;22(1):105. doi: 10.3390/e22010105.
Sources that generate symbolic sequences with algorithmic nature may differ in statistical complexity because they create structures that follow algorithmic schemes, rather than generating symbols from a probabilistic function assuming independence. In the case of Turing machines, this means that machines with the same algorithmic complexity can create tapes with different statistical complexity. In this paper, we use a compression-based approach to measure global and local statistical complexity of specific Turing machine tapes with the same number of states and alphabet. Both measures are estimated using the best-order Markov model. For the global measure, we use the Normalized Compression (NC), while, for the local measures, we define and use normal and dynamic complexity profiles to quantify and localize lower and higher regions of statistical complexity. We assessed the validity of our methodology on synthetic and real genomic data showing that it is tolerant to increasing rates of editions and block permutations. Regarding the analysis of the tapes, we localize patterns of higher statistical complexity in two regions, for a different number of machine states. We show that these patterns are generated by a decrease of the tape's amplitude, given the setting of small rule cycles. Additionally, we performed a comparison with a measure that uses both algorithmic and statistical approaches (BDM) for analysis of the tapes. Naturally, BDM is efficient given the algorithmic nature of the tapes. However, for a higher number of states, BDM is progressively approximated by our methodology. Finally, we provide a simple algorithm to increase the statistical complexity of a Turing machine tape while retaining the same algorithmic complexity. We supply a publicly available implementation of the algorithm in C++ language under the GPLv3 license. All results can be reproduced in full with scripts provided at the repository.
具有算法性质的符号序列生成源在统计复杂性上可能存在差异,因为它们创建的结构遵循算法方案,而不是从假设独立的概率函数生成符号。对于图灵机而言,这意味着具有相同算法复杂性的机器可以创建具有不同统计复杂性的磁带。在本文中,我们使用基于压缩的方法来测量具有相同状态数和字母表的特定图灵机磁带的全局和局部统计复杂性。这两种测量方法均使用最佳阶马尔可夫模型进行估计。对于全局测量,我们使用归一化压缩(NC),而对于局部测量,我们定义并使用正态和动态复杂性轮廓来量化和定位统计复杂性的较低和较高区域。我们在合成和真实基因组数据上评估了我们方法的有效性,结果表明它能容忍不断增加的编辑率和块置换率。关于磁带的分析,对于不同数量的机器状态,我们在两个区域定位了统计复杂性较高的模式。我们表明,在小规则周期设置的情况下,这些模式是由磁带幅度的减小产生的。此外,我们与一种使用算法和统计方法(BDM)对磁带进行分析的测量方法进行了比较。自然地,鉴于磁带的算法性质,BDM是有效的。然而,对于更多的状态,BDM逐渐被我们的方法所逼近。最后,我们提供了一种简单的算法来增加图灵机磁带的统计复杂性,同时保持相同的算法复杂性。我们在GPLv3许可下以C++语言提供了该算法的公开可用实现。所有结果都可以使用存储库中提供的脚本完全重现。