Dębowski Łukasz
Institute of Computer Science, Polish Academy of Sciences, ul. Jana Kazimierza 5, 01-248 Warszawa, Poland.
Entropy (Basel). 2018 Jan 26;20(2):85. doi: 10.3390/e20020085.
As we discuss, a stationary stochastic process is nonergodic when a random persistent topic can be detected in the infinite random text sampled from the process, whereas we call the process strongly nonergodic when an infinite sequence of independent random bits, called probabilistic facts, is needed to describe this topic completely. Replacing probabilistic facts with an algorithmically random sequence of bits, called algorithmic facts, we adapt this property back to ergodic processes. Subsequently, we call a process perigraphic if the number of algorithmic facts which can be inferred from a finite text sampled from the process grows like a power of the text length. We present a simple example of such a process. Moreover, we demonstrate an assertion, which we call the theorem about facts and words. This proposition states that the number of probabilistic or algorithmic facts which can be inferred from a text drawn from a process must be roughly smaller than the number of distinct word-like strings detected in this text by means of the Prediction by Partial Matching (PPM) compression algorithm. We also observe that the number of word-like strings for a sample of plays by Shakespeare follows an empirical stepwise power law, in stark contrast to Markov processes. Hence, we suppose that natural language considered as a process is not only non-Markov but also perigraphic.
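The power-law growth of vocabulary mentioned in the abstract can be illustrated with a toy experiment. The sketch below is not the paper's PPM-based detection of word-like strings; it simply counts distinct word tokens in growing prefixes of a synthetic Zipf-distributed text and fits an exponent on a log-log scale. All names (`vocab_growth`, `fit_power_law`) and the corpus parameters are illustrative assumptions.

```python
# Illustrative sketch only, NOT the paper's PPM-based method:
# estimate how the number of distinct words V(n) grows with text
# length n, and fit a power law V(n) ~ C * n^beta on log-log scale.
import math
import random

def vocab_growth(tokens, checkpoints):
    """Return (n, V(n)) pairs: distinct-token counts at given prefix lengths."""
    seen = set()
    out = []
    cp = iter(sorted(checkpoints))
    target = next(cp, None)
    for i, tok in enumerate(tokens, 1):
        seen.add(tok)
        if i == target:
            out.append((i, len(seen)))
            target = next(cp, None)
    return out

def fit_power_law(pairs):
    """Least-squares slope of log V(n) against log n, i.e. the exponent beta."""
    xs = [math.log(n) for n, _ in pairs]
    ys = [math.log(v) for _, v in pairs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Toy corpus: Zipf-like word frequencies yield roughly power-law
# vocabulary growth (Heaps' law), with an exponent between 0 and 1.
random.seed(0)
words = [f"w{k}" for k in range(1, 5001)]
weights = [1.0 / k for k in range(1, 5001)]
text = random.choices(words, weights=weights, k=50_000)

pairs = vocab_growth(text, [1000, 2000, 5000, 10_000, 20_000, 50_000])
beta = fit_power_law(pairs)
print(f"estimated exponent beta ~ {beta:.2f}")
```

For a text sampled from a finite-order Markov process over words with a non-Zipfian stationary distribution, the same fit would flatten out as the vocabulary saturates, which is the kind of contrast the abstract draws with Shakespeare's plays.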