Dębowski Łukasz
Institute of Computer Science, Polish Academy of Sciences, ul. Jana Kazimierza 5, 01-248 Warszawa, Poland.
Entropy (Basel). 2021 Sep 1;23(9):1148. doi: 10.3390/e23091148.
We present a hypothetical argument against finite-state processes in statistical language modeling that is based on semantics rather than syntax. In this theoretical model, we suppose that the semantic properties of texts in a natural language could be approximately captured by a recently introduced concept of a perigraphic process. Perigraphic processes are a class of stochastic processes that satisfy a Zipf-law accumulation of a subset of factual knowledge, which is time-independent, compressed, and effectively inferrable from the process. We show that the classes of finite-state processes and of perigraphic processes are disjoint, and we present a new simple example of perigraphic processes over a finite alphabet called Oracle processes. The disjointness result makes use of the Hilberg condition, i.e., the almost sure power-law growth of algorithmic mutual information. Using a strongly consistent estimator of the number of hidden states, we show that finite-state processes do not satisfy the Hilberg condition whereas Oracle processes satisfy the Hilberg condition via the data-processing inequality. We discuss the relevance of these mathematical results for theoretical and computational linguistics.
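As a rough illustration of the Hilberg condition and of why a finite number of hidden states is at odds with it, the following sketch may help. The notation is an assumption introduced here and is not fixed by the abstract: K is prefix-free Kolmogorov complexity, J is algorithmic mutual information, X_{1:n} is the block of the first n symbols, beta is a hypothetical exponent, and the finite-state process is taken to be a hidden Markov process X_i = f(Y_i) whose hidden chain (Y_i) has k states. The last line is only the Shannon-information analogue. According to the abstract, the paper's own argument for finite-state processes relies on a strongly consistent estimator of the number of hidden states.

```latex
% Algorithmic mutual information between strings u and v
% (K = prefix-free Kolmogorov complexity; assumed notation).
\[
  J(u; v) = K(u) + K(v) - K(u, v)
\]

% Hilberg-type condition, stated informally: almost surely, the mutual
% information between adjacent blocks grows roughly like a power law
% with some exponent \beta \in (0, 1) (assumed symbol).
\[
  J(X_{1:n}; X_{n+1:2n}) \gtrsim n^{\beta}
\]

% Shannon analogue for a hidden Markov process X_i = f(Y_i) with k hidden
% states: data processing, then the Markov property of (Y_i), then the
% bound on the entropy of a single hidden state.
\[
  I(X_{1:n}; X_{n+1:2n})
  \le I(Y_{1:n}; Y_{n+1:2n})
  = I(Y_{1:n}; Y_{n+1})
  \le H(Y_{n+1})
  \le \log k
\]
```

Because log k does not depend on n, the Shannon quantity stays bounded and cannot grow as a power law with a positive exponent. This gives the intuition behind the disjointness claim, while the algorithmic statement needs the additional machinery the abstract mentions.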