Ahmadi Lida, Ward Mark Daniel
Department of Mathematics, Purdue University, West Lafayette, IN 47907, USA.
Department of Statistics, Purdue University, West Lafayette, IN 47907, USA.
Entropy (Basel). 2020 Feb 12;22(2):207. doi: 10.3390/e22020207.
Patterns within strings enable us to extract vital information regarding a string's randomness. Understanding whether a string is random (Showing no to little repetition in patterns) or periodic (showing repetitions in patterns) are described by a value that is called the th Subword Complexity of the character string. By definition, the th Subword Complexity is the number of distinct substrings of length that appear in a given string. In this paper, we evaluate the expected value and the second factorial moment (followed by a corollary on the second moment) of the th Subword Complexity for the binary strings over memory-less sources. We first take a combinatorial approach to derive a probability generating function for the number of occurrences of patterns in strings of finite length. This enables us to have an exact expression for the two moments in terms of patterns' auto-correlation and correlation polynomials. We then investigate the asymptotic behavior for values of k = Θ ( log n ) . In the proof, we compare the distribution of the th Subword Complexity of binary strings to the distribution of distinct prefixes of independent strings stored in a trie. The methodology that we use involves complex analysis, analytical poissonization and depoissonization, the Mellin transform, and saddle point analysis.
字符串中的模式使我们能够提取有关字符串随机性的重要信息。通过一个称为字符串的第(k)子词复杂度的值来描述一个字符串是随机的(模式中几乎没有重复)还是周期性的(模式中有重复)。根据定义,第(k)子词复杂度是给定字符串中出现的长度为(k)的不同子串的数量。在本文中,我们评估了无记忆源上二进制字符串的第(k)子词复杂度的期望值和二阶阶乘矩(随后是关于二阶矩的一个推论)。我们首先采用组合方法来推导有限长度字符串中模式出现次数的概率生成函数。这使我们能够根据模式的自相关和相关多项式得到这两个矩的精确表达式。然后我们研究(k = Θ ( \log n ))时的渐近行为。在证明过程中,我们将二进制字符串的第(k)子词复杂度的分布与存储在字典树中的独立字符串的不同前缀的分布进行比较。我们使用的方法涉及复分析、解析泊松化和去泊松化、梅林变换以及鞍点分析。