Schbath S
Institut National de la Recherche Agronomique, Unité de Biométrie, Jouy-en-Josas, France.
J Comput Biol. 2000 Feb-Apr;7(1-2):193-201. doi: 10.1089/10665270050081469.
In this paper, we give an overview of existing results on the statistical distribution of word counts in a Markovian sequence of letters. Results on the number of overlapping occurrences, the number of renewals, and the number of clumps are presented. Counts of single words and of multiple words are both considered. Most of the results are approximations that hold as the length of the sequence tends to infinity. We will see that Gaussian approximations give way to (compound) Poisson approximations for rare words. When DNA sequences or proteins are modeled by stationary Markov chains, these results can be used to study the statistical frequency of motifs in a given sequence.
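The three counts named in the abstract can be illustrated with a minimal sketch. It is not taken from the paper. It assumes the usual definitions: every occurrence of the word is counted, overlaps included; a renewal is an occurrence that does not overlap an earlier renewal when scanning left to right; a clump is a maximal run of mutually overlapping occurrences. The function names and the example sequence are hypothetical.

```python
def occurrence_positions(seq, word):
    """Start indices of all occurrences of word in seq, overlaps included."""
    k = len(word)
    return [i for i in range(len(seq) - k + 1) if seq[i:i + k] == word]


def count_renewals(positions, k):
    """Count occurrences that do not overlap a previously counted renewal."""
    count = 0
    next_free = 0  # first index at which a new renewal may start
    for p in positions:
        if p >= next_free:
            count += 1
            next_free = p + k
    return count


def count_clumps(positions, k):
    """Count maximal runs of occurrences in which neighbours overlap."""
    count = 0
    prev = None
    for p in positions:
        # A new clump starts when this occurrence does not overlap the last one.
        if prev is None or p - prev >= k:
            count += 1
        prev = p
    return count


seq = "CATATATAGATA"   # illustrative sequence, not data from the paper
word = "ATA"
pos = occurrence_positions(seq, word)

print("overlapping occurrences:", len(pos))
print("renewals:", count_renewals(pos, len(word)))
print("clumps:", count_clumps(pos, len(word)))
```

For a word that can overlap itself, such as ATA, the three counts differ; for a word with no self-overlap they coincide.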