Mott R F, Kirkwood T B, Curnow R N
Laboratory of Mathematical Biology, National Institute for Medical Research, Mill Hill, London, U.K.
Bull Math Biol. 1990;52(6):773-84. doi: 10.1007/BF02460808.
An accurate approximation is derived to the distribution of the length of the longest matching word present between two random DNA sequences of finite length, using only elementary probability arguments. The distribution is shown to be consistent with previous asymptotic results for the mean and variance of longest common words. The application of the distribution to assessing the statistical significance of sequence similarities is considered. It is shown how the distribution can be modified to take account of non-independence of neighbouring bases in real sequences.
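As a rough illustration of the quantity the abstract describes, the following Python sketch estimates the distribution of the longest matching word by Monte Carlo simulation. It is not the approximation derived in the paper. It assumes independent, equiprobable bases (so the per-position match probability is p = 1/4), and the function names `longest_common_word`, `random_dna`, `simulate`, `first_order_centre` and `empirical_p_value` are hypothetical names chosen for this example. The value log_{1/p}(mn) is the commonly quoted first-order asymptotic location for exact matches between sequences of lengths m and n; it is included only as a reference point for the simulated values.

```python
import math
import random


def longest_common_word(a: str, b: str) -> int:
    """Length of the longest contiguous word that occurs in both a and b."""
    best = 0
    prev = [0] * (len(b) + 1)  # match-run lengths ending at the previous base of a
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the run that ended one base earlier in both sequences.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best:
                    best = curr[j]
        prev = curr
    return best


def random_dna(n: int, rng: random.Random) -> str:
    """Sequence of n independent, equiprobable bases (an assumption of this sketch)."""
    return "".join(rng.choice("ACGT") for _ in range(n))


def simulate(m: int, n: int, trials: int, seed: int = 0) -> list:
    """Longest-common-word lengths for `trials` independent random sequence pairs."""
    rng = random.Random(seed)
    return [
        longest_common_word(random_dna(m, rng), random_dna(n, rng))
        for _ in range(trials)
    ]


def first_order_centre(m: int, n: int, p: float = 0.25) -> float:
    """First-order asymptotic location log_{1/p}(m * n); not the paper's formula."""
    return math.log(m * n) / math.log(1.0 / p)


def empirical_p_value(observed: int, samples: list) -> float:
    """Fraction of simulated lengths that are at least as long as the observed one."""
    return sum(1 for s in samples if s >= observed) / len(samples)


if __name__ == "__main__":
    lengths = simulate(200, 200, trials=200)
    print("simulated mean:", sum(lengths) / len(lengths))
    print("log_{1/p}(mn):", first_order_centre(200, 200))
    print("empirical P(L >= 10):", empirical_p_value(10, lengths))
```

An empirical tail probability of this kind is one simple way to judge whether an observed match is unusually long under the independence assumption. Real sequences with correlated neighbouring bases would need a different null model, which is the modification the abstract refers to.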