Reinert Gesine, Chew David, Sun Fengzhu, Waterman Michael S
Department of Statistics, University of Oxford, Oxford OX1 3TG, UK.
J Comput Biol. 2009 Dec;16(12):1615-34. doi: 10.1089/cmb.2009.0198.
Large-scale comparison of the similarities between two biological sequences is a major issue in computational biology; a fast method, the D(2) statistic, relies on the comparison of the k-tuple content for both sequences. Although it has been known for some years that the D(2) statistic is not suitable for this task, as it tends to be dominated by single-sequence noise, to date no suitable adjustments have been proposed. In this article, we suggest two new variants of the D(2) word count statistic, which we call D(2)(S) and D(2)(*). For D(2)(S), which is a self-standardized statistic, we show that the statistic is asymptotically normally distributed, when sequence lengths tend to infinity, and not dominated by the noise in the individual sequences. The second statistic, D(2)(*), outperforms D(2)(S) in terms of power for detecting the relatedness between the two sequences in our examples; but although it is straightforward to simulate from the asymptotic distribution of D(2)(*), we cannot provide a closed form for power calculations.
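The abstract does not state the formulas, so the following is only a minimal Python sketch of how the three word count statistics are commonly written in the alignment-free literature, not the authors' implementation. It assumes a DNA alphabet, an i.i.d. letter model for the word probabilities p_w, letter frequencies estimated from the two sequences pooled together, and centred counts of the form X_w - (n - k + 1) p_w. The function names `kmer_counts`, `letter_freqs`, and `d2_family` are hypothetical.

```python
from collections import Counter
from itertools import product
from math import sqrt

ALPHABET = "ACGT"  # assumption: DNA sequences


def kmer_counts(seq, k):
    """Count overlapping k-tuples (words) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def letter_freqs(*seqs):
    """Estimate letter probabilities from the pooled sequences (an assumption)."""
    pooled = Counter("".join(seqs))
    total = sum(pooled[a] for a in ALPHABET)
    return {a: pooled[a] / total for a in ALPHABET}


def d2_family(seq_a, seq_b, k, freqs=None):
    """Return (D2, D2S, D2*) for two sequences under an i.i.d. letter model."""
    if freqs is None:
        freqs = letter_freqs(seq_a, seq_b)
    x, y = kmer_counts(seq_a, k), kmer_counts(seq_b, k)
    n_bar = len(seq_a) - k + 1  # number of word positions in seq_a
    m_bar = len(seq_b) - k + 1  # number of word positions in seq_b
    d2 = d2s = d2star = 0.0
    for letters in product(ALPHABET, repeat=k):
        w = "".join(letters)
        p_w = 1.0
        for a in w:
            p_w *= freqs[a]
        # D2: plain inner product of the two word count vectors.
        d2 += x[w] * y[w]
        if p_w == 0:
            continue  # word impossible under the model; skip centred terms
        # Centre each count by its expectation under the i.i.d. model.
        xc = x[w] - n_bar * p_w
        yc = y[w] - m_bar * p_w
        # D2S: each term is self-standardized by the centred counts.
        denom = sqrt(xc * xc + yc * yc)
        if denom > 0:
            d2s += xc * yc / denom
        # D2*: each term is standardized by the expected word count.
        d2star += xc * yc / (sqrt(n_bar * m_bar) * p_w)
    return d2, d2s, d2star


if __name__ == "__main__":
    print(d2_family("ACGTTGCAACGT", "ACGTACGTTTGA", k=2))
```

Centring the counts before multiplying is what removes the single-sequence noise that the abstract says dominates the raw D(2) statistic. The loop visits all 4^k words, which is fine for small k but would need a sparse formulation for longer words.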