Wan Lin, Reinert Gesine, Sun Fengzhu, Waterman Michael S
Molecular and Computational Biology, University of Southern California, Los Angeles, California 90089-2910, USA.
J Comput Biol. 2010 Nov;17(11):1467-90. doi: 10.1089/cmb.2010.0056. Epub 2010 Oct 25.
Rapid methods for alignment-free sequence comparison make large-scale comparisons between sequences increasingly feasible. Here we study the power of the statistic D2, which counts the number of matching k-tuples between two sequences, as well as D2*, which uses centralized counts, and D2S, which is a self-standardized version, both from a theoretical viewpoint and numerically, and we provide an easy-to-use program. The power is assessed under two alternative hidden Markov models: the first assumes that the two sequences share a common motif, whereas the second is a pattern transfer model. The null model is that the two sequences are composed of independent and identically distributed letters and are independent of each other. Under the first alternative model, the means of the tuple counts in the individual sequences change, whereas under the second alternative model, the marginal means are the same as under the null model. Using the limit distributions of the count statistics under the null and the alternative models, we find that generally, asymptotically D2S has the largest power, followed by D2*, whereas the power of D2 can even be zero in some cases. In contrast, even for sequences of length 140,000 bp, in simulations D2* generally has the largest power. Under the first alternative model of a shared motif, the power of D2* approaches 100% when sufficiently many motifs are shared, and we recommend the use of D2* for such practical applications. Under the second alternative model of pattern transfer, the power of all three count statistics does not increase with sequence length when the sequence is sufficiently long, and hence none of the three statistics under consideration can be recommended in such a situation. We illustrate the approach on 323 transcription factor binding motifs with length at most 10 from JASPAR CORE (October 12, 2009 version), verifying that D2* is generally more powerful than D2.
The program to calculate the power of D2, D2* and D2S can be downloaded from http://meta.cmb.usc.edu/d2. Supplementary Material is available at www.liebertonline.com/cmb.
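As an illustration of the three count statistics named in the abstract, the following is a minimal Python sketch. It is not the authors' program. It assumes the commonly used forms D2 = Σ X_w Y_w, D2* = Σ X̃_w Ỹ_w / (√(n̄ m̄) p_w), and D2S = Σ X̃_w Ỹ_w / √(X̃_w² + Ỹ_w²), where X_w and Y_w are the counts of the k-tuple w in the two sequences, n̄ and m̄ are the numbers of k-tuples in each sequence, p_w is the probability of w under the i.i.d. null model, and X̃_w = X_w − n̄ p_w (likewise Ỹ_w). The function names, the DNA alphabet, and the default uniform letter probabilities are assumptions made for the example.

```python
from collections import Counter
from itertools import product
from math import prod, sqrt


def kmer_counts(seq, k):
    """Count every overlapping k-tuple in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2_statistics(a, b, k, letter_probs=None):
    """Return (D2, D2*, D2S) for sequences a and b.

    letter_probs maps each letter to its probability under the i.i.d.
    null model; a uniform DNA distribution is assumed if it is omitted.
    k-tuples containing letters outside the alphabet are ignored.
    """
    alphabet = "ACGT"  # assumption: DNA sequences
    if letter_probs is None:
        letter_probs = {c: 0.25 for c in alphabet}

    x, y = kmer_counts(a, k), kmer_counts(b, k)
    n_bar, m_bar = len(a) - k + 1, len(b) - k + 1  # number of k-tuples

    d2 = d2_star = d2_s = 0.0
    for letters in product(alphabet, repeat=k):
        w = "".join(letters)
        p_w = prod(letter_probs[c] for c in w)  # word probability under the null
        xw, yw = x[w], y[w]
        xc = xw - n_bar * p_w  # centralized count in a
        yc = yw - m_bar * p_w  # centralized count in b

        d2 += xw * yw
        d2_star += xc * yc / (sqrt(n_bar * m_bar) * p_w)
        norm = sqrt(xc * xc + yc * yc)
        if norm > 0:  # skip words where both centralized counts are zero
            d2_s += xc * yc / norm
    return d2, d2_star, d2_s


if __name__ == "__main__":
    print(d2_statistics("ACGTACGT", "ACGTTTTT", k=2))
```

The loop enumerates all 4^k possible k-tuples, which is only practical for small k; for larger k, one would iterate over observed words and account for the unobserved ones separately.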