Forêt Sylvain, Kantorovitz Miriam R, Burden Conrad J
Mathematical Sciences Institute, Australian National University, Canberra, ACT 0200, Australia.
BMC Bioinformatics. 2006 Dec 18;7 Suppl 5(Suppl 5):S21. doi: 10.1186/1471-2105-7-S5-S21.
The number of k-words shared between two sequences is a simple and efficient alignment-free sequence comparison method. This statistic, D2, has been used for the clustering of EST sequences. Sequence comparison based on D2 is extremely fast, its runtime is proportional to the size of the sequences under scrutiny, whereas alignment-based comparisons have a worst-case run time proportional to the square of the size. Recent studies have tackled the rigorous study of the statistical distribution of D2, and asymptotic regimes have been derived. The distribution of approximate k-word matches has also been studied.
We have computed the D2 optimal word size for various sequence lengths, and for both perfect and approximate word matches. Kolmogorov-Smirnov tests show D2 to have a compound Poisson distribution at the optimal word size for small sequence lengths (below 400 letters) and a normal distribution at the optimal word size for large sequence lengths (above 1600 letters). We find that the D2 statistic outperforms BLAST in the comparison of artificially evolved sequences, and performs similarly to other methods based on exact word matches. These results obtained with randomly generated sequences are also valid for sequences derived from human genomic DNA.
We have characterized the distribution of the D2 statistic at optimal word sizes. We find that the best trade-off between computational efficiency and accuracy is obtained with exact word matches. Given that our numerical tests have not included sequence shuffling, transposition or splicing, the improvements over existing methods reported here underestimate that expected in real sequences. Because of the linear run time and of the known normal asymptotic behavior, D2-based methods are most appropriate for large genomic sequences.
两个序列之间共享的k字数量是一种简单有效的无比对序列比较方法。这个统计量D2已被用于EST序列的聚类。基于D2的序列比较速度极快,其运行时间与所研究序列的大小成正比,而基于比对的比较在最坏情况下的运行时间与序列大小的平方成正比。最近的研究对D2的统计分布进行了严格研究,并得出了渐近情况。近似k字匹配的分布也已得到研究。
我们计算了不同序列长度下D2的最优字长,包括完全字匹配和近似字匹配的情况。柯尔莫哥洛夫-斯米尔诺夫检验表明,对于短序列长度(低于400个字母),D2在最优字长下具有复合泊松分布;对于长序列长度(高于1600个字母),D2在最优字长下具有正态分布。我们发现,在人工进化序列的比较中,D2统计量优于BLAST,并且与基于精确字匹配的其他方法表现相似。这些通过随机生成序列获得的结果对于源自人类基因组DNA的序列同样有效。
我们已确定了最优字长下D2统计量的分布特征。我们发现,通过精确字匹配可以在计算效率和准确性之间实现最佳权衡。鉴于我们的数值测试未包括序列重排、转座或剪接,此处报告的相对于现有方法的改进低估了实际序列中的预期改进。由于运行时间呈线性且具有已知的正态渐近行为,基于D2的方法最适合用于大型基因组序列。