Edgar Robert C
Nucleic Acids Res. 2004 Jan 16;32(1):380-5. doi: 10.1093/nar/gkh180. Print 2004.
Methods for discovery of local similarities and estimation of evolutionary distance by identifying k-mers (contiguous subsequences of length k) common to two sequences are described. Given unaligned sequences of length L, these methods have O(L) time complexity. The ability of compressed amino acid alphabets to extend these techniques to distantly related proteins was investigated. The performance of these algorithms was evaluated for different alphabets and choices of k using a test set of 1848 pairs of structurally alignable sequences selected from the FSSP database. Distance measures derived from k-mer counting were found to correlate well with percentage identity derived from sequence alignments. Compressed alphabets were seen to improve performance in local similarity discovery, but no evidence was found of improvements when applied to distance estimates. The performance of our local similarity discovery method was compared with the fast Fourier transform (FFT) used in MAFFT, which has O(L log L) time complexity. The method for achieving comparable coverage to FFT is revealed here, and is more than an order of magnitude faster. We suggest using k-mer distance for fast, approximate phylogenetic tree construction, and show that a speed improvement of more than three orders of magnitude can be achieved relative to standard distance methods, which require alignments.
描述了通过识别两个序列共有的k-mer(长度为k的连续子序列)来发现局部相似性和估计进化距离的方法。对于长度为L的未比对序列,这些方法具有O(L)的时间复杂度。研究了压缩氨基酸字母表将这些技术扩展到远缘相关蛋白质的能力。使用从FSSP数据库中选择的1848对结构可比对序列的测试集,针对不同的字母表和k的选择评估了这些算法的性能。发现从k-mer计数得出的距离度量与从序列比对得出的百分比同一性密切相关。压缩字母表在局部相似性发现中提高了性能,但在应用于距离估计时未发现性能提升的证据。将我们的局部相似性发现方法的性能与MAFFT中使用的快速傅里叶变换(FFT)进行了比较,后者具有O(L log L)的时间复杂度。这里揭示了实现与FFT相当覆盖范围的方法,并且速度快了一个多数量级。我们建议使用k-mer距离进行快速、近似的系统发育树构建,并表明相对于需要比对的标准距离方法,可以实现超过三个数量级的速度提升。