Thankachan Sharma V, Apostolico Alberto, Aluru Srinivas
College of Computing, Georgia Institute of Technology , Atlanta, Georgia .
J Comput Biol. 2016 Jun;23(6):472-82. doi: 10.1089/cmb.2015.0235. Epub 2016 Apr 8.
Alignment-free sequence comparison methods are attracting persistent interest, driven by data-intensive applications in genome-wide molecular taxonomy and phylogenetic reconstruction. Among all the methods based on substring composition, the average common substring (ACS) measure admits a straightforward linear time sequence comparison algorithm, while yielding impressive results in multiple applications. An important direction of this research is to extend the approach to permit a bounded edit/hamming distance between substrings, so as to reflect more accurately the evolutionary process. To date, however, algorithms designed to incorporate k ≥ 1 mismatches have O(n(2)) worst-case time complexity, where n is the total length of the input sequences. On the other hand, accounting for mismatches has shown to lead to much improved classification, while heuristics can improve practical performance. In this article, we close the gap by presenting the first provably efficient algorithm for the k-mismatch average common string (ACSk) problem that takes O(n) space and O(n log(k) n) time in the worst case for any constant k. Our method extends the generalized suffix tree model to incorporate a carefully selected bounded set of perturbed suffixes, and can be applied to other complex approximate sequence matching problems.
无比对序列比较方法一直备受关注,这是由全基因组分子分类学和系统发育重建中数据密集型应用所推动的。在所有基于子串组成的方法中,平均公共子串(ACS)度量允许一种直接的线性时间序列比较算法,同时在多个应用中产生令人印象深刻的结果。该研究的一个重要方向是扩展该方法,以允许子串之间存在有界编辑/汉明距离,从而更准确地反映进化过程。然而,迄今为止,设计用于纳入k≥1个错配的算法具有O(n(2))的最坏情况时间复杂度,其中n是输入序列的总长度。另一方面,考虑错配已被证明能显著改善分类,而启发式方法可以提高实际性能。在本文中,我们通过提出第一个针对k错配平均公共字符串(ACSk)问题的可证明高效算法来缩小差距,该算法在最坏情况下对于任何常数k都占用O(n)空间且时间复杂度为O(n log(k) n)。我们的方法扩展了广义后缀树模型,纳入了精心选择的有界扰动后缀集,并且可以应用于其他复杂的近似序列匹配问题。