A Provably Efficient Algorithm for the k-Mismatch Average Common Substring Problem.

Author information

Thankachan Sharma V, Apostolico Alberto, Aluru Srinivas

Affiliation

College of Computing, Georgia Institute of Technology, Atlanta, Georgia.

Publication information

J Comput Biol. 2016 Jun;23(6):472-82. doi: 10.1089/cmb.2015.0235. Epub 2016 Apr 8.

DOI:10.1089/cmb.2015.0235

PMID:27058840

Abstract

Alignment-free sequence comparison methods are attracting persistent interest, driven by data-intensive applications in genome-wide molecular taxonomy and phylogenetic reconstruction. Among all the methods based on substring composition, the average common substring (ACS) measure admits a straightforward linear-time sequence comparison algorithm, while yielding impressive results in multiple applications. An important direction of this research is to extend the approach to permit a bounded edit/Hamming distance between substrings, so as to reflect the evolutionary process more accurately. To date, however, algorithms designed to incorporate k ≥ 1 mismatches have O(n^2) worst-case time complexity, where n is the total length of the input sequences. On the other hand, accounting for mismatches has been shown to lead to much improved classification, while heuristics can improve practical performance. In this article, we close the gap by presenting the first provably efficient algorithm for the k-mismatch average common substring (ACSk) problem that takes O(n) space and O(n log^k n) time in the worst case for any constant k. Our method extends the generalized suffix tree model to incorporate a carefully selected bounded set of perturbed suffixes, and can be applied to other complex approximate sequence matching problems.
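
To make the quantity under discussion concrete, the following is a minimal brute-force sketch of a k-mismatch ACS computation. It is not the algorithm of the article, which relies on a generalized suffix tree extended with perturbed suffixes. The sketch assumes the commonly used definition, which the abstract itself does not spell out: for each position i of X, take the length of the longest substring of X starting at i that matches some substring of Y with at most k mismatches (Hamming distance), then average over all positions of X. The function names `longest_k_mismatch_prefix` and `acs_k` are hypothetical and chosen only for illustration.

```python
def longest_k_mismatch_prefix(x, i, y, j, k):
    """Length of the longest common prefix of x[i:] and y[j:]
    when at most k mismatching positions are tolerated."""
    length = 0
    mismatches = 0
    while i + length < len(x) and j + length < len(y):
        if x[i + length] != y[j + length]:
            mismatches += 1
            if mismatches > k:
                break  # the (k+1)-th mismatch ends the match
        length += 1
    return length


def acs_k(x, y, k):
    """Brute-force k-mismatch average common substring length of x
    with respect to y (assumed definition, see the lead-in).
    Tries every pair of start positions, so it is at least quadratic."""
    if not x:
        raise ValueError("x must be non-empty")
    total = 0
    for i in range(len(x)):
        # Best match length for position i over all start positions in y.
        best = max(
            (longest_k_mismatch_prefix(x, i, y, j, k) for j in range(len(y))),
            default=0,
        )
        total += best
    return total / len(x)


if __name__ == "__main__":
    print(acs_k("ACGT", "ACTT", 0))  # exact matches only
    print(acs_k("ACGT", "ACTT", 1))  # one mismatch tolerated
```

Because every pair of start positions is examined, this sketch is useful only as a reference for checking faster implementations on small inputs; it illustrates the kind of cost the O(n log^k n) bound in the abstract is meant to avoid.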

Similar articles

A Provably Efficient Algorithm for the k-Mismatch Average Common Substring Problem.

J Comput Biol. 2016 Jun;23(6):472-82. doi: 10.1089/cmb.2015.0235. Epub 2016 Apr 8.

ALFRED: A Practical Method for Alignment-Free Distance Computation.

J Comput Biol. 2016 Jun;23(6):452-60. doi: 10.1089/cmb.2015.0217. Epub 2016 May 3.

An alignment-free heuristic for fast sequence comparisons with applications to phylogeny reconstruction.

BMC Bioinformatics. 2020 Nov 18;21(Suppl 6):404. doi: 10.1186/s12859-020-03738-5.

A greedy alignment-free distance estimator for phylogenetic inference.

BMC Bioinformatics. 2017 Jun 7;18(Suppl 8):238. doi: 10.1186/s12859-017-1658-0.

Kmacs: the k-mismatch average common substring approach to alignment-free sequence comparison.

Bioinformatics. 2014 Jul 15;30(14):2000-8. doi: 10.1093/bioinformatics/btu331. Epub 2014 May 13.

An Ultra-Fast and Parallelizable Algorithm for Finding k-Mismatch Shortest Unique Substrings.

IEEE/ACM Trans Comput Biol Bioinform. 2021 Jan-Feb;18(1):138-148. doi: 10.1109/TCBB.2020.2968531. Epub 2021 Feb 3.

An algorithm for approximate tandem repeats.

J Comput Biol. 2001;8(1):1-18. doi: 10.1089/106652701300099038.

Error Tree: A Tree Structure for Hamming and Edit Distances and Wildcards Matching.

J Comput Biol. 2015 Dec;22(12):1118-28. doi: 10.1089/cmb.2015.0132. Epub 2015 Sep 24.

libFLASM: a software library for fixed-length approximate string matching.

BMC Bioinformatics. 2016 Nov 10;17(1):454. doi: 10.1186/s12859-016-1320-2.

An efficient algorithm for statistical multiple alignment on arbitrary phylogenetic trees.

J Comput Biol. 2003;10(6):869-89. doi: 10.1089/106652703322756122.

Cited by

'Multi-SpaM': a maximum-likelihood approach to phylogeny reconstruction using multiple spaced-word matches and quartet trees.

NAR Genom Bioinform. 2019 Oct 30;2(1):lqz013. doi: 10.1093/nargab/lqz013. eCollection 2020 Mar.

Phylogeny reconstruction based on the length distribution of k-mismatch common substrings.

Algorithms Mol Biol. 2017 Dec 11;12:27. doi: 10.1186/s13015-017-0118-8. eCollection 2017.

A greedy alignment-free distance estimator for phylogenetic inference.

BMC Bioinformatics. 2017 Jun 7;18(Suppl 8):238. doi: 10.1186/s12859-017-1658-0.

rasbhari: Optimizing Spaced Seeds for Database Searching, Read Mapping and Alignment-Free Sequence Comparison.

PLoS Comput Biol. 2016 Oct 19;12(10):e1005107. doi: 10.1371/journal.pcbi.1005107. eCollection 2016 Oct.

Phylogeny Reconstruction with Alignment-Free Method That Corrects for Horizontal Gene Transfer.

PLoS Comput Biol. 2016 Jun 23;12(6):e1004985. doi: 10.1371/journal.pcbi.1004985. eCollection 2016 Jun.
