An alignment-free heuristic for fast sequence comparisons with applications to phylogeny reconstruction.

Affiliations

Institute for Data Engineering and Science, Georgia Institute of Technology, 756 W Peachtree Street NW, Atlanta, USA.

Department of Computer Science, University of Central Florida, 4000 Central Florida Blvd, Orlando, USA.

Publication information

BMC Bioinformatics. 2020 Nov 18;21(Suppl 6):404. doi: 10.1186/s12859-020-03738-5.

DOI:10.1186/s12859-020-03738-5

PMID:33203364

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7672814/

Abstract

BACKGROUND

Alignment-free methods for sequence comparison have become popular in many bioinformatics applications, particularly for estimating sequence similarity measures used to construct phylogenetic trees. Recently, the average common substring measure, ACS, and its k-mismatch counterpart, ACS_k, have been shown to produce results as effective as multiple-sequence-alignment-based methods for reconstructing phylogenetic trees. Since computing ACS_k exactly takes O(n log^k n) time, which is impractical for large datasets, multiple heuristics that approximate ACS_k have been introduced.
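To make the k-mismatch average common substring measure concrete, the following is a minimal brute-force sketch in Rust. It assumes the usual definition: for every start position i in X, take the length of the longest substring starting at i that matches some substring of Y with at most k mismatches, then average over all i. The function names `extend_with_mismatches` and `acs_k` are hypothetical, and this quadratic-or-worse reference is neither the O(n log^k n) exact algorithm nor the heuristic presented in the paper.

```rust
/// Length of the longest prefix of x[i..] that matches y[j..]
/// with at most k mismatching positions.
fn extend_with_mismatches(x: &[u8], y: &[u8], i: usize, j: usize, k: usize) -> usize {
    let mut mismatches = 0;
    let mut len = 0;
    while i + len < x.len() && j + len < y.len() {
        if x[i + len] != y[j + len] {
            if mismatches == k {
                break; // a (k+1)-th mismatch would be needed
            }
            mismatches += 1;
        }
        len += 1;
    }
    len
}

/// Brute-force k-mismatch average common substring of x with respect to y.
/// Assumption: plain average over all start positions of x; k = 0 gives
/// the exact-match measure.
fn acs_k(x: &[u8], y: &[u8], k: usize) -> f64 {
    if x.is_empty() {
        return 0.0;
    }
    let total: usize = (0..x.len())
        .map(|i| {
            // Best k-mismatch match length over every start position in y.
            (0..y.len())
                .map(|j| extend_with_mismatches(x, y, i, j, k))
                .max()
                .unwrap_or(0)
        })
        .sum();
    total as f64 / x.len() as f64
}

fn main() {
    // Toy sequences chosen only for illustration.
    let (x, y) = (b"ACGTACGT", b"ACGAACGT");
    for k in 0..=2 {
        println!("k = {}: ACS_k = {:.3}", k, acs_k(x, y, k));
    }
}
```

Because every pair of start positions is examined, the cost grows at least quadratically with sequence length, which illustrates why exact computation on genome-scale inputs motivates faster approximations.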

RESULTS

In this paper, we present a novel linear-time heuristic to approximate ACS_k, which is faster than computing the exact ACS_k while being closer to the exact ACS_k values than previously published linear-time greedy heuristics. Using four real datasets containing both DNA and protein sequences, we evaluate our algorithm in terms of accuracy and runtime, and demonstrate its applicability to phylogeny reconstruction. Our algorithm provides better accuracy than previously published heuristic methods while performing comparably in phylogeny reconstruction.
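The gap between a greedy heuristic and the exact k-mismatch value can be illustrated with a small Rust sketch. It assumes one common greedy scheme: for each start position in X, anchor on the longest exact match in Y (first occurrence on ties), then extend only that anchor with up to k mismatches. This is an illustrative stand-in for the earlier greedy approaches, not the heuristic introduced in this paper, and all function names are hypothetical.

```rust
/// Longest prefix of x[i..] matching y[j..] with at most k mismatches.
fn extend_with_mismatches(x: &[u8], y: &[u8], i: usize, j: usize, k: usize) -> usize {
    let (mut mismatches, mut len) = (0, 0);
    while i + len < x.len() && j + len < y.len() {
        if x[i + len] != y[j + len] {
            if mismatches == k {
                break;
            }
            mismatches += 1;
        }
        len += 1;
    }
    len
}

/// Exact value at position i: best k-mismatch match over all starts in y.
fn exact_len(x: &[u8], y: &[u8], i: usize, k: usize) -> usize {
    (0..y.len())
        .map(|j| extend_with_mismatches(x, y, i, j, k))
        .max()
        .unwrap_or(0)
}

/// Greedy value at position i (assumed scheme): pick the start in y with the
/// longest exact match, then extend only that anchor with k mismatches.
fn greedy_len(x: &[u8], y: &[u8], i: usize, k: usize) -> usize {
    let mut anchor: Option<(usize, usize)> = None; // (start in y, exact length)
    for j in 0..y.len() {
        let l = extend_with_mismatches(x, y, i, j, 0);
        if anchor.map_or(true, |(_, best)| l > best) {
            anchor = Some((j, l));
        }
    }
    match anchor {
        Some((j, _)) => extend_with_mismatches(x, y, i, j, k),
        None => 0,
    }
}

/// Average of a per-position length function over all start positions of x.
fn average(x: &[u8], y: &[u8], k: usize, per_pos: fn(&[u8], &[u8], usize, usize) -> usize) -> f64 {
    if x.is_empty() {
        return 0.0;
    }
    let total: usize = (0..x.len()).map(|i| per_pos(x, y, i, k)).sum();
    total as f64 / x.len() as f64
}

fn main() {
    // Toy input: the longest exact match of x in y is a poor anchor for k = 1.
    let (x, y) = (b"ACGTTTT", b"ACGAAAAAGGTTTT");
    for k in 0..=2 {
        println!(
            "k = {}: greedy = {:.3}, exact = {:.3}",
            k,
            average(x, y, k, greedy_len),
            average(x, y, k, exact_len)
        );
    }
}
```

Since the greedy anchor is only one of the candidates the exact search considers, the greedy value can never exceed the exact one; on this toy input the first position of x yields a greedy length of 4 against an exact length of 7 for k = 1.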

CONCLUSIONS

Our method produces a better approximation of ACS_k and is applicable to the alignment-free comparison of biological sequences at highly competitive speed. The algorithm is implemented in the Rust programming language, and the source code is available at https://github.com/srirampc/adyar-rs.

Figure 1 (image): https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3158/7672814/7b609454fbee/12859_2020_3738_Fig1_HTML.jpg
