Multiple Sequence Alignment Based on a Suffix Tree and Center-Star Strategy: A Linear Method for Multiple Nucleotide Sequence Alignment on Spark Parallel Framework.

Author information

Su Wenhe, Liao Xiangke, Lu Yutong, Zou Quan, Peng Shaoliang

Affiliations

1 School of Computer Science and Technology, National University of Defense Technology, Changsha, China.

2 School of Computer Science and Technology, Tianjin University, Tianjin, China.

Publication information

J Comput Biol. 2017 Dec;24(12):1230-1242. doi: 10.1089/cmb.2017.0040. Epub 2017 Nov 8.

DOI:10.1089/cmb.2017.0040

PMID:29116822

Abstract

Multiple sequence alignment (MSA) is an essential prerequisite and a dominant method for deducing biological facts from a set of molecular sequences. The term covers a family of algorithms that align evolutionarily related sequences while accounting for evolutionary events such as mutations, insertions, deletions, and rearrangements under certain conditions. These methods can be applied to DNA, RNA, or protein sequences. In this work, we use a center-star strategy to reduce the MSA problem to a set of pairwise alignments, and we use a suffix tree to find identical substrings shared by the two sequences in each pair. Multiple sequence alignment based on a suffix tree and center-star strategy (MASC) completes MSA in O(mn) time, which is linear, where m is the number of sequences and n is their average length. Furthermore, we run the method on the Spark distributed parallel framework to handle ever-growing data sets. We demonstrate experimentally that, for highly similar nucleotide sequences such as homologous sequences, the method is significantly faster than previous techniques with no loss in accuracy. Compared with mainstream MSA tools (e.g., MAFFT), MASC can align 67,200 sequences, each longer than 10,000 bp, in 9 minutes, a task that takes MAFFT more than 3.5 days.
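The center-star reduction described in the abstract can be illustrated with a minimal Python sketch. This is not the MASC implementation: it assumes the first sequence is the center, and it swaps the suffix-tree matching for a plain quadratic Needleman-Wunsch pairwise alignment, so it does not reach the O(mn) bound and does not use Spark. The function names (pairwise_align, gap_profile, expand, center_star) and the scoring values are hypothetical, chosen only for illustration.

```python
def pairwise_align(a, b, match=1, mismatch=-1, gap=-1):
    """Global pairwise alignment (Needleman-Wunsch); returns two gapped strings.
    Stand-in for the paper's suffix-tree based pairwise step (assumption)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def gap_profile(gapped_center, n):
    """Count gap columns before each center position (index n = after the end)."""
    profile = [0] * (n + 1)
    k = 0
    for ch in gapped_center:
        if ch == "-":
            profile[k] += 1
        else:
            k += 1
    return profile


def expand(gapped_center, gapped_other, merged):
    """Pad one pairwise alignment so its gap blocks match the merged profile."""
    out = []
    k = 0    # current position in the ungapped center
    run = 0  # gap columns already seen before center position k
    for c, o in zip(gapped_center, gapped_other):
        if c == "-":
            out.append(o)
            run += 1
        else:
            out.append("-" * (merged[k] - run))
            out.append(o)
            k += 1
            run = 0
    out.append("-" * (merged[k] - run))
    return "".join(out)


def center_star(seqs):
    """Align every sequence to the center, then merge the pairwise alignments."""
    center, others = seqs[0], seqs[1:]  # assumption: first sequence is the center
    pairs = [pairwise_align(center, s) for s in others]
    merged = [0] * (len(center) + 1)
    for gapped_center, _ in pairs:
        for k, g in enumerate(gap_profile(gapped_center, len(center))):
            merged[k] = max(merged[k], g)
    rows = [expand(center, center, merged)]
    rows += [expand(gc, go, merged) for gc, go in pairs]
    return rows


for row in center_star(["ACGTACGT", "ACGACGT", "ACGTTACGT", "ACGTACG"]):
    print(row)
```

Because every sequence is aligned only against the center, the pairwise step needs m - 1 alignments rather than one per pair of sequences, and each of those alignments is independent of the others, which is what makes the step a natural fit for a data-parallel framework.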

Similar articles

Multiple Sequence Alignment Based on a Suffix Tree and Center-Star Strategy: A Linear Method for Multiple Nucleotide Sequence Alignment on Spark Parallel Framework.

J Comput Biol. 2017 Dec;24(12):1230-1242. doi: 10.1089/cmb.2017.0040. Epub 2017 Nov 8.

CMSA: a heterogeneous CPU/GPU computing system for multiple similar RNA/DNA sequence alignment.

BMC Bioinformatics. 2017 Jun 24;18(1):315. doi: 10.1186/s12859-017-1725-6.

SPARK-MSNA: Efficient algorithm on Apache Spark for aligning multiple similar DNA/RNA sequences with supervised learning.

Sci Rep. 2019 Apr 29;9(1):6631. doi: 10.1038/s41598-019-42966-5.

A polynomial time solvable formulation of multiple sequence alignment.

J Comput Biol. 2006 Mar;13(2):309-19. doi: 10.1089/cmb.2006.13.309.

HAlign: Fast multiple similar DNA/RNA sequence alignment based on the centre star strategy.

Bioinformatics. 2015 Aug 1;31(15):2475-81. doi: 10.1093/bioinformatics/btv177. Epub 2015 Mar 25.

Mind the gaps: evidence of bias in estimates of multiple sequence alignments.

Mol Biol Evol. 2007 Nov;24(11):2433-42. doi: 10.1093/molbev/msm176. Epub 2007 Aug 20.

A Parallel Multiobjective Metaheuristic for Multiple Sequence Alignment.

J Comput Biol. 2018 Sep;25(9):1009-1022. doi: 10.1089/cmb.2018.0031. Epub 2018 Apr 19.

Multiple alignment of DNA sequences with MAFFT.

Methods Mol Biol. 2009;537:39-64. doi: 10.1007/978-1-59745-251-9_3.

PhyPA: Phylogenetic method with pairwise sequence alignment outperforms likelihood methods in phylogenetics involving highly diverged sequences.

Mol Phylogenet Evol. 2016 Sep;102:331-43. doi: 10.1016/j.ympev.2016.07.001. Epub 2016 Jul 1.

SaAlign: Multiple DNA/RNA sequence alignment and phylogenetic tree construction tool for ultra-large datasets and ultra-long sequences based on suffix array.

Comput Struct Biotechnol J. 2022 Mar 21;20:1487-1493. doi: 10.1016/j.csbj.2022.03.018. eCollection 2022.

Cited by

HAlign 3: Fast Multiple Alignment of Ultra-Large Numbers of Similar DNA/RNA Sequences.

Mol Biol Evol. 2022 Aug 3;39(8). doi: 10.1093/molbev/msac166.

Developments in Algorithms for Sequence Alignment: A Review.

Biomolecules. 2022 Apr 6;12(4):546. doi: 10.3390/biom12040546.

SaAlign: Multiple DNA/RNA sequence alignment and phylogenetic tree construction tool for ultra-large datasets and ultra-long sequences based on suffix array.

Comput Struct Biotechnol J. 2022 Mar 21;20:1487-1493. doi: 10.1016/j.csbj.2022.03.018. eCollection 2022.

SPARK-MSNA: Efficient algorithm on Apache Spark for aligning multiple similar DNA/RNA sequences with supervised learning.

Sci Rep. 2019 Apr 29;9(1):6631. doi: 10.1038/s41598-019-42966-5.

Twenty years of bioinformatics research for protease-specific substrate and cleavage site prediction: a comprehensive revisit and benchmarking of existing methods.

Brief Bioinform. 2019 Nov 27;20(6):2150-2166. doi: 10.1093/bib/bby077.

Identification of Inhibitors of MMPS Enzymes via a Novel Computational Approach.

Int J Biol Sci. 2018 May 22;14(8):863-871. doi: 10.7150/ijbs.24588. eCollection 2018.
