Suppr超能文献

基于后缀树和中心星策略的多序列比对:一种在Spark并行框架上进行多核苷酸序列比对的线性方法。

Multiple Sequence Alignment Based on a Suffix Tree and Center-Star Strategy: A Linear Method for Multiple Nucleotide Sequence Alignment on Spark Parallel Framework.

作者信息

Su Wenhe, Liao Xiangke, Lu Yutong, Zou Quan, Peng Shaoliang

机构信息

1 School of Computer Science and Technology, National University of Defense Technology , Changsha, China .

2 School of Computer Science and Technology, Tianjin University , Tianjin, China .

出版信息

J Comput Biol. 2017 Dec;24(12):1230-1242. doi: 10.1089/cmb.2017.0040. Epub 2017 Nov 8.

Abstract

Multiple sequence alignment (MSA) is an essential prerequisite and dominant method to deduce the biological facts from a set of molecular biological sequences. It refers to a series of algorithmic solutions for the alignment of evolutionarily related sequences while taking into account evolutionary events such as mutations, insertions, deletions, and rearrangements under certain conditions. These methods can be applied to DNA, RNA, or protein sequences. In this work, we take advantage of a center-star strategy to reduce the MSA problem to pairwise alignments, and we use a suffix tree to match identical substrings between two pairwise sequences. Multiple sequence alignment based on a suffix tree and center-star strategy (MASC) can accomplish MSA in O(mn), which is linear time complexity, where m is the number of sequences and n is the average length of sequences. Furthermore, we execute our method on the Spark-distributed parallel framework to deal with ever-increasing massive data sets. Our method is significantly faster than previous techniques, with no loss in accuracy for highly similar nucleotide sequences like homologous sequences, which we experimentally demonstrate. Comparing with mainstream MSA tools (e.g., MAFFT), MASC could finish the alignment of 67,200 sequences, longer than 10,000 bps, in 9 minutes, which takes MAFFT >3.5 days.

摘要

多序列比对(MSA)是从一组分子生物学序列中推断生物学事实的必要前提和主要方法。它指的是一系列用于比对进化相关序列的算法解决方案,同时考虑在特定条件下的进化事件,如突变、插入、缺失和重排。这些方法可应用于DNA、RNA或蛋白质序列。在这项工作中,我们利用中心星策略将多序列比对问题简化为两两比对,并使用后缀树来匹配两个两两序列之间的相同子串。基于后缀树和中心星策略的多序列比对(MASC)可以在O(mn)时间内完成多序列比对,这是线性时间复杂度,其中m是序列数量,n是序列的平均长度。此外,我们在Spark分布式并行框架上执行我们的方法来处理不断增加的海量数据集。我们的方法比以前的技术显著更快,对于高度相似的核苷酸序列(如同源序列),在准确性上没有损失,我们通过实验证明了这一点。与主流的多序列比对工具(如MAFFT)相比,MASC可以在9分钟内完成67200条长度超过10000 bp的序列的比对,而MAFFT则需要超过3.5天。

文献AI研究员

20分钟写一篇综述,助力文献阅读效率提升50倍。

立即体验

用中文搜PubMed

大模型驱动的PubMed中文搜索引擎

马上搜索

文档翻译

学术文献翻译模型,支持多种主流文档格式。

立即体验