HAlign-II: efficient ultra-large multiple sequence alignment and phylogenetic tree reconstruction with distributed and parallel computing.

Author information

Wan Shixiang, Zou Quan

Affiliations

School of Computer Science and Technology, Tianjin University, Tianjin, China.

Guangdong Province Key Laboratory of Popular High Performance Computers, Shenzhen University, Shenzhen, China.

Publication information

Algorithms Mol Biol. 2017 Sep 29;12:25. doi: 10.1186/s13015-017-0116-x. eCollection 2017.

DOI:10.1186/s13015-017-0116-x

PMID:29026435

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5622559/

Abstract

BACKGROUND

Multiple sequence alignment (MSA) plays a key role in biological sequence analysis, especially in phylogenetic tree construction. The rapid growth of next-generation sequencing data has outpaced the available tools, leaving a shortage of efficient approaches for aligning ultra-large sets of biological sequences of different types.

METHODS

Distributed and parallel computing is a crucial technique for accelerating ultra-large (e.g. files larger than 1 GB) sequence analyses. Building on HAlign and the Spark distributed computing system, we implemented HAlign-II, a cost-efficient and time-efficient tool for ultra-large multiple biological sequence alignment and phylogenetic tree construction.
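
The abstract does not spell out the algorithm, so the following is only an illustrative sketch, not the authors' implementation. It assumes the centre star strategy named in the title of the original HAlign paper: every sequence is aligned pairwise against one centre sequence, and the pairwise results are merged into a single alignment. Because the pairwise alignments are independent, they can be mapped in parallel; here a local thread pool stands in for a Spark cluster, so it shows the task structure rather than any real speed-up. The function names, the scoring values, and the choice of the first sequence as the centre are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align a and b; return both strings with '-' for gaps."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def center_star(seqs):
    """Centre star MSA; the first sequence is the centre (an assumption)."""
    center, others = seqs[0], seqs[1:]
    # Map step: independent pairwise alignments, run on a thread pool here.
    with ThreadPoolExecutor() as pool:
        pairs = list(pool.map(lambda s: needleman_wunsch(center, s), others))

    # gaps[k] = widest gap run any pairwise alignment opens before centre
    # position k (k == len(center) is the trailing region).
    gaps = [0] * (len(center) + 1)
    for aligned_center, _ in pairs:
        k = run = 0
        for ch in aligned_center:
            if ch == "-":
                run += 1
            else:
                gaps[k] = max(gaps[k], run); k += 1; run = 0
        gaps[k] = max(gaps[k], run)

    def expand(aligned_center, aligned_seq):
        # Pad one pairwise result so it fits the shared gap profile.
        out, k, run = [], 0, 0
        for c, s in zip(aligned_center, aligned_seq):
            if c == "-":
                out.append(s); run += 1
            else:
                out.append("-" * (gaps[k] - run)); out.append(s)
                k += 1; run = 0
        out.append("-" * (gaps[k] - run))
        return "".join(out)

    return [expand(center, center)] + [expand(c, s) for c, s in pairs]


if __name__ == "__main__":
    for row in center_star(["ACGTACGT", "ACGACGT", "ACGTTACGT", "CGTACG"]):
        print(row)
```

In a distributed setting, the `pool.map` call is the part that would be replaced by a map over a partitioned collection of sequences; the merge step only needs the per-position gap maxima, which can be combined with a reduce.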

RESULTS

Experiments on large-scale DNA and protein data sets (files larger than 1 GB) showed that HAlign-II saves both time and space and outperforms current software tools. HAlign-II can efficiently carry out MSA and construct phylogenetic trees for ultra-large numbers of biological sequences. It shows extremely high memory efficiency and scales well as computing resources increase.

CONCLUSIONS

HAlign-II provides a user-friendly web server based on our distributed computing infrastructure. The open-source code and datasets for HAlign-II are available at http://lab.malab.cn/soft/halign.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b057/5622559/284c034c074e/13015_2017_116_Fig1_HTML.jpg
