Wan Shixiang, Zou Quan
School of Computer Science and Technology, Tianjin University, Tianjin, China.
Guangdong Province Key Laboratory of Popular High Performance Computers, Shenzhen University, Shenzhen, China.
Algorithms Mol Biol. 2017 Sep 29;12:25. doi: 10.1186/s13015-017-0116-x. eCollection 2017.
Multiple sequence alignment (MSA) plays a key role in biological sequence analysis, especially in phylogenetic tree construction. The rapid growth of next-generation sequencing data has exposed a shortage of efficient alignment approaches that can handle ultra-large sets of biological sequences of different types.
Distributed and parallel computing is a crucial technique for accelerating analyses of ultra-large sequence data (e.g. files larger than 1 GB). Building on HAlign and the Spark distributed computing system, we implemented HAlign-II, a highly cost-efficient and time-efficient tool for ultra-large multiple biological sequence alignment and phylogenetic tree construction.
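As a rough illustration of why this kind of workload distributes well, the sketch below aligns every sequence against a single reference sequence, so that each pairwise alignment is independent and can be handed to a separate worker. It is a minimal sketch under stated assumptions, not HAlign-II's code: the reference-based scheme, the function names `align_pair` and `align_all`, the scoring values, and the use of a local thread pool in place of Spark executors are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def align_pair(center, seq, match=1, mismatch=-1, gap=-1):
    """Global (Needleman-Wunsch) alignment of `seq` against `center`.

    Scoring values are illustrative assumptions, not taken from the paper.
    Returns the two gapped strings (center_aligned, seq_aligned).
    """
    n, m = len(center), len(seq)
    # score[i][j] = best score for aligning center[:i] with seq[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if center[i - 1] == seq[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align the two residues
                score[i - 1][j] + gap,      # gap in seq
                score[i][j - 1] + gap,      # gap in center
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    a, b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and center[i - 1] == seq[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            a.append(center[i - 1])
            b.append(seq[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            a.append(center[i - 1])
            b.append("-")
            i -= 1
        else:
            a.append("-")
            b.append(seq[j - 1])
            j -= 1
    return "".join(reversed(a)), "".join(reversed(b))


def align_all(center, sequences, workers=4):
    """Map step: align each sequence to `center` independently.

    A thread pool stands in for cluster executors here (an assumption for
    illustration); results come back in the same order as `sequences`.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda s: align_pair(center, s), sequences))


if __name__ == "__main__":
    for center_aln, seq_aln in align_all("ACGT", ["ACGT", "AGT", "ACGTT"]):
        print(center_aln)
        print(seq_aln)
        print()
```

Because no pairwise alignment depends on another, the same map step could in principle be spread across cluster nodes. In CPython a thread pool does not speed up this CPU-bound loop; it is used here only to show the structure of the computation.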
Experiments on large-scale DNA and protein data sets (files larger than 1 GB) showed that HAlign-II saves both time and space and outperforms current software tools. HAlign-II can efficiently carry out MSA and construct phylogenetic trees with ultra-large numbers of biological sequences. It shows extremely high memory efficiency and scales well as computing resources increase.
HAlign-II provides a user-friendly web server based on our distributed computing infrastructure. The open-source code and data sets are available at http://lab.malab.cn/soft/halign.