

HAlign-II: efficient ultra-large multiple sequence alignment and phylogenetic tree reconstruction with distributed and parallel computing.

Authors

Wan Shixiang, Zou Quan

Affiliations

School of Computer Science and Technology, Tianjin University, Tianjin, China.

Guangdong Province Key Laboratory of Popular High Performance Computers, Shenzhen University, Shenzhen, China.

Publication information

Algorithms Mol Biol. 2017 Sep 29;12:25. doi: 10.1186/s13015-017-0116-x. eCollection 2017.

Abstract

BACKGROUND

Multiple sequence alignment (MSA) plays a key role in biological sequence analysis, especially in phylogenetic tree construction. The rapid growth of next-generation sequencing data has exposed a shortage of efficient alignment approaches that can handle ultra-large sets of biological sequences of different types.

METHODS

Distributed and parallel computing is a crucial technique for accelerating ultra-large (e.g. files larger than 1 GB) sequence analyses. Building on HAlign and the Spark distributed computing system, we implemented HAlign-II, a highly cost-efficient and time-efficient tool for ultra-large multiple biological sequence alignment and phylogenetic tree construction.
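As a rough illustration of the map-style parallelism that a framework such as Spark offers, the sketch below aligns every sequence against one fixed reference sequence, with each pairwise alignment computed independently. It is not HAlign-II's implementation: using the first sequence as the reference, the scoring values, and the names `align_pair` and `align_all` are all assumptions made for this example, and a local thread pool stands in for a cluster.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial

# Illustrative scoring values (assumptions, not taken from the paper).
MATCH, MISMATCH, GAP = 1, -1, -2


def _sub(a, b):
    """Substitution score for aligning characters a and b."""
    return MATCH if a == b else MISMATCH


def align_pair(center, seq):
    """Needleman-Wunsch global alignment of seq against center.

    Returns (aligned_center, aligned_seq, score).
    """
    n, m = len(center), len(seq)
    # score[i][j] = best score for aligning center[:i] with seq[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * GAP
    for j in range(1, m + 1):
        score[0][j] = j * GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + _sub(center[i - 1], seq[j - 1]),
                score[i - 1][j] + GAP,  # gap in seq
                score[i][j - 1] + GAP,  # gap in center
            )
    # Trace back from the bottom-right cell to recover one optimal alignment.
    a, b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                score[i][j] == score[i - 1][j - 1] + _sub(center[i - 1], seq[j - 1])):
            a.append(center[i - 1])
            b.append(seq[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + GAP:
            a.append(center[i - 1])
            b.append("-")
            i -= 1
        else:
            a.append("-")
            b.append(seq[j - 1])
            j -= 1
    return "".join(reversed(a)), "".join(reversed(b)), score[n][m]


def align_all(sequences, workers=4):
    """Align every sequence to the first one; each task is independent."""
    center = sequences[0]  # assumption: the first sequence is the reference
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in input order, like a distributed map step.
        return list(pool.map(partial(align_pair, center), sequences[1:]))


if __name__ == "__main__":
    for ref, aligned, s in align_all(["ACGT", "AGT", "ACGTT"]):
        print(ref, aligned, s)
```

Because each call to `align_pair` depends only on the reference and one other sequence, the same function could in principle be passed to a Spark RDD `map` so that the pairwise alignments run across worker nodes. Merging the pairwise results into a single multiple alignment is left out of this sketch.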

RESULTS

Experiments on large-scale DNA and protein data sets, each larger than 1 GB, showed that HAlign-II saves both time and space and outperforms current software tools. HAlign-II can efficiently carry out MSA and construct phylogenetic trees with ultra-large numbers of biological sequences. It shows extremely high memory efficiency and scales well as computing resources increase.

CONCLUSIONS

HAlign-II provides a user-friendly web server based on our distributed computing infrastructure. HAlign-II, with open-source code and datasets, is available at http://lab.malab.cn/soft/halign.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b057/5622559/284c034c074e/13015_2017_116_Fig1_HTML.jpg
