EPA-ng: Massively Parallel Evolutionary Placement of Genetic Sequences.

Author information

Heidelberg Institute for Theoretical Studies, Schloss-Wolfsbrunnenweg 35, 69118 Heidelberg, Germany.

Department of Computer Engineering, University of A Coruña, 15071 A Coruña, Spain.

Publication information

Syst Biol. 2019 Mar 1;68(2):365-369. doi: 10.1093/sysbio/syy054.

Abstract

Next generation sequencing (NGS) technologies have led to a ubiquity of molecular sequence data. This data avalanche is particularly challenging in metagenetics, which focuses on taxonomic identification of sequences obtained from diverse microbial environments. Phylogenetic placement methods determine how these sequences fit into an evolutionary context. Previous implementations of phylogenetic placement algorithms, such as the evolutionary placement algorithm (EPA) included in RAxML, or PPLACER, are being increasingly used for this purpose. However, due to the steady progress in NGS technologies, the current implementations face substantial scalability limitations. Herein, we present EPA-NG, a complete reimplementation of the EPA that is substantially faster, offers a distributed memory parallelization, and integrates concepts from both RAxML-EPA and PPLACER. EPA-NG can be executed on standard shared memory systems, as well as on distributed memory systems (e.g., computing clusters). To demonstrate the scalability of EPA-NG, we placed 1 billion metagenetic reads from the Tara Oceans Project onto a reference tree with 3748 taxa in just under 7 h, using 2048 cores. Our performance assessment shows that EPA-NG outperforms RAxML-EPA and PPLACER by up to a factor of 30 in sequential execution mode, while attaining comparable parallel efficiency on shared memory systems. We further show that the distributed memory parallelization of EPA-NG scales well up to 2048 cores. EPA-NG is available under the AGPLv3 license: https://github.com/Pbdas/epa-ng.
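To put the headline scalability figure in perspective, the short Python sketch below turns the numbers quoted in the abstract (1 billion reads, just under 7 h, 2048 cores) into throughput estimates. It is a back-of-the-envelope illustration, not part of EPA-NG: the function names are hypothetical, and because the abstract only bounds the runtime from above, the throughput values are lower bounds and the core-hour value is an upper bound. The `parallel_efficiency` helper uses the standard textbook definition (serial time divided by cores times parallel time); the abstract reports no timings for it, so the sample inputs are made up.

```python
# Figures quoted in the abstract; the runtime is an upper bound ("just under 7 h").
N_READS = 1_000_000_000
MAX_HOURS = 7.0
CORES = 2048


def aggregate_throughput(n_reads: float, hours: float) -> float:
    """Reads placed per second across the whole machine."""
    return n_reads / (hours * 3600.0)


def per_core_throughput(n_reads: float, hours: float, cores: int) -> float:
    """Reads placed per second per core."""
    return aggregate_throughput(n_reads, hours) / cores


def core_hours(hours: float, cores: int) -> float:
    """Total compute budget consumed, in core-hours."""
    return hours * cores


def parallel_efficiency(t_serial: float, t_parallel: float, cores: int) -> float:
    """Standard definition: speedup divided by the number of cores."""
    return t_serial / (cores * t_parallel)


if __name__ == "__main__":
    print(f"aggregate: >= {aggregate_throughput(N_READS, MAX_HOURS):,.0f} reads/s")
    print(f"per core:  >= {per_core_throughput(N_READS, MAX_HOURS, CORES):.2f} reads/s")
    print(f"budget:    <= {core_hours(MAX_HOURS, CORES):,.0f} core-hours")
    # Hypothetical timings, only to show how the efficiency metric is computed.
    print(f"efficiency example: {parallel_efficiency(100.0, 12.5, 10):.2f}")
```

Under these assumptions the run corresponds to at least roughly 39,700 placements per second overall, or about 19 per second per core, within a budget of at most 14,336 core-hours.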
