Scalable computing for evolutionary genomics.

Author information

Prins Pjotr, Belhachemi Dominique, Möller Steffen, Smant Geert

Affiliations

Laboratory of Nematology, Wageningen University, Wageningen, The Netherlands.

Publication information

Methods Mol Biol. 2012;856:529-45. doi: 10.1007/978-1-61779-585-5_22.

DOI:10.1007/978-1-61779-585-5_22

PMID:22399474

Abstract

Genomic data analysis in evolutionary biology is becoming so computationally intensive that analysis of multiple hypotheses and scenarios takes too long on a single desktop computer. In this chapter, we discuss techniques for scaling computations through parallelization of calculations, after giving a quick overview of advanced programming techniques. Unfortunately, parallel programming is difficult and requires special software design. The alternative, especially attractive for legacy software, is to introduce poor man's parallelization by running whole programs in parallel as separate processes, using job schedulers. Such pipelines are often deployed on bioinformatics computer clusters. Recent advances in PC virtualization have made it possible to run a full computer operating system, with all of its installed software, on top of another operating system, inside a "box," or virtual machine (VM). Such a VM can flexibly be deployed on multiple computers in a local network, e.g., on existing desktop PCs, and even in the Cloud, to create a "virtual" computer cluster. Many bioinformatics applications in evolutionary biology can be run in parallel, as processes in one or more VMs. Here, we show how a ready-made bioinformatics VM image, named BioNode, can be used to create a computing cluster and pipeline in a few steps. This allows researchers to scale up computations from their desktop, using available hardware, whenever it is required. BioNode is based on Debian Linux and can run on networked PCs and in the Cloud. Over 200 bioinformatics and statistical software packages of interest to evolutionary biology are included, such as PAML, Muscle, MAFFT, MrBayes, and BLAST. Most of these software packages are maintained through the Debian Med project. In addition, BioNode contains convenient configuration scripts for parallelizing bioinformatics software. Where Debian Med encourages packaging free and open source bioinformatics software through one central project, BioNode encourages creating free and open source VM images, for multiple targets, through one central project. BioNode can be deployed on Windows, OS X, Linux, and in the Cloud. In addition to the downloadable BioNode images, we provide tutorials online, which enable bioinformaticians to install and run BioNode in different environments, as well as information for future initiatives on creating and building such images.
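
The "poor man's parallelization" mentioned in the abstract, i.e., running whole programs in parallel as separate processes, might be sketched roughly as follows. This is a minimal illustration, not the chapter's or BioNode's actual tooling: the function names (`run_job`, `run_jobs_in_parallel`, `make_placeholder_commands`) and the input file names are hypothetical, and a placeholder command that only prints its argument stands in for a real tool such as an aligner. On a cluster, a job scheduler would take the place of the local worker pool.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_job(command):
    """Run one whole program as a separate OS process and capture its output."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode, result.stdout.strip()


def run_jobs_in_parallel(commands, max_workers=4):
    """Run independent commands concurrently; results keep the input order."""
    # The threads only wait on child processes, so the actual work is done
    # by the separate processes, not by this Python interpreter.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_job, commands))


def make_placeholder_commands(names):
    """Build one command per input (hypothetical stand-in for a real tool)."""
    script = "import sys; print('processed', sys.argv[1])"
    return [[sys.executable, "-c", script, name] for name in names]


if __name__ == "__main__":
    # Hypothetical input names; each one is handled by its own process.
    jobs = make_placeholder_commands(["geneA.fa", "geneB.fa", "geneC.fa"])
    for returncode, output in run_jobs_in_parallel(jobs, max_workers=3):
        print(returncode, output)
```

Because each job is an unmodified program with its own input, no changes to the program itself are needed, which is why this approach suits legacy software.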