Afgan Enis, Sloggett Clare, Goonasekera Nuwan, Makunin Igor, Benson Derek, Crowe Mark, Gladman Simon, Kowsar Yousef, Pheasant Michael, Horst Ron, Lonie Andrew
Victorian Life Sciences Computation Initiative (VLSCI), University of Melbourne, Melbourne, Victoria, Australia; Department of Biology, Johns Hopkins University, Baltimore, Maryland, United States of America; Centre for Computing and Informatics (CIR), Rudjer Boskovic Institute (RBI), Zagreb, Croatia.
Victorian Life Sciences Computation Initiative (VLSCI), University of Melbourne, Melbourne, Victoria, Australia.
PLoS One. 2015 Oct 26;10(10):e0140829. doi: 10.1371/journal.pone.0140829. eCollection 2015.
Analyzing high-throughput genomics data is a complex and compute-intensive task, generally requiring numerous software tools and large reference data sets, tied together in successive stages of data transformation and visualisation. A computational platform enabling best-practice genomics analysis ideally meets a number of requirements, including: a wide range of analysis and visualisation tools, closely linked to large user and reference data sets; workflow platform(s) enabling accessible, reproducible, portable analyses, through a flexible set of interfaces; highly available, scalable computational resources; and flexibility and versatility in the use of these resources to meet the demands and expertise of a variety of users. Obtaining access to an appropriate computational platform can be a significant barrier for researchers, as establishing such a platform requires a large upfront investment in hardware, experience, and expertise.
We designed and implemented the Genomics Virtual Laboratory (GVL) as a middleware layer of machine images, cloud management tools, and online services that enable researchers to build arbitrarily sized compute clusters on demand, pre-populated with fully configured bioinformatics tools, reference data sets, and workflow and visualisation options. The platform is flexible in that users can conduct analyses through web-based (Galaxy, RStudio, IPython Notebook) or command-line interfaces, and add or remove compute nodes and data resources as required. Best-practice tutorials and protocols provide a path from introductory training to practice. The GVL is available on the OpenStack-based Australian Research Cloud (http://nectar.org.au) and the Amazon Web Services cloud. The principles, implementation, and build process are designed to be cloud-agnostic.
This paper provides a blueprint for the design and implementation of a cloud-based Genomics Virtual Laboratory. We discuss scope, design considerations and technical and logistical constraints, and explore the value added to the research community through the suite of services and resources provided by our implementation.