Joint Institute for Computational Sciences, The University of Tennessee, Oak Ridge National Laboratory, 1 Bethel Valley Rd, Oak Ridge, TN 37831-6173, USA.
BMC Bioinformatics. 2013;14(Suppl 9):S3. doi: 10.1186/1471-2105-14-S9-S3. Epub 2013 Jun 28.
We focus on fast data analysis and retrieval in bioinformatics, work that will have a direct impact on the quality of human health and the environment. The exponential growth of data generated in biology research, at scales from atoms to ecosystems, demands an increasingly large computational component for analysis. Novel DNA sequencing technologies and complementary high-throughput approaches (such as proteomics, genomics, metabolomics, and metagenomics) drive data-intensive bioinformatics. Individual research centers or universities could once provide the computing for these applications, but this is no longer the case. Today, only specialized national centers can deliver the level of computing resources required to meet the challenges posed by rapid data growth and the resulting computational demand. Consequently, we are developing massively parallel applications to analyze the growing flood of biological data and contribute to the rapid discovery of novel knowledge.
Previous National Science Foundation (NSF) projects produced parallel modules for widely used bioinformatics applications on the Kraken supercomputer. We have profiled and optimized the code of some of the scientific community's most widely used desktop and small-cluster applications, including BLAST from the National Center for Biotechnology Information (NCBI), HMMER, and MUSCLE; scaled them to tens of thousands of cores on high-performance computing (HPC) architectures; made them robust and portable to next-generation architectures; and incorporated these parallel applications into science gateways with a web-based portal.
This paper discusses the developmental stages, challenges, and solutions involved in taking bioinformatics applications from the desktop to petascale, with a front-end portal for very-large-scale data analysis in the life sciences.
This research will help to bridge the gap between the rate of data generation and the speed at which scientists can study these data. The ability to rapidly analyze data at such a large scale is having a significant, direct impact on the science of collaborators who are currently using these tools on supercomputers.