College of Computer, National University of Defense Technology, Changsha 410073, China.
National Supercomputing Center of Guangzhou, Guangzhou 510006, China.
Molecules. 2017 Dec 1;22(12):2116. doi: 10.3390/molecules22122116.
Big data, cloud computing, and high-performance computing (HPC) are on the verge of convergence. Cloud computing already plays an active part in big data processing through frameworks such as Hadoop and Spark. The recent upsurge of high-performance computing in China provides additional capacity to address the challenges associated with big data. In this paper, we propose Orion, a big data interface on the Tianhe-2 supercomputer, which enables big data applications to run on Tianhe-2 via a single command or a shell script. Orion supports multiple users, and each user can launch multiple tasks. Through automated configuration, it minimizes the effort needed to launch big data applications on Tianhe-2. Orion follows the "allocate-when-needed" paradigm, which avoids idle occupation of computational resources. We tested the utility and performance of Orion on a large genomic dataset and achieved satisfactory performance on Tianhe-2 with very few modifications to existing applications implemented in Hadoop/Spark. In summary, Orion provides a practical and economical interface for big data processing on Tianhe-2.