Piotrowski M, McGilvary G A, Sloan T M, Mewissen M, Lloyd A D, Forster T, Mitchell L, Ghazal P, Hill J
EPCC, The University of Edinburgh, Edinburgh, United Kingdom.
Methods Inf Med. 2013;52(1):80-90. doi: 10.3414/ME11-02-0039. Epub 2012 Dec 7.
Advances in DNA microarray devices and next-generation massively parallel DNA sequencing platforms have led to exponential growth in data availability, but exploiting the resulting opportunities requires adequate computing resources. High Performance Computing (HPC) in the Cloud offers an affordable way of meeting this need.
Bioconductor, a popular tool for high-throughput genomic data analysis, is distributed as add-on modules for the R statistical programming language, but R has no native capabilities for exploiting multi-processor architectures. SPRINT is an R package that gives genomics researchers easy access to HPC. This paper investigates three questions: how to set up and run SPRINT-enabled genomic analyses on Amazon's Elastic Compute Cloud (EC2), what advantages arise from submitting applications to EC2 from different parts of the world, and whether resource underutilization can improve application performance.
The SPRINT parallel implementations of correlation, permutation testing, partitioning around medoids and the multi-purpose papply have been benchmarked on data sets of various sizes on Amazon EC2. Jobs have been submitted from both the UK and Thailand to investigate differences in cost.
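To make the idea of a parallel correlation concrete, the following is a minimal sketch of one common decomposition: the rows of the input matrix are split into blocks and each worker computes the correlations for its block. It is not SPRINT's implementation (SPRINT is an R package); the function names `pearson`, `correlation_block` and `parallel_correlation` are hypothetical, and a thread pool is used only to keep the example self-contained. For real CPU-bound work in Python a process pool would be the more likely choice.

```python
from concurrent.futures import ThreadPoolExecutor
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_block(rows, start, stop):
    """Correlate rows[start:stop] against every row (one worker's share)."""
    return [[pearson(rows[i], other) for other in rows]
            for i in range(start, stop)]


def parallel_correlation(rows, workers=2):
    """Hypothetical helper: row-block decomposition of a correlation matrix."""
    n = len(rows)
    step = max(1, -(-n // workers))  # ceiling division, at least one row
    bounds = [(s, min(s + step, n)) for s in range(0, n, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each block is independent, so workers never need to communicate.
        blocks = list(pool.map(lambda b: correlation_block(rows, *b), bounds))
    # Blocks come back in submission order; flatten them into one matrix.
    return [row for block in blocks for row in block]


if __name__ == "__main__":
    data = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1]]
    for row in parallel_correlation(data, workers=2):
        print(row)
```

Because every block is computed independently, this kind of workload tends to scale well with the number of workers, whereas algorithms with more serial or communication-heavy steps generally scale less well.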
It is possible to obtain good, scalable performance, but the level of improvement depends on the nature of the algorithm. Resource underutilization can further improve the time to result. The end user's location affects costs due to factors such as local taxation.
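One standard way to reason about why the achievable improvement depends on the algorithm is Amdahl's law, which bounds speedup by the fraction of the work that can run in parallel. The sketch below is a general illustration and is not taken from the paper; the function name `amdahl_speedup` is hypothetical and the fractions used are illustrative values, not measurements of SPRINT.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Upper bound on speedup when only part of the work is parallelizable."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Illustrative fractions only: a nearly fully parallel task versus one
    # that is only half parallelizable, each on 8 workers.
    for fraction in (0.99, 0.5):
        print(fraction, round(amdahl_speedup(fraction, 8), 2))
```

Under this model, an algorithm whose work is almost entirely independent approaches linear speedup, while one with a large serial portion gains little from additional processors.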
Although not designed to satisfy HPC requirements, Amazon EC2 and cloud computing in general provide an interesting alternative and open up new possibilities for smaller organisations with limited funds.