Center for Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA.
BMC Bioinformatics. 2010 May 18;11:259. doi: 10.1186/1471-2105-11-259.
Large comparative genomics studies and tools are becoming increasingly compute-expensive as the number of available genome sequences continues to rise. The capacity and cost of local computing infrastructures are likely to become prohibitive as that number grows, especially as the breadth of questions continues to expand. Alternative computing architectures, in particular cloud computing environments, may help alleviate this increasing pressure and enable fast, large-scale, and cost-effective comparative genomics strategies going forward. To test this possibility, we redesigned a typical comparative genomics algorithm, the reciprocal smallest distance algorithm (RSD), to run within Amazon's Elastic Compute Cloud (EC2). We then employed the RSD-cloud for ortholog calculations across a wide selection of fully sequenced genomes.
We ran more than 300,000 RSD-cloud processes within EC2. These jobs were farmed out simultaneously to 100 high-capacity compute nodes using Amazon Web Services Elastic MapReduce and included a wide mix of large and small genomes. The computation took just under 70 hours and cost a total of $6,302 USD.
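As a rough back-of-envelope reading of these figures, the sketch below derives per-node-hour and per-process costs. It assumes that all 100 nodes ran for the full 70 hours and that the $6,302 total is spread evenly over the run; neither assumption is stated in the abstract, and the total may include charges other than compute time. The function name `summarize_run` is hypothetical. Because the abstract reports "more than 300,000" processes and "just under 70 hours", the results are approximate bounds rather than exact values.

```python
def summarize_run(total_cost_usd, n_processes, n_nodes, wall_hours):
    """Derive simple cost and throughput ratios from reported run totals.

    Assumes every node was busy for the whole wall-clock time, which is
    an illustrative simplification, not something the abstract confirms.
    """
    node_hours = n_nodes * wall_hours
    return {
        "node_hours": node_hours,
        "usd_per_node_hour": total_cost_usd / node_hours,
        "usd_per_process": total_cost_usd / n_processes,
        "processes_per_node_hour": n_processes / node_hours,
    }


# Figures taken from the abstract, rounded to the stated bounds.
summary = summarize_run(
    total_cost_usd=6302,
    n_processes=300_000,
    n_nodes=100,
    wall_hours=70,
)

for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

Under these assumptions the run comes to 7,000 node-hours, about $0.90 per node-hour, and about two cents per RSD-cloud process.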
The effort to move existing comparative genomics algorithms from local compute infrastructures to the cloud is not trivial. However, the speed and flexibility of cloud computing environments provide a substantial boost at a manageable cost. The procedure designed to transform the RSD algorithm into a cloud-ready application is readily adaptable to similar comparative genomics problems.