Kovatch Patricia, Costa Anthony, Giles Zachary, Fluder Eugene, Cho Hyung Min, Mazurkova Svetlana
Icahn School of Medicine at Mount Sinai, 1 Gustave L. Levy Place, New York, NY 10029, 212-241-6500.
SC Conf Proc. 2015 Nov;2015. doi: 10.1145/2807591.2807595.
As personalized medicine becomes more integrated into healthcare, the rate at which human genomes are being sequenced is rising quickly, with a concomitant acceleration in compute and storage requirements. To achieve the most effective solution for genomic workloads without re-architecting the industry-standard software, we performed a rigorous analysis of usage statistics, benchmarks and available technologies to design a system for maximum throughput. We share our experiences designing a system optimized for the "Genome Analysis ToolKit (GATK) Best Practices" whole genome DNA and RNA pipeline, based on an evaluation of compute, workload and I/O characteristics. The characteristics of genomic workloads are vastly different from those of traditional HPC workloads, requiring different configurations of the scheduler and the I/O subsystem to achieve reliability, performance and scalability. By understanding how our researchers and clinicians work, we were able to employ techniques that not only speed up their workflow, yielding improved and repeatable performance, but also make more efficient use of storage and compute resources.